Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 50

Search results for: Large Data

50 Multi-Level Air Quality Classification in China Using Information Gain and Support Vector Machine

Authors: Bingchun Liu, Pei-Chann Chang, Natasha Huang, Dun Li

Abstract:

Machine learning and data mining are two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a widely used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries, especially China, air quality classification has become one of the most important topics in air quality research and modelling. This study introduces a hybrid classification model based on information theory and Support Vector Machine (SVM), using the air quality data of four cities in China, namely Beijing, Guangzhou, Shanghai and Tianjin, from January 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection classifies daily air quality into six levels, namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent, based on their respective Air Quality Index (AQI) values. Using information theory, information gain (IG) is calculated and feature selection is performed for both categorical features and continuous numeric features. The SVM algorithm is then applied to the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM alone, Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.
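As a rough illustration of the information gain step mentioned in this abstract, the sketch below computes IG for a categorical feature against a class label. The feature names and toy values are hypothetical and are not taken from the paper; the authors' handling of continuous features and the SVM stage are not reproduced.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: two candidate features and a daily air quality level.
wind = ["calm", "calm", "windy", "windy"]
season = ["winter", "summer", "winter", "summer"]
level = ["Moderate Pollution", "Moderate Pollution", "Good", "Good"]

ig_wind = information_gain(wind, level)      # wind separates the levels fully
ig_season = information_gain(season, level)  # season carries no information
```

In a feature selection step of this kind, features would be ranked by their gain and the low-scoring ones dropped before training the classifier.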

Keywords: Machine learning, air quality classification, air quality index, information gain, support vector machine, cross-validation.

Downloads: 331
49 FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Authors: Lu Si, Jie Yu, Shasha Li, Jun Ma, Lei Luo, Qingbo Wu, Yongqi Ma, Zhengji Liu

Abstract:

Instance selection (IS) techniques are used to reduce data size in order to improve the performance of data mining methods. Recently, to process very large data sets, several proposed methods divide the training set into disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitations of these methods and give our viewpoint on how to divide and conquer in the IS procedure. Then, based on the fast condensed nearest neighbor (FCNN) rule, we propose an instance selection method for large data sets using the MapReduce framework. Besides ensuring prediction accuracy and reduction rate, it has two desirable properties: first, it reduces the workload on the aggregation node; second and most important, it produces the same result as the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical.

Keywords: Instance selection, data reduction, MapReduce, kNN.

Downloads: 574
48 Predicting Groundwater Areas Using Data Mining Techniques: Groundwater in Jordan as Case Study

Authors: Faisal Aburub, Wael Hadi

Abstract:

Data mining is the process of extracting useful or hidden information from a large database. Extracted information can be used to discover relationships among features, where data objects are grouped according to logical relationships, or to assign unseen objects to one of the predefined groups. In this paper, we investigate four well-known data mining algorithms in order to predict groundwater areas in Jordan. These algorithms are Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbor (kNN) and Classification Based on Association Rule (CBA). The experimental results indicate that the SVM algorithm outperformed the other algorithms in terms of the classification accuracy, precision and F1 evaluation measures, using datasets of groundwater areas collected from the Jordanian Ministry of Water and Irrigation.

Keywords: Classification, data mining, evaluation measures, groundwater.

Downloads: 1473
47 Investigation of Regional Differences in Strong Ground Motions for the Iranian Plateau

Authors: Farhad Sedaghati, Shahram Pezeshk

Abstract:

Regional variations in strong ground motions for the Iranian Plateau have been investigated by using a simple statistical method called Analysis of Variance (ANOVA). In this respect, a large database consisting of 1157 records occurring within the Iranian Plateau with moment magnitudes of greater than or equal to 5 and Joyner-Boore distances up to 200 km has been considered. Geometric averages of horizontal peak ground accelerations (PGA) as well as 5% damped linear elastic response spectral accelerations (SA) at periods of 0.2, 0.5, 1.0, and 2.0 sec are used as strong motion parameters. The initial database is divided into two different datasets, for Northern Iran (NI) and Central and Southern Iran (CSI). The comparison between strong ground motions of these two regions reveals that there is no evidence for significant differences; therefore, data from these two regions may be combined to estimate the unknown coefficients of attenuation relationships.
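The one-way ANOVA comparison described in this abstract can be sketched as below. The two groups stand in for a strong motion parameter (for example, log PGA) in the NI and CSI datasets; the numbers are hypothetical placeholders, not values from the paper, and the sketch only computes the F statistic without the significance test.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of sample groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy samples for two regions.
northern = [1.0, 2.0, 3.0]
central_southern = [2.0, 3.0, 4.0]
f_stat = one_way_anova_f([northern, central_southern])
```

A small F value relative to the critical value of the F distribution (for the given degrees of freedom) would be consistent with the abstract's conclusion that the regions can be pooled.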

Keywords: ANOVA, attenuation relationships, Iranian Plateau, PGA, regional variation, SA, strong ground motion.

Downloads: 889
46 Estimating Bridge Deterioration for Small Data Sets Using Regression and Markov Models

Authors: Yina F. Muñoz, Alexander Paz, Hanns De La Fuente-Mella, Joaquin V. Fariña, Guilherme M. Sales

Abstract:

The primary approach for estimating bridge deterioration uses Markov-chain models and regression analysis. Traditional Markov models have problems in estimating the required transition probabilities when a small sample size is used. Often, reliable bridge data have not been taken over large periods, thus large data sets may not be available. This study presents an important change to the traditional approach by using the Small Data Method to estimate transition probabilities. The results illustrate that the Small Data Method and traditional approach both provide similar estimates; however, the former method provides results that are more conservative. That is, Small Data Method provided slightly lower than expected bridge condition ratings compared with the traditional approach. Considering that bridges are critical infrastructures, the Small Data Method, which uses more information and provides more conservative estimates, may be more appropriate when the available sample size is small. In addition, regression analysis was used to calculate bridge deterioration. Condition ratings were determined for bridge groups, and the best regression model was selected for each group. The results obtained were very similar to those obtained when using Markov chains; however, it is desirable to use more data for better results.
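For context, a minimal sketch of the traditional counting estimate of Markov transition probabilities from inspection histories is given below. It is not the Small Data Method of the paper, whose details the abstract does not state; the condition states and inspection sequences are hypothetical.

```python
def estimate_transition_matrix(sequences, n_states):
    """Maximum-likelihood transition probabilities from observed state sequences."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Count each observed move from one inspection to the next.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations are left as zeros rather than guessed.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical condition ratings per inspection (0 = best, 2 = worst).
inspections = [
    [0, 0, 1, 1, 2],
    [0, 1, 1, 2, 2],
]
transition = estimate_transition_matrix(inspections, n_states=3)
```

With so few observed transitions per row, the estimates are unstable, which is the small-sample problem the abstract refers to.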

Keywords: Concrete bridges, deterioration, Markov chains, probability matrix.

Downloads: 813
45 Applying Hybrid Graph Drawing and Clustering Methods on Stock Investment Analysis

Authors: Mouataz Zreika, Maria Estela Varua

Abstract:

Stock investment decisions are often made based on current events in the global economy and the analysis of historical data. Visual representation could help investors gain a deeper understanding of, and better insight into, stock market trends more efficiently. The trend analysis is based on long-term data collection. The study adopts a hybrid method that combines a clustering algorithm and a force-directed algorithm to overcome the scalability problem when visualizing large data. This method reveals the potential relationships between stocks and determines the degree of strength and connectivity, which gives investors another view of stock relationships for reference. Information derived from the visualization will also help them make informed decisions. The results of the experiments show that the proposed method is able to produce aesthetically pleasing visualizations that provide clearer views of connectivity and edge weights.

Keywords: Clustering, force-directed, graph drawing, stock investment analysis.

Downloads: 1218
44 Development of an Automatic Calibration Framework for Hydrologic Modelling Using Approximate Bayesian Computation

Authors: A. Chowdhury, P. Egodawatta, J. M. McGree, A. Goonetilleke

Abstract:

Hydrologic models are increasingly used as tools to predict stormwater quantity and quality from urban catchments. However, due to a range of practical issues, most models produce gross errors in simulating complex hydraulic and hydrologic systems. Difficulty in finding a robust approach for model calibration is one of the main issues. Though automatic calibration techniques are available, they are rarely used in common commercial hydraulic and hydrologic modelling software e.g. MIKE URBAN. This is partly due to the need for a large number of parameters and large datasets in the calibration process. To overcome this practical issue, a framework for automatic calibration of a hydrologic model was developed in R platform and presented in this paper. The model was developed based on the time-area conceptualization. Four calibration parameters, including initial loss, reduction factor, time of concentration and time-lag were considered as the primary set of parameters. Using these parameters, automatic calibration was performed using Approximate Bayesian Computation (ABC). ABC is a simulation-based technique for performing Bayesian inference when the likelihood is intractable or computationally expensive to compute. To test the performance and usefulness, the technique was used to simulate three small catchments in Gold Coast. For comparison, simulation outcomes from the same three catchments using commercial modelling software, MIKE URBAN were used. The graphical comparison shows strong agreement of MIKE URBAN result within the upper and lower 95% credible intervals of posterior predictions as obtained via ABC. Statistical validation for posterior predictions of runoff result using coefficient of determination (CD), root mean square error (RMSE) and maximum error (ME) was found reasonable for three study catchments. 
The main benefit of using ABC over MIKE URBAN is that ABC provides a posterior distribution for runoff flow prediction, and therefore the associated uncertainty in predictions can be obtained. In contrast, MIKE URBAN provides only a point estimate. Based on the results of the analysis, it appears that the developed ABC framework performs well for automatic calibration.
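The idea of ABC can be sketched with a basic rejection sampler. The example below is written in Python for illustration (the paper's framework was developed in R) and uses a hypothetical one-parameter runoff model, runoff = coefficient × rainfall plus noise, rather than the time-area model with four calibration parameters described in the abstract. All names, the prior, and the tolerance are assumptions.

```python
import random
from math import sqrt


def simulate_runoff(coefficient, rainfall, rng, noise_sd=0.5):
    """Hypothetical simulator: linear runoff response with Gaussian noise."""
    return [coefficient * r + rng.gauss(0.0, noise_sd) for r in rainfall]


def rmse(a, b):
    """Root mean square error between two equally long series."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def abc_rejection(observed, rainfall, n_draws, tolerance, rng):
    """Keep prior draws whose simulated output lies close to the observations."""
    accepted = []
    for _ in range(n_draws):
        candidate = rng.uniform(0.0, 1.0)  # assumed uniform prior on [0, 1]
        simulated = simulate_runoff(candidate, rainfall, rng)
        if rmse(simulated, observed) < tolerance:
            accepted.append(candidate)
    return accepted


rng = random.Random(0)
rainfall = [10.0, 20.0, 5.0, 0.0, 15.0]
observed = [0.6 * r for r in rainfall]  # synthetic "observations"

posterior = sorted(abc_rejection(observed, rainfall, 5000, 1.0, rng))
posterior_mean = sum(posterior) / len(posterior)
# Approximate 95% credible interval from the accepted samples.
lower = posterior[int(0.025 * len(posterior))]
upper = posterior[int(0.975 * len(posterior))]
```

The accepted samples approximate the posterior without ever evaluating a likelihood, which is why the approach yields an interval rather than a single point estimate.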

Keywords: Automatic calibration framework, approximate Bayesian computation, hydrologic and hydraulic modelling, MIKE URBAN software, R platform.

Downloads: 1256
43 Mining Big Data in Telecommunications Industry: Challenges, Techniques, and Revenue Opportunity

Authors: Hoda A. Abdel Hafez

Abstract:

Mining big data represents a big challenge nowadays. Much research is concerned with mining massive amounts of data and big data streams. Mining big data faces many challenges, including scalability, speed, heterogeneity, accuracy, provenance and privacy. In the telecommunication industry, mining big data is like mining for gold; it represents a big opportunity for maximizing the revenue streams in this industry. This paper discusses the characteristics of big data (volume, variety, velocity and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunication, and the benefits and opportunities gained from them.

Keywords: Mining Big Data, Big Data, Machine learning, Data Streams, Telecommunication.

Downloads: 1452
42 Content-Based Color Image Retrieval Based On 2-D Histogram and Statistical Moments

Authors: Khalid Elasnaoui, Brahim Aksasse, Mohammed Ouanan

Abstract:

In this paper, we are interested in the problem of finding similar images in a large database. For this purpose we propose a new algorithm based on a combination of the 2-D histogram intersection in the HSV space and statistical moments. The proposed histogram is based on a 3x3 window and not only on the intensity of the pixel. This approach overcomes the drawback of the conventional 1-D histogram, which ignores the spatial distribution of pixels in the image, while the statistical moments are used to escape the effects of the discretisation of the color space, which is intrinsic to the use of histograms. We compare the performance of our new algorithm to various state-of-the-art methods and show that it has several advantages: it is fast, consumes little memory and requires no learning. To validate our results, we apply this algorithm to search for similar images in different image databases.
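As a simplified illustration of the histogram intersection measure named in this abstract, the sketch below compares flat, normalised histograms. The paper's 2-D histogram over a 3x3 window in HSV space and its statistical moments are not reproduced; the bin counts are hypothetical.

```python
def normalise(counts):
    """Scale raw bin counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalised histograms of equal length."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical bin counts for a query image and two database images.
query = normalise([4, 4, 0, 0])
same = normalise([8, 8, 0, 0])
shifted = normalise([0, 4, 4, 0])

score_same = histogram_intersection(query, same)
score_shifted = histogram_intersection(query, shifted)
```

Ranking database images by this score in descending order gives a simple retrieval order that needs no training step.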

Keywords: 2-D histogram, Statistical moments, Indexing, Similarity distance, Histograms intersection.

Downloads: 1288
41 Computational Model for Predicting Effective siRNA Sequences Using Whole Stacking Energy (ΔG) for Gene Silencing

Authors: Reena Murali, David Peter S.

Abstract:

The small interfering RNA (siRNA) alters the regulatory role of mRNA during gene expression by translational inhibition. Recent studies show that upregulation of mRNA causes serious diseases like cancer, so designing effective siRNA with good knockdown effects plays an important role in gene silencing. Various siRNA design tools have been developed earlier. In this work, we analyze the existing well-scoring second-generation siRNA prediction tools and optimize the efficiency of siRNA prediction by designing a computational model using an Artificial Neural Network and whole stacking energy (ΔG), which may help in gene silencing and drug design in cancer therapy. Our model is trained and tested against a large data set of siRNA sequences. Validation of our results is done by finding the correlation coefficient of experimental versus observed inhibition efficacy of siRNA. We achieved a correlation coefficient of 0.727 in our previous computational model and improved it to 0.753 when the threshold of whole stacking energy is greater than or equal to -32.5 kcal/mol.

Keywords: Artificial Neural Network, Double Stranded RNA, RNA Interference, Short Interfering RNA.

Downloads: 2249
40 Organization of the Purchasing Function for Innovation

Authors: Jasna Prester, Ivana Rašić Bakarić, Božidar Matijević

Abstract:

Innovations not only contribute to the competitiveness of a company but also have positive effects on revenues. On average, product innovations account for 14 percent of companies' sales. Innovation management has changed substantially during the last decade because of growing reliance on external partners. As a consequence, a new task for purchasing arises, as firms need to understand which suppliers actually have high potential to contribute to the innovativeness of the firm and which do not. Proper organization of the purchasing function is important since the majority of manufacturing companies deal with substantial material costs, which pass through the purchasing function. In the past the purchasing function was largely seen as a transaction-oriented, clerical function, but today purchasing is the intermediary with supply chain partners contributing to innovations, be it product or process innovations. Therefore, the purchasing function has to be organized differently to enable the firm's innovation potential. However, innovations are inherently risky. There are behavioral risks (that some partner will take advantage of the other party), technological risks in terms of the complexity of products, manufacturing processes and incoming materials, and finally market risks, which in fact judge the value of the innovation. These risks are investigated in this work. Specifically, technological risks, which deal with the complexity of products and processes, are investigated more thoroughly. Buying components or such high-end technologies necessitates careful investigation of technical features and is therefore usually conducted by a team of experts. It is therefore hypothesized that the higher the technological risk, the higher the centralization of the purchasing function as an interface with other supply chain members.
The main contribution of this research lies in the fact that the analysis was performed on a large data set of 1493 companies from 25 countries, collected in the GMRG 4 survey. Most analyses of the purchasing function are case studies of innovative firms. This study therefore contributes empirical evaluations that can be generalized.

Keywords: Purchasing function organization, innovation, technological risk, GMRG 4 survey.

Downloads: 3326
39 Identification of Nonlinear Systems Using Radial Basis Function Neural Network

Authors: C. Pislaru, A. Shebani

Abstract:

This paper uses the radial basis function neural network (RBFNN) for system identification of nonlinear systems. Five nonlinear systems are used to examine the performance of the RBFNN in modelling nonlinear systems: a dual tank system, a single tank system, a DC motor system, and two academic models. The feed-forward method is used for modelling the non-linear dynamic models. The K-means clustering algorithm is used to select the centers of the radial basis function network because it is reliable, offers fast convergence and can handle large data sets. The least mean square method is used to adjust the weights of the output layer, and the Euclidean distance method is used to determine the width of the Gaussian function.

Keywords: System identification, Nonlinear system, Neural networks, RBF neural network.

Downloads: 2359
38 Vehicle Type Classification with Geometric and Appearance Attributes

Authors: Ghada S. Moussa

Abstract:

With the increase in population along with economic prosperity, an enormous increase in the number and types of vehicles on the roads has occurred. This brings a growing need to classify vehicles into their corresponding categories efficiently yet effectively, which plays a crucial role in many areas of infrastructure planning and traffic management.

This paper presents two vehicle-type classification approaches: 1) geometric-based and 2) appearance-based. The two approaches are used for two tasks: multi-class and intra-class vehicle classification. To evaluate the performance of the proposed approaches and identify the most effective yet efficient one, the 10-fold cross-validation technique is used with a large dataset. The proposed approaches differ from previous research on vehicle classification in that i) they consider both geometric and appearance attributes of vehicles, and ii) they perform remarkably well in both multi-class and intra-class vehicle classification. Experimental results show promising potential for implementing the proposed vehicle classification approaches in real-world applications.

Keywords: Appearance attributes, Geometric attributes, Support vector machine, Vehicle classification.

Downloads: 3429
37 Identification of Coauthors in Scientific Database

Authors: Thiago M. R. Dias, Gray F. Moita

Abstract:

The analysis of scientific collaboration networks has contributed significantly to improving the understanding of how the process of collaboration between researchers works and of how the scientific production of researchers or research groups evolves. However, the identification of collaborations in large scientific databases is not a trivial task given the high computational cost of the methods commonly used. This paper proposes a method for identifying collaboration in a large database of researcher curricula. The proposed method has low computational cost with satisfactory results, proving to be an interesting alternative for the modeling and characterization of large scientific collaboration networks.

Keywords: Extraction and data integration, Information Retrieval, Scientific Collaboration.

Downloads: 1323
36 Clustering in WSN Based on Minimum Spanning Tree Using Divide and Conquer Approach

Authors: Uttam Vijay, Nitin Gupta

Abstract:

Due to the heavy energy constraints in WSNs, clustering is an efficient way to manage energy in sensors. Many clustering methods have already been proposed, and research is still ongoing to make clustering more energy efficient. In this paper we propose minimum spanning tree (MST) based clustering using a divide and conquer approach. MST-based clustering was first proposed in the 1970s for large databases. Here we take a divide and conquer approach and implement it for wireless sensor networks, subject to the constraints attached to sensor networks. The approach is implemented so that we do not have to construct the whole MST before clustering; instead, we find an edge that will be part of the MST of the corresponding graph and divide the graph into clusters at that point if the edge can be removed under certain constraints, thereby saving a lot of computation.
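For readers unfamiliar with MST-based clustering, the sketch below shows the classic variant: Kruskal's algorithm stopped once the desired number of components remains, which is equivalent to building the MST and deleting its heaviest edges. It is not the authors' divide and conquer method or their edge-removal constraints, which the abstract does not specify; the graph and the cluster count are hypothetical.

```python
def mst_clusters(n_nodes, edges, n_clusters):
    """Group nodes by adding the lightest edges until n_clusters components remain."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n_nodes
    for weight, u, v in sorted(edges):
        if components == n_clusters:
            break  # remaining (heavier) MST edges are the ones "cut"
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical sensor graph: (distance, node_a, node_b).
edges = [(1.0, 0, 1), (1.5, 1, 2), (9.0, 2, 3), (1.2, 3, 4), (8.0, 0, 4)]
clusters = mst_clusters(5, edges, n_clusters=2)
```

Stopping early means the full spanning tree is never built, which is in the same spirit as the computation saving the abstract describes.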

Keywords: Algorithm, Clustering, Edge-Weighted Graph, Weighted-LEACH.

Downloads: 1992
35 On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Authors: Xiaohua Liu, Juan F. Beltran, Nishant Mohanchandra, Godfried T. Toussaint

Abstract:

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.

Keywords: Machine learning, data mining, support vector machines, proximity graphs, relative-neighborhood graphs, k-nearest-neighbor graphs, random sampling, training data condensation.

Downloads: 1552
34 Multiclass Support Vector Machines for Environmental Sounds Classification Using log-Gabor Filters

Authors: S. Souli, Z. Lachiri

Abstract:

In this paper we propose a robust environmental sound classification approach based on spectrogram features derived from log-Gabor filters. This approach includes two methods. In the first method, the spectrograms are passed through an appropriate log-Gabor filter bank, and the outputs are averaged and then undergo an optimal feature selection procedure based on a mutual information criterion. The second method uses the same steps but applies them only to three patches extracted from each spectrogram.

To investigate the accuracy of the proposed methods, we conduct experiments using a large database containing 10 environmental sound classes. The classification results based on Multiclass Support Vector Machines show that the second method is the most efficient with an average classification accuracy of 89.62 %.

Keywords: Environmental sounds, Log-Gabor filters, Spectrogram, SVM Multiclass, Visual features.

Downloads: 1401
33 Parallel Algorithm for Numerical Solution of Three-Dimensional Poisson Equation

Authors: Alibek Issakhov

Abstract:

In this paper, a new algorithm for solving the three-dimensional Poisson equation is developed and implemented. This equation is used in research on turbulent mixing, computational fluid dynamics, atmospheric fronts, ocean flows and so on. Moreover, to raise the productivity of demanding calculations, an up-to-date and effective parallel programming technology, MPI in combination with OpenMP directives, was applied, which makes it possible to handle problems with very large data content. The resulting products can be used in solving important applied and fundamental problems in mathematics and physics.

Keywords: MPI, OpenMP, three dimensional Poisson equation

Downloads: 1313
32 NOHIS-Tree: High-Dimensional Index Structure for Similarity Search

Authors: Mounira Taileb, Sami Touati

Abstract:

In Content-Based Image Retrieval systems it is important to use an efficient indexing technique in order to perform and accelerate the search in huge databases. The indexing technique used should also support the high dimensionality of image features. In this paper we present the hierarchical index NOHIS-tree (Non Overlapping Hierarchical Index Structure) and its behaviour when scaling up to very large databases. We also present a study of the influence of clustering on search time. The performance test results show that NOHIS-tree performs better than SR-tree. Tests also show that NOHIS-tree maintains its performance in high-dimensional spaces. We include performance tests that try to determine the number of clusters in NOHIS-tree that gives the best search time.

Keywords: High-dimensional indexing, k-nearest neighbors search.

Downloads: 1112
31 Solving Facility Location Problem on Cluster Computing

Authors: Ei Phyo Wai, Nay Min Tun

Abstract:

Computing the facility location problem simultaneously for every location in a country is not easy. This paper describes how to solve the problem using cluster computing. The technique is to design a parallel algorithm that uses local search with the single swap method in order to solve the problem on clusters. The parallel implementation uses a portable parallel programming interface, the Message Passing Interface (MPI), on Microsoft Windows Compute Cluster. The paper presents the algorithm, which uses local search with the single swap method, and the implementation of a system for selecting a facility to be opened using MPI on a cluster. If large datasets are considered, the process of calculating a reasonable cost for a facility becomes time consuming. The results show that parallel computation of the facility location problem on a cluster achieves speedup and scales well as problem size increases.
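A sequential sketch of local search with single swaps is shown below, assuming a k-median style objective in which each customer is served by its cheapest open facility. The cost matrix, the objective, and the first-improvement rule are assumptions for illustration; the paper's exact formulation and its MPI parallelisation are not reproduced.

```python
def total_cost(open_sites, dist):
    """Sum over customers of the cost to their cheapest open facility."""
    return sum(min(row[s] for s in open_sites) for row in dist)


def single_swap_local_search(dist, k):
    """Swap one open site for one closed site while that lowers the total cost."""
    n_sites = len(dist[0])
    open_sites = set(range(k))  # arbitrary starting solution
    improved = True
    while improved:
        improved = False
        best = total_cost(open_sites, dist)
        for closing in sorted(open_sites):
            for opening in range(n_sites):
                if opening in open_sites:
                    continue
                candidate = (open_sites - {closing}) | {opening}
                if total_cost(candidate, dist) < best:
                    # Take the first improving swap and restart the scan.
                    open_sites, improved = candidate, True
                    break
            if improved:
                break
    return open_sites, total_cost(open_sites, dist)


# Hypothetical costs: dist[customer][site].
dist = [
    [1, 5, 9],
    [2, 4, 8],
    [9, 1, 2],
    [8, 2, 1],
]
solution = single_swap_local_search(dist, k=1)
```

Each candidate swap is evaluated independently of the others, so the evaluations could in principle be distributed across processes, which is where a message-passing implementation would fit.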

Keywords: cluster, cost, demand, facility location

Downloads: 1072
30 Multidimensional Data Mining by Means of Randomly Travelling Hyper-Ellipsoids

Authors: Pavel Y. Tabakov, Kevin Duffy

Abstract:

The present study presents a new approach to automatic data clustering and classification problems in large and complex databases and, at the same time, derives specific types of explicit rules describing each cluster. The method works well in both sparse and dense multidimensional data spaces. The members of the data space can be of the same nature or represent different classes. A number of N-dimensional ellipsoids are used for enclosing the data clouds. Due to the geometry of an ellipsoid and its free rotation in space the detection of clusters becomes very efficient. The method is based on genetic algorithms that are used for the optimization of location, orientation and geometric characteristics of the hyper-ellipsoids. The proposed approach can serve as a basis for the development of general knowledge systems for discovering hidden knowledge and unexpected patterns and rules in various large databases.

Keywords: Classification, clustering, data mining, genetic algorithms.

Downloads: 1378
29 Application and Limitation of Parallel Modeling in Multidimensional Sequential Pattern

Authors: Mahdi Esmaeili, Mansour Tarafdar

Abstract:

The goal of data mining algorithms is to discover useful information embedded in large databases. One of the most important data mining problems is the discovery of frequently occurring patterns in sequential data. In a multidimensional sequence each event depends on more than one dimension. The search space is quite large and serial algorithms are not scalable for very large datasets. To address this, it is necessary to study scalable parallel implementations of sequence mining algorithms. In this paper, we present a model for multidimensional sequences and describe a parallel algorithm based on data parallelism. Simulation experiments show good load balancing and scalable, acceptable speedup over different numbers of processors and problem sizes, and demonstrate that our approach can work efficiently in a real parallel computing environment.

Keywords: Sequential Patterns, Data Mining, Parallel Algorithm, Multidimensional Sequence Data

Downloads: 1021
28 A Bayesian Hierarchical 13COBT to Correct Estimates Associated with a Delayed Gastric Emptying

Authors: Leslie J. C. Bluck, Sarah J. Jackson, Georgios Vlasakakis, Adrian Mander

Abstract:

The use of a Bayesian Hierarchical Model (BHM) to interpret breath measurements obtained during a 13C Octanoic Breath Test (13COBT) is demonstrated. The statistical analysis was implemented using WinBUGS, a commercially available computer package for Bayesian inference. A hierarchical setting was adopted where poorly defined parameters associated with a delayed Gastric Emptying (GE) were able to "borrow" strength from global distributions. This proved to be a sufficient tool to correct the model failures and data inconsistencies apparent in conventional analyses employing a non-linear least squares (NLS) technique. Direct comparison of two parameters describing gastric emptying (tlag, the lag phase, and t1/2, the half-emptying time) revealed a strong correlation between the two methods. Despite our large dataset (n = 164), Bayesian modeling was fast and provided a successful fit for all subjects. In contrast, NLS failed to return acceptable estimates in cases where GE was delayed.

Keywords: Bayesian hierarchical analysis, 13COBT, Gastric emptying, WinBUGS.

Downloads: 1098
27 Generating Concept Trees from Dynamic Self-organizing Map

Authors: Norashikin Ahmad, Damminda Alahakoon

Abstract:

The self-organizing map (SOM) provides both clustering and visualization capabilities in mining data. Dynamic self-organizing maps such as the Growing Self-organizing Map (GSOM) have been developed to overcome the problem of fixed structure in SOM and to enable better representation of the discovered patterns. However, in mining large datasets or historical data, the hierarchical structure of the data is also useful for viewing the cluster formation at different levels of abstraction. In this paper, we present a technique to generate concept trees from the GSOM. The formation of trees from different spread factor values of GSOM is also investigated and the quality of the trees is analyzed. The results show that concept trees can be generated from GSOM, thus eliminating the need to re-cluster the data from scratch to obtain a hierarchical view of the data under study.

Keywords: dynamic self-organizing map, concept formation, clustering.

Downloads: 1093
26 Scaling up Detection Rates and Reducing False Positives in Intrusion Detection using NBTree

Authors: Dewan Md. Farid, Nguyen Huu Hoa, Jerome Darmont, Nouria Harbi, Mohammad Zahidur Rahman

Abstract:

In this paper, we present a new learning algorithm for anomaly-based network intrusion detection using an improved self-adaptive naïve Bayesian tree (NBTree), which induces a hybrid of decision tree and naïve Bayesian classifier. The proposed approach balances detection across different attack types and keeps false positives at an acceptable level in intrusion detection. On complex and dynamic large intrusion detection datasets, the detection accuracy of the naïve Bayesian classifier does not scale up as well as that of the decision tree. It has been shown in other problem domains that the naïve Bayesian tree improves classification rates on large datasets. In a naïve Bayesian tree, internal nodes split the data as in regular decision trees, but the leaves contain naïve Bayesian classifiers. The experimental results on the KDD99 benchmark network intrusion detection dataset demonstrate that this new approach scales up the detection rates for different attack types and reduces false positives in network intrusion detection.
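
The structure described above (decision-tree splits with naïve Bayesian classifiers at the leaves) can be illustrated with a minimal Python sketch. It is not the authors' improved self-adaptive algorithm: the split feature is fixed by hand instead of being selected, the tree has a single level, and the class names `NaiveBayesLeaf` and `NBTreeStump` as well as the toy connection records are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLeaf:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, rows, labels, n_values):
        self.class_counts = Counter(labels)
        self.n_values = n_values  # number of distinct values per feature
        # (class, feature index) -> counts of each observed value
        self.feature_counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.feature_counts[(y, j)][v] += 1

    def predict(self, row):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, cy in self.class_counts.items():
            score = math.log(cy / total)  # log prior
            for j, v in enumerate(row):
                count = self.feature_counts[(y, j)][v]
                score += math.log((count + 1) / (cy + self.n_values[j]))
            if score > best_score:
                best, best_score = y, score
        return best


class NBTreeStump:
    """One split node on a hand-picked feature; each branch ends in a naive Bayes leaf."""

    def __init__(self, rows, labels, split_feature):
        self.split_feature = split_feature
        n_values = [len({r[j] for r in rows}) for j in range(len(rows[0]))]
        groups = defaultdict(lambda: ([], []))
        for row, y in zip(rows, labels):
            groups[row[split_feature]][0].append(row)
            groups[row[split_feature]][1].append(y)
        self.leaves = {v: NaiveBayesLeaf(rs, ys, n_values)
                       for v, (rs, ys) in groups.items()}
        # Used when a record carries a split value never seen in training.
        self.fallback = NaiveBayesLeaf(rows, labels, n_values)

    def predict(self, row):
        leaf = self.leaves.get(row[self.split_feature], self.fallback)
        return leaf.predict(row)


# Hypothetical toy records: (protocol, connection flag) -> label
rows = [("tcp", "SF"), ("tcp", "S0"), ("tcp", "S0"),
        ("udp", "SF"), ("udp", "SF"), ("icmp", "SF")]
labels = ["normal", "attack", "attack", "normal", "normal", "attack"]

stump = NBTreeStump(rows, labels, split_feature=0)
prediction = stump.predict(("tcp", "S0"))
```

A full NBTree would choose each split by comparing the estimated utility of splitting against keeping a single naïve Bayesian leaf, and would recurse; the sketch only shows how prediction is routed through a split to a leaf classifier.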

Keywords: Detection rates, false positives, network intrusion detection, naïve Bayesian tree.

Downloads: 1892
25 Mining Sequential Patterns Using Hybrid Evolutionary Algorithm

Authors: Mourad Ykhlef, Hebah ElGibreen

Abstract:

Mining sequential patterns in large databases has become an important data mining task with broad applications. It describes potential sequenced relationships among items in a database. Many different algorithms have been introduced for this task. Conventional algorithms can find the exact optimal sequential pattern rule, but they take a long time, particularly when applied to large databases. More recently, evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithm have been proposed and applied to solve this problem. This paper introduces a new kind of hybrid evolutionary algorithm that combines Genetic Algorithm (GA) with Particle Swarm Optimization (PSO) to mine sequential patterns, in order to improve the convergence speed of evolutionary algorithms. This algorithm is referred to as SP-GAPSO.

Keywords: Genetic Algorithm, Hybrid Evolutionary Algorithm, Particle Swarm Optimization algorithm, Sequential Pattern mining.

Downloads: 1625
24 An Efficient Graph Query Algorithm Based on Important Vertices and Decision Features

Authors: Xiantong Li, Jianzhong Li

Abstract:

Graphs have become increasingly important in modeling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. Unlike existing methods, our approach, called VFM (Vertex to Frequent Feature Mapping), uses vertices and decision features as the basic indexing features. VFM constructs two mappings between vertices and frequent features to answer graph queries. The VFM approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. The results show that the proposed method not only avoids enumerating the subgraphs of the query graph, but also effectively reduces the number of subgraph isomorphism tests between the query graph and the graphs in the candidate answer set during the verification stage.

Keywords: Decision Feature, Frequent Feature, Graph Dataset, Graph Query

Downloads: 1447
23 Statistical Process Optimization Through Multi-Response Surface Methodology

Authors: S. Raissi, R. Eslami Farsani

Abstract:

In recent years, response surface methodology (RSM) has attracted the attention of many quality engineers in different industries. Most of the published literature on robust design methodology is concerned with optimization of a single response or quality characteristic, often the one most critical to consumers. For most products, however, quality is multidimensional, so it is common to observe multiple responses in an experimental situation. Through this paper, interested readers will become familiar with this methodology via a survey of the most cited technical papers. It is believed that the proposed procedure in this study can resolve a complex parameter design problem with more than two responses. It can be applied to those areas where there are large data sets and a number of responses are to be optimized simultaneously. In addition, the proposed procedure is relatively simple and can be implemented easily by using ready-made standard statistical packages.

Keywords: Multi-Response Surface Methodology (MRSM), Design of Experiments (DOE), Process modeling, Quality improvement, Robust Design.

Downloads: 3781
22 Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

Authors: Jehad A. H. Hammad, Nur'Aini binti Abdul Rashid

Abstract:

With the rapid development in the field of life sciences and the flood of genomic information, the need for faster and more scalable searching methods has become urgent. One of the approaches that has been investigated is indexing. Indexing methods fall into three categories: length-based index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up index building and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution that saves both time and space. The results show that the size of the index is smaller than the size of the dataset when the size of the N-gram is 5 and 6. The results of the parallel N-gram transformation algorithm indicate that the use of parallel programming with large datasets is promising and can be improved further.
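
As a rough illustration of the inverted index table mentioned above, the following Python sketch maps every N-gram to the set of sequences that contain it and uses the table to filter candidates for a query. It is a sequential sketch under stated assumptions, not the authors' parallel algorithm: the function names `build_ngram_index` and `candidates` and the three toy sequences are hypothetical.

```python
from collections import defaultdict


def build_ngram_index(sequences, n=5):
    """Map each n-gram to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(seq_id)
    return index


def candidates(index, query, n=5):
    """Return ids of sequences that contain every n-gram of the query.

    This is only a filter step: a candidate still has to be checked
    against the query before it counts as a match.
    """
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()  # query shorter than n: nothing to look up
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical toy protein fragments
sequences = {"p1": "MKTAYIAKQR", "p2": "GSHMKTAYIA", "p3": "AAAAAAAAAA"}
index = build_ngram_index(sequences, n=5)
hits = candidates(index, "MKTAYIA", n=5)
```

Because each sequence contributes its N-grams independently, the index-building loop is a natural candidate for partitioning across workers, with the per-worker tables merged afterwards.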

Keywords: Biological sequence, Database index, N-gram indexing, Parallel computing, Sequence retrieval.

Downloads: 1700
21 Developing New Processes and Optimizing Performance Using Response Surface Methodology

Authors: S. Raissi

Abstract:

Response surface methodology (RSM) is a very efficient tool for gaining practical insight into developing new processes and optimizing them. This methodology can help engineers build a mathematical model that represents the behavior of a system as a function of process parameters. In this paper, the sequential nature of RSM is surveyed for process engineers, and its relationship to design of experiments (DOE), regression analysis and robust design is reviewed. The proposed four-step procedure in two different phases can help system analysts resolve parameter design problems involving responses. To check the accuracy of the designed model, residual analysis and the prediction error sum of squares (PRESS) are described. It is believed that the proposed procedure in this study can resolve a complex parameter design problem with one or more responses. It can be applied to those areas where there are large data sets and a number of responses are to be optimized simultaneously. In addition, the proposed procedure is relatively simple and can be implemented easily by using ready-made standard statistical packages.
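
For readers unfamiliar with PRESS, the sketch below computes it by direct leave-one-out refitting: each observation is left out in turn, the model is refitted on the rest, and the squared prediction error for the left-out point is accumulated. A straight-line model is assumed here purely to keep the example short; response surface models are typically second-order, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def sse(xs, ys):
    """Residual sum of squares of the model fitted to all points."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def press(xs, ys):
    """Prediction error sum of squares via leave-one-out refitting."""
    total = 0.0
    for i in range(len(xs)):
        # Refit without observation i, then predict it.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total


# Hypothetical data: one exactly linear response, one noisy response
xs = [0.0, 1.0, 2.0, 3.0]
exact = [1.0, 3.0, 5.0, 7.0]
noisy = [1.0, 3.5, 4.5, 7.0]

press_exact = press(xs, exact)
press_noisy = press(xs, noisy)
```

PRESS is never smaller than the ordinary residual sum of squares, so a PRESS value far above it suggests the model depends heavily on individual observations and may predict new runs poorly.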

Keywords: Response Surface Methodology (RSM), Design of Experiments (DOE), Process modeling, Process setting, Process optimization.

Downloads: 1431