Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 7029

Search results for: multivariate categorical data

7029 Landscape Data Transformation: Categorical Descriptions to Numerical Descriptors

Authors: Dennis A. Apuan

Abstract:

Categorical data based on description of the agricultural landscape imposed some mathematical and analytical limitations. This problem however can be overcome by data transformation through coding scheme and the use of non-parametric multivariate approach. The present study describes data transformation from qualitative to numerical descriptors. In a collection of 103 random soil samples over a 60 hectare field, categorical data were obtained from the following variables: levels of nitrogen, phosphorus, potassium, pH, hue, chroma, value and data on topography, vegetation type, and the presence of rocks. Categorical data were coded, and Spearman-s rho correlation was then calculated using PAST software ver. 1.78 in which Principal Component Analysis was based. Results revealed successful data transformation, generating 1030 quantitative descriptors. Visualization based on the new set of descriptors showed clear differences among sites, and amount of variation was successfully measured. Possible applications of data transformation are discussed.

Keywords: data transformation, numerical descriptors, principalcomponent analysis

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1309
7028 A Simplified Higher-Order Markov Chain Model

Authors: Chao Wang, Ting-Zhu Huang, Chen Jia

Abstract:

In this paper, we present a simplified higher-order Markov chain model for multiple categorical data sequences also called as simplified higher-order multivariate Markov chain model.

Keywords: Higher-order multivariate Markov chain model, Categorical data sequences, Multivariate Markov chain.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2922
7027 Incremental Algorithm to Cluster the Categorical Data with Frequency Based Similarity Measure

Authors: S.Aranganayagi, K.Thangavel

Abstract:

Clustering categorical data is more complicated than the numerical clustering because of its special properties. Scalability and memory constraint is the challenging problem in clustering large data set. This paper presents an incremental algorithm to cluster the categorical data. Frequencies of attribute values contribute much in clustering similar categorical objects. In this paper we propose new similarity measures based on the frequencies of attribute values and its cardinalities. The proposed measures and the algorithm are experimented with the data sets from UCI data repository. Results prove that the proposed method generates better clusters than the existing one.

Keywords: Clustering, Categorical, Incremental, Frequency, Domain

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1619
7026 Cluster Analysis for the Statistical Modeling of Aesthetic Judgment Data Related to Comics Artists

Authors: George E. Tsekouras, Evi Sampanikou

Abstract:

We compare three categorical data clustering algorithms with respect to the problem of classifying cultural data related to the aesthetic judgment of comics artists. Such a classification is very important in Comics Art theory since the determination of any classes of similarities in such kind of data will provide to art-historians very fruitful information of Comics Art-s evolution. To establish this, we use a categorical data set and we study it by employing three categorical data clustering algorithms. The performances of these algorithms are compared each other, while interpretations of the clustering results are also given.

Keywords: Aesthetic judgment, comics artists, cluster analysis, categorical data.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1415
7025 Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Authors: Semeh Ben Salem, Sami Naouali, Moetez Sallami

Abstract:

Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords: Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3173
7024 Categorical Missing Data Imputation Using Fuzzy Neural Networks with Numerical and Categorical Inputs

Authors: Pilar Rey-del-Castillo, Jesús Cardeñosa

Abstract:

There are many situations where input feature vectors are incomplete and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to perform categorical missing data imputation from numerical and categorical variables. The imputations are based on Simpson-s fuzzy min-max neural networks where the input variables for learning and classification are just numerical. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation and a new architecture. The procedure is tested and compared with others using opinion poll data.

Keywords: Classifier, imputation techniques, fuzzy systems, fuzzy min-max neural networks.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1487
7023 Categorical Clustering By Converting Associated Information

Authors: Dongmin Cai, Stephen S-T Yau

Abstract:

Lacking an inherent “natural" dissimilarity measure between objects in categorical dataset presents special difficulties in clustering analysis. However, each categorical attributes from a given dataset provides natural probability and information in the sense of Shannon. In this paper, we proposed a novel method which heuristically converts categorical attributes to numerical values by exploiting such associated information. We conduct an experimental study with real-life categorical dataset. The experiment demonstrates the effectiveness of our approach.

Keywords: Categorical, Clustering, Converting, Information

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1144
7022 Qualitative Data Analysis for Health Care Services

Authors: Taner Ersoz, Filiz Ersoz

Abstract:

This study was designed enable application of multivariate technique in the interpretation of categorical data for measuring health care services satisfaction in Turkey. The data was collected from a total of 17726 respondents. The establishment of the sample group and collection of the data were carried out by a joint team from The Ministry of Health and Turkish Statistical Institute (Turk Stat) of Turkey. The multiple correspondence analysis (MCA) was used on the data of 2882 respondents who answered the questionnaire in full. The multiple correspondence analysis indicated that, in the evaluation of health services females, public employees, younger and more highly educated individuals were more concerned and complainant than males, private sector employees, older and less educated individuals. Overall 53 % of the respondents were pleased with the improvements in health care services in the past three years. This study demonstrates the public consciousness in health services and health care satisfaction in Turkey. It was found that most the respondents were pleased with the improvements in health care services over the past three years. Awareness of health service quality increases with education levels. Older individuals and males would appear to have lower expectancies in health services.

Keywords: Multiple correspondence analysis, optimal scaling, multivariate categorical data, health care services, health satisfaction survey, statistical visualizing, Turkey.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 547
7021 Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

Authors: S.Aranganayagi, K.Thangavel

Abstract:

K-Modes is an extension of K-Means clustering algorithm, developed to cluster the categorical data, where the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching or mismatching measure. Weight of attribute values contribute much in clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency of attribute values in the cluster and in the data set. The new weighted measure is experimented with the data sets obtained from the UCI data repository. The results are compared with K-Modes and K-representative, which show that the new measure generates clusters with high purity.

Keywords: Clustering, categorical data, K-Modes, weighted dissimilarity measure

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3456
7020 Fuzzy Clustering of Categorical Attributes and its Use in Analyzing Cultural Data

Authors: George E. Tsekouras, Dimitris Papageorgiou, Sotiris Kotsiantis, Christos Kalloniatis, Panagiotis Pintelas

Abstract:

We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters.

Keywords: Categorical data, cultural data, fuzzy logic clustering, fuzzy c-modes, cluster validity index.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1445
7019 Categorical Data Modeling: Logistic Regression Software

Authors: Abdellatif Tchantchane

Abstract:

A Matlab based software for logistic regression is developed to enhance the process of teaching quantitative topics and assist researchers with analyzing wide area of applications where categorical data is involved. The software offers an option of performing stepwise logistic regression to select the most significant predictors. The software includes a feature to detect influential observations in data, and investigates the effect of dropping or misclassifying an observation on a predictor variable. The input data may consist either as a set of individual responses (yes/no) with the predictor variables or as grouped records summarizing various categories for each unique set of predictor variables' values. Graphical displays are used to output various statistical results and to assess the goodness of fit of the logistic regression model. The software recognizes possible convergence constraints when present in data, and the user is notified accordingly.

Keywords: Logistic regression, Matlab, Categorical data, Influential observation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1653
7018 An AK-Chart for the Non-Normal Data

Authors: Chia-Hau Liu, Tai-Yue Wang

Abstract:

Traditional multivariate control charts assume that measurement from manufacturing processes follows a multivariate normal distribution. However, this assumption may not hold or may be difficult to verify because not all the measurement from manufacturing processes are normal distributed in practice. This study develops a new multivariate control chart for monitoring the processes with non-normal data. We propose a mechanism based on integrating the one-class classification method and the adaptive technique. The adaptive technique is used to improve the sensitivity to small shift on one-class classification in statistical process control. In addition, this design provides an easy way to allocate the value of type I error so it is easier to be implemented. Finally, the simulation study and the real data from industry are used to demonstrate the effectiveness of the propose control charts.

Keywords: Multivariate control chart, statistical process control, one-class classification method.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1973
7017 Model of Optimal Centroids Approach for Multivariate Data Classification

Authors: Pham Van Nha, Le Cam Binh

Abstract:

Particle swarm optimization (PSO) is a population-based stochastic optimization algorithm. PSO was inspired by the natural behavior of birds and fish in migration and foraging for food. PSO is considered as a multidisciplinary optimization model that can be applied in various optimization problems. PSO’s ideas are simple and easy to understand but PSO is only applied in simple model problems. We think that in order to expand the applicability of PSO in complex problems, PSO should be described more explicitly in the form of a mathematical model. In this paper, we represent PSO in a mathematical model and apply in the multivariate data classification. First, PSOs general mathematical model (MPSO) is analyzed as a universal optimization model. Then, Model of Optimal Centroids (MOC) is proposed for the multivariate data classification. Experiments were conducted on some benchmark data sets to prove the effectiveness of MOC compared with several proposed schemes.

Keywords: Analysis of optimization, artificial intelligence-based optimization, optimization for learning and data analysis, global optimization.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 360
7016 Applying Gibbs Sampler for Multivariate Hierarchical Linear Model

Authors: Satoshi Usami

Abstract:

Among various HLM techniques, the Multivariate Hierarchical Linear Model (MHLM) is desirable to use, particularly when multivariate criterion variables are collected and the covariance structure has information valuable for data analysis. In order to reflect prior information or to obtain stable results when the sample size and the number of groups are not sufficiently large, the Bayes method has often been employed in hierarchical data analysis. In these cases, although the Markov Chain Monte Carlo (MCMC) method is a rather powerful tool for parameter estimation, Procedures regarding MCMC have not been formulated for MHLM. For this reason, this research presents concrete procedures for parameter estimation through the use of the Gibbs samplers. Lastly, several future topics for the use of MCMC approach for HLM is discussed.

Keywords: Gibbs sampler, Hierarchical Linear Model, Markov Chain Monte Carlo, Multivariate Hierarchical Linear Model

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1629
7015 Multivariate Analysis of Spectroscopic Data for Agriculture Applications

Authors: Asmaa M. Hussein, Amr Wassal, Ahmed Farouk Al-Sadek, A. F. Abd El-Rahman

Abstract:

In this study, a multivariate analysis of potato spectroscopic data was presented to detect the presence of brown rot disease or not. Near-Infrared (NIR) spectroscopy (1,350-2,500 nm) combined with multivariate analysis was used as a rapid, non-destructive technique for the detection of brown rot disease in potatoes. Spectral measurements were performed in 565 samples, which were chosen randomly at the infection place in the potato slice. In this study, 254 infected and 311 uninfected (brown rot-free) samples were analyzed using different advanced statistical analysis techniques. The discrimination performance of different multivariate analysis techniques, including classification, pre-processing, and dimension reduction, were compared. Applying a random forest algorithm classifier with different pre-processing techniques to raw spectra had the best performance as the total classification accuracy of 98.7% was achieved in discriminating infected potatoes from control.

Keywords: Brown rot disease, NIR spectroscopy, potato, random forest.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 338
7014 Developing Pedotransfer Functions for Estimating Some Soil Properties using Artificial Neural Network and Multivariate Regression Approaches

Authors: Fereydoon Sarmadian, Ali Keshavarzi

Abstract:

Study of soil properties like field capacity (F.C.) and permanent wilting point (P.W.P.) play important roles in study of soil moisture retention curve. Although these parameters can be measured directly, their measurement is difficult and expensive. Pedotransfer functions (PTFs) provide an alternative by estimating soil parameters from more readily available soil data. In this investigation, 70 soil samples were collected from different horizons of 15 soil profiles located in the Ziaran region, Qazvin province, Iran. The data set was divided into two subsets for calibration (80%) and testing (20%) of the models and their normality were tested by Kolmogorov-Smirnov method. Both multivariate regression and artificial neural network (ANN) techniques were employed to develop the appropriate PTFs for predicting soil parameters using easily measurable characteristics of clay, silt, O.C, S.P, B.D and CaCO3. The performance of the multivariate regression and ANN models was evaluated using an independent test data set. In order to evaluate the models, root mean square error (RMSE) and R2 were used. The comparison of RSME for two mentioned models showed that the ANN model gives better estimates of F.C and P.W.P than the multivariate regression model. The value of RMSE and R2 derived by ANN model for F.C and P.W.P were (2.35, 0.77) and (2.83, 0.72), respectively. The corresponding values for multivariate regression model were (4.46, 0.68) and (5.21, 0.64), respectively. Results showed that ANN with five neurons in hidden layer had better performance in predicting soil properties than multivariate regression.

Keywords: Artificial neural network, Field capacity, Permanentwilting point, Pedotransfer functions.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1570
7013 Nonparametric Control Chart Using Density Weighted Support Vector Data Description

Authors: Myungraee Cha, Jun Seok Kim, Seung Hwan Park, Jun-Geol Baek

Abstract:

In manufacturing industries, development of measurement leads to increase the number of monitoring variables and eventually the importance of multivariate control comes to the fore. Statistical process control (SPC) is one of the most widely used as multivariate control chart. Nevertheless, SPC is restricted to apply in processes because its assumption of data as following specific distribution. Unfortunately, process data are composed by the mixture of several processes and it is hard to estimate as one certain distribution. To alternative conventional SPC, therefore, nonparametric control chart come into the picture because of the strength of nonparametric control chart, the absence of parameter estimation. SVDD based control chart is one of the nonparametric control charts having the advantage of flexible control boundary. However,basic concept of SVDD has been an oversight to the important of data characteristic, density distribution. Therefore, we proposed DW-SVDD (Density Weighted SVDD) to cover up the weakness of conventional SVDD. DW-SVDD makes a new attempt to consider dense of data as introducing the notion of density Weight. We extend as control chart using new proposed SVDD and a simulation study of various distributional data is conducted to demonstrate the improvement of performance.

Keywords: Density estimation, Multivariate control chart, Oneclass classification, Support vector data description (SVDD)

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1878
7012 On the Performance of Information Criteria in Latent Segment Models

Authors: Jaime R. S. Fonseca

Abstract:

Nevertheless the widespread application of finite mixture models in segmentation, finite mixture model selection is still an important issue. In fact, the selection of an adequate number of segments is a key issue in deriving latent segments structures and it is desirable that the selection criteria used for this end are effective. In order to select among several information criteria, which may support the selection of the correct number of segments we conduct a simulation study. In particular, this study is intended to determine which information criteria are more appropriate for mixture model selection when considering data sets with only categorical segmentation base variables. The generation of mixtures of multinomial data supports the proposed analysis. As a result, we establish a relationship between the level of measurement of segmentation variables and some (eleven) information criteria-s performance. The criterion AIC3 shows better performance (it indicates the correct number of the simulated segments- structure more often) when referring to mixtures of multinomial segmentation base variables.

Keywords: Quantitative Methods, Multivariate Data Analysis, Clustering, Finite Mixture Models, Information Theoretical Criteria, Simulation experiments.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1345
7011 Comparison of Artificial Neural Network and Multivariate Regression Methods in Prediction of Soil Cation Exchange Capacity

Authors: Ali Keshavarzi, Fereydoon Sarmadian

Abstract:

Investigation of soil properties like Cation Exchange Capacity (CEC) plays important roles in study of environmental reaserches as the spatial and temporal variability of this property have been led to development of indirect methods in estimation of this soil characteristic. Pedotransfer functions (PTFs) provide an alternative by estimating soil parameters from more readily available soil data. 70 soil samples were collected from different horizons of 15 soil profiles located in the Ziaran region, Qazvin province, Iran. Then, multivariate regression and neural network model (feedforward back propagation network) were employed to develop a pedotransfer function for predicting soil parameter using easily measurable characteristics of clay and organic carbon. The performance of the multivariate regression and neural network model was evaluated using a test data set. In order to evaluate the models, root mean square error (RMSE) was used. The value of RMSE and R2 derived by ANN model for CEC were 0.47 and 0.94 respectively, while these parameters for multivariate regression model were 0.65 and 0.88 respectively. Results showed that artificial neural network with seven neurons in hidden layer had better performance in predicting soil cation exchange capacity than multivariate regression.

Keywords: Easily measurable characteristics, Feed-forwardback propagation, Pedotransfer functions, CEC.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1813
7010 Multivariate Assessment of Mathematics Test Scores of Students in Qatar

Authors: Ali Rashash Alzahrani, Elizabeth Stojanovski

Abstract:

Data on various aspects of education are collected at the institutional and government level regularly. In Australia, for example, students at various levels of schooling undertake examinations in numeracy and literacy as part of NAPLAN testing, enabling longitudinal assessment of such data as well as comparisons between schools and states within Australia. Another source of educational data collected internationally is via the PISA study which collects data from several countries when students are approximately 15 years of age and enables comparisons in the performance of science, mathematics and English between countries as well as ranking of countries based on performance in these standardised tests. As well as student and school outcomes based on the tests taken as part of the PISA study, there is a wealth of other data collected in the study including parental demographics data and data related to teaching strategies used by educators. Overall, an abundance of educational data is available which has the potential to be used to help improve educational attainment and teaching of content in order to improve learning outcomes. A multivariate assessment of such data enables multiple variables to be considered simultaneously and will be used in the present study to help develop profiles of students based on performance in mathematics using data obtained from the PISA study.

Keywords: Cluster analysis, education, mathematics, profiles.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 591
7009 Holistic Face Recognition using Multivariate Approximation, Genetic Algorithms and AdaBoost Classifier: Preliminary Results

Authors: C. Villegas-Quezada, J. Climent

Abstract:

Several works regarding facial recognition have dealt with methods which identify isolated characteristics of the face or with templates which encompass several regions of it. In this paper a new technique which approaches the problem holistically dispensing with the need to identify geometrical characteristics or regions of the face is introduced. The characterization of a face is achieved by randomly sampling selected attributes of the pixels of its image. From this information we construct a set of data, which correspond to the values of low frequencies, gradient, entropy and another several characteristics of pixel of the image. Generating a set of “p" variables. The multivariate data set with different polynomials minimizing the data fitness error in the minimax sense (L∞ - Norm) is approximated. With the use of a Genetic Algorithm (GA) it is able to circumvent the problem of dimensionality inherent to higher degree polynomial approximations. The GA yields the degree and values of a set of coefficients of the polynomials approximating of the image of a face. By finding a family of characteristic polynomials from several variables (pixel characteristics) for each face (say Fi ) in the data base through a resampling process the system in use, is trained. A face (say F ) is recognized by finding its characteristic polynomials and using an AdaBoost Classifier from F -s polynomials to each of the Fi -s polynomials. The winner is the polynomial family closer to F -s corresponding to target face in data base.

Keywords: AdaBoost Classifier, Holistic Face Recognition, Minimax Multivariate Approximation, Genetic Algorithm.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1277
7008 A Multivariate Moving Average Control Chart for Photovoltaic Processes

Authors: Chunchom Pongchavalit

Abstract:

For the electrical metrics that describe photovoltaic cell performance are inherently multivariate in nature, use of a univariate, or one variable, statistical process control chart can have important limitations. Development of a comprehensive process control strategy is known to be significantly beneficial to reducing process variability that ultimately drives up the manufacturing cost photovoltaic cells. The multivariate moving average or MMA chart, is applied to the electrical metrics of photovoltaic cells to illustrate the improved sensitivity on process variability this method of control charting offers. The result show the ability of the MMA chart to expand to as any variables as needed, suggests an application with multiple photovoltaic electrical metrics being used in concert to determine the processes state of control.

Keywords: The multivariate moving average control chart, Photovoltaic processes control, Multivariate system.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1066
7007 Effects of Video Games and Online Chat on Mathematics Performance in High School: An Approach of Multivariate Data Analysis

Authors: Lina Wu, Wenyi Lu, Ye Li

Abstract:

Regarding heavy video game players for boys and super online chat lovers for girls as a symbolic phrase in the current adolescent culture, this project of data analysis verifies the displacement effect on deteriorating mathematics performance. To evaluate correlation or regression coefficients between a factor of playing video games or chatting online and mathematics performance compared with other factors, we use multivariate analysis technique and take gender difference into account. We find the most important reason for the negative sign of the displacement effect on mathematics performance due to students’ poor academic background. Statistical analysis methods in this project could be applied to study internet users’ academic performance from the high school education to the college education.

Keywords: Correlation coefficients, displacement effect, gender difference, multivariate analysis technique, regression coefficients.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1853
7006 Diagnosis of Multivariate Process via Nonlinear Kernel Method Combined with Qualitative Representation of Fault Patterns

Authors: Hyun-Woo Cho

Abstract:

The fault detection and diagnosis of complicated production processes is one of essential tasks needed to run the process safely with good final product quality. Unexpected events occurred in the process may have a serious impact on the process. In this work, triangular representation of process measurement data obtained in an on-line basis is evaluated using simulation process. The effect of using linear and nonlinear reduced spaces is also tested. Their diagnosis performance was demonstrated using multivariate fault data. It has shown that the nonlinear technique based diagnosis method produced more reliable results and outperforms linear method. The use of appropriate reduced space yielded better diagnosis performance. The presented diagnosis framework is different from existing ones in that it attempts to extract the fault pattern in the reduced space, not in the original process variable space. The use of reduced model space helps to mitigate the sensitivity of the fault pattern to noise.

Keywords: Real-time Fault diagnosis, triangular representation of patterns in reduced spaces, Nonlinear kernel technique, multivariate statistical modeling.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1375
7005 Clustering Multivariate Empiric Characteristic Functions for Multi-Class SVM Classification

Authors: María-Dolores Cubiles-de-la-Vega, Rafael Pino-Mejías, Esther-Lydia Silva-Ramírez

Abstract:

A dissimilarity measure between the empiric characteristic functions of the subsamples associated to the different classes in a multivariate data set is proposed. This measure can be efficiently computed, and it depends on all the cases of each class. It may be used to find groups of similar classes, which could be joined for further analysis, or it could be employed to perform an agglomerative hierarchical cluster analysis of the set of classes. The final tree can serve to build a family of binary classification models, offering an alternative approach to the multi-class SVM problem. We have tested this dendrogram based SVM approach with the oneagainst- one SVM approach over four publicly available data sets, three of them being microarray data. Both performances have been found equivalent, but the first solution requires a smaller number of binary SVM models.

Keywords: Cluster Analysis, Empiric Characteristic Function, Multi-class SVM, R.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1649
7004 Distribution Sampling of Vector Variance without Duplications

Authors: Erna T. Herdiani, Maman A. Djauhari

Abstract:

In recent years, the use of vector variance as a measure of multivariate variability has received much attention in wide range of statistics. This paper deals with a more economic measure of multivariate variability, defined as vector variance minus all duplication elements. For high dimensional data, this will increase the computational efficiency almost 50 % compared to the original vector variance. Its sampling distribution will be investigated to make its applications possible.

Keywords: Asymptotic distribution, covariance matrix, likelihood ratio test, vector variance.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1322
7003 Generating Normally Distributed Clusters by Means of a Self-organizing Growing Neural Network– An Application to Market Segmentation –

Authors: Reinhold Decker, Christian Holsing, Sascha Lerke

Abstract:

This paper presents a new growing neural network for cluster analysis and market segmentation, which optimizes the size and structure of clusters by iteratively checking them for multivariate normality. We combine the recently published SGNN approach [8] with the basic principle underlying the Gaussian-means algorithm [13] and the Mardia test for multivariate normality [18, 19]. The new approach distinguishes from existing ones by its holistic design and its great autonomy regarding the clustering process as a whole. Its performance is demonstrated by means of synthetic 2D data and by real lifestyle survey data usable for market segmentation.

Keywords: Artificial neural network, clustering, multivariatenormality, market segmentation, self-organization

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1016
7002 Fault Detection of Drinking Water Treatment Process Using PCA and Hotelling's T2 Chart

Authors: Joval P George, Dr. Zheng Chen, Philip Shaw

Abstract:

This paper deals with the application of Principal Component Analysis (PCA) and the Hotelling-s T2 Chart, using data collected from a drinking water treatment process. PCA is applied primarily for the dimensional reduction of the collected data. The Hotelling-s T2 control chart was used for the fault detection of the process. The data was taken from a United Utilities Multistage Water Treatment Works downloaded from an Integrated Program Management (IPM) dashboard system. The analysis of the results show that Multivariate Statistical Process Control (MSPC) techniques such as PCA, and control charts such as Hotelling-s T2, can be effectively applied for the early fault detection of continuous multivariable processes such as Drinking Water Treatment. The software package SIMCA-P was used to develop the MSPC models and Hotelling-s T2 Chart from the collected data.

Keywords: Principal component analysis, hotelling's t2 chart, multivariate statistical process control, drinking water treatment.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2451
7001 System Identification Based on Stepwise Regression for Dynamic Market Representation

Authors: Alexander Efremov

Abstract:

A system for market identification (SMI) is presented. The resulting representations are multivariable dynamic demand models. The market specifics are analyzed. Appropriate models and identification techniques are chosen. Multivariate static and dynamic models are used to represent the market behavior. The steps of the first stage of SMI, named data preprocessing, are mentioned. Next, the second stage, which is the model estimation, is considered in more details. Stepwise linear regression (SWR) is used to determine the significant cross-effects and the orders of the model polynomials. The estimates of the model parameters are obtained by a numerically stable estimator. Real market data is used to analyze SMI performance. The main conclusion is related to the applicability of multivariate dynamic models for representation of market systems.

Keywords: market identification, dynamic models, stepwise regression.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1362
7000 Clustering Categorical Data Using Hierarchies (CLUCDUH)

Authors: Gökhan Silahtaroğlu

Abstract:

Clustering large populations is an important problem when the data contain noise and different shapes. A good clustering algorithm or approach should be efficient enough to detect clusters sensitively. Besides space complexity, time complexity also gains importance as the size grows. Using hierarchies we developed a new algorithm to split attributes according to the values they have and choosing the dimension for splitting so as to divide the database roughly into equal parts as much as possible. At each node we calculate some certain descriptive statistical features of the data which reside and by pruning we generate the natural clusters with a complexity of O(n).

Keywords: Clustering, tree, split, pruning, entropy, gini.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1244