Search results for: model based clustering
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 36597

Search results for: model based clustering

36597 Fuzzy Optimization Multi-Objective Clustering Ensemble Model for Multi-Source Data Analysis

Authors: C. B. Le, V. N. Pham

Abstract:

In modern data analysis, multi-source data appears more and more in real applications. Multi-source data clustering has emerged as a important issue in the data mining and machine learning community. Different data sources provide information about different data. Therefore, multi-source data linking is essential to improve clustering performance. However, in practice multi-source data is often heterogeneous, uncertain, and large. This issue is considered a major challenge from multi-source data. Ensemble is a versatile machine learning model in which learning techniques can work in parallel, with big data. Clustering ensemble has been shown to outperform any standard clustering algorithm in terms of accuracy and robustness. However, most of the traditional clustering ensemble approaches are based on single-objective function and single-source data. This paper proposes a new clustering ensemble method for multi-source data analysis. The fuzzy optimized multi-objective clustering ensemble method is called FOMOCE. Firstly, a clustering ensemble mathematical model based on the structure of multi-objective clustering function, multi-source data, and dark knowledge is introduced. Then, rules for extracting dark knowledge from the input data, clustering algorithms, and base clusterings are designed and applied. Finally, a clustering ensemble algorithm is proposed for multi-source data analysis. The experiments were performed on the standard sample data set. The experimental results demonstrate the superior performance of the FOMOCE method compared to the existing clustering ensemble methods and multi-source clustering methods.

Keywords: clustering ensemble, multi-source, multi-objective, fuzzy clustering

Procedia PDF Downloads 135
36596 An Experimental Study on Some Conventional and Hybrid Models of Fuzzy Clustering

Authors: Jeugert Kujtila, Kristi Hoxhalli, Ramazan Dalipi, Erjon Cota, Ardit Murati, Erind Bedalli

Abstract:

Clustering is a versatile instrument in the analysis of collections of data providing insights of the underlying structures of the dataset and enhancing the modeling capabilities. The fuzzy approach to the clustering problem increases the flexibility involving the concept of partial memberships (some value in the continuous interval [0, 1]) of the instances in the clusters. Several fuzzy clustering algorithms have been devised like FCM, Gustafson-Kessel, Gath-Geva, kernel-based FCM, PCM etc. Each of these algorithms has its own advantages and drawbacks, so none of these algorithms would be able to perform superiorly in all datasets. In this paper we will experimentally compare FCM, GK, GG algorithm and a hybrid two-stage fuzzy clustering model combining the FCM and Gath-Geva algorithms. Firstly we will theoretically dis-cuss the advantages and drawbacks for each of these algorithms and we will describe the hybrid clustering model exploiting the advantages and diminishing the drawbacks of each algorithm. Secondly we will experimentally compare the accuracy of the hybrid model by applying it on several benchmark and synthetic datasets.

Keywords: fuzzy clustering, fuzzy c-means algorithm (FCM), Gustafson-Kessel algorithm, hybrid clustering model

Procedia PDF Downloads 468
36595 Hierarchical Clustering Algorithms in Data Mining

Authors: Z. Abdullah, A. R. Hamdan

Abstract:

Clustering is a process of grouping objects and data into groups of clusters to ensure that data objects from the same cluster are identical to each other. Clustering algorithms in one of the areas in data mining and it can be classified into partition, hierarchical, density based, and grid-based. Therefore, in this paper, we do a survey and review for four major hierarchical clustering algorithms called CURE, ROCK, CHAMELEON, and BIRCH. The obtained state of the art of these algorithms will help in eliminating the current problems, as well as deriving more robust and scalable algorithms for clustering.

Keywords: clustering, unsupervised learning, algorithms, hierarchical

Procedia PDF Downloads 837
36594 Hybrid Hierarchical Clustering Approach for Community Detection in Social Network

Authors: Radhia Toujani, Jalel Akaichi

Abstract:

Social Networks generally present a hierarchy of communities. To determine these communities and the relationship between them, detection algorithms should be applied. Most of the existing algorithms, proposed for hierarchical communities identification, are based on either agglomerative clustering or divisive clustering. In this paper, we present a hybrid hierarchical clustering approach for community detection based on both bottom-up and bottom-down clustering. Obviously, our approach provides more relevant community structure than hierarchical method which considers only divisive or agglomerative clustering to identify communities. Moreover, we performed some comparative experiments to enhance the quality of the clustering results and to show the effectiveness of our algorithm.

Keywords: agglomerative hierarchical clustering, community structure, divisive hierarchical clustering, hybrid hierarchical clustering, opinion mining, social network, social network analysis

Procedia PDF Downloads 316
36593 A Learning-Based EM Mixture Regression Algorithm

Authors: Yi-Cheng Tian, Miin-Shen Yang

Abstract:

The mixture likelihood approach to clustering is a popular clustering method where the expectation and maximization (EM) algorithm is the most used mixture likelihood method. In the literature, the EM algorithm had been used for mixture regression models. However, these EM mixture regression algorithms are sensitive to initial values with a priori number of clusters. In this paper, to resolve these drawbacks, we construct a learning-based schema for the EM mixture regression algorithm such that it is free of initializations and can automatically obtain an approximately optimal number of clusters. Some numerical examples and comparisons demonstrate the superiority and usefulness of the proposed learning-based EM mixture regression algorithm.

Keywords: clustering, EM algorithm, Gaussian mixture model, mixture regression model

Procedia PDF Downloads 468
36592 Knowledge Representation Based on Interval Type-2 CFCM Clustering

Authors: Lee Myung-Won, Kwak Keun-Chang

Abstract:

This paper is concerned with knowledge representation and extraction of fuzzy if-then rules using Interval Type-2 Context-based Fuzzy C-Means clustering (IT2-CFCM) with the aid of fuzzy granulation. This proposed clustering algorithm is based on information granulation in the form of IT2 based Fuzzy C-Means (IT2-FCM) clustering and estimates the cluster centers by preserving the homogeneity between the clustered patterns from the IT2 contexts produced in the output space. Furthermore, we can obtain the automatic knowledge representation in the design of Radial Basis Function Networks (RBFN), Linguistic Model (LM), and Adaptive Neuro-Fuzzy Networks (ANFN) from the numerical input-output data pairs. We shall focus on a design of ANFN in this paper. The experimental results on an estimation problem of energy performance reveal that the proposed method showed a good knowledge representation and performance in comparison with the previous works.

Keywords: IT2-FCM, IT2-CFCM, context-based fuzzy clustering, adaptive neuro-fuzzy network, knowledge representation

Procedia PDF Downloads 285
36591 Semi-Supervised Hierarchical Clustering Given a Reference Tree of Labeled Documents

Authors: Ying Zhao, Xingyan Bin

Abstract:

Semi-supervised clustering algorithms have been shown effective to improve clustering process with even limited supervision. However, semi-supervised hierarchical clustering remains challenging due to the complexities of expressing constraints for agglomerative clustering algorithms. This paper proposes novel semi-supervised agglomerative clustering algorithms to build a hierarchy based on a known reference tree. We prove that by enforcing distance constraints defined by a reference tree during the process of hierarchical clustering, the resultant tree is guaranteed to be consistent with the reference tree. We also propose a framework that allows the hierarchical tree generation be aware of levels of levels of the agglomerative tree under creation, so that metric weights can be learned and adopted at each level in a recursive fashion. The experimental evaluation shows that the additional cost of our contraint-based semi-supervised hierarchical clustering algorithm (HAC) is negligible, and our combined semi-supervised HAC algorithm outperforms the state-of-the-art algorithms on real-world datasets. The experiments also show that our proposed methods can improve clustering performance even with a small number of unevenly distributed labeled data.

Keywords: semi-supervised clustering, hierarchical agglomerative clustering, reference trees, distance constraints

Procedia PDF Downloads 493
36590 Using Genetic Algorithms and Rough Set Based Fuzzy K-Modes to Improve Centroid Model Clustering Performance on Categorical Data

Authors: Rishabh Srivastav, Divyam Sharma

Abstract:

We propose an algorithm to cluster categorical data named as ‘Genetic algorithm initialized rough set based fuzzy K-Modes for categorical data’. We propose an amalgamation of the simple K-modes algorithm, the Rough and Fuzzy set based K-modes and the Genetic Algorithm to form a new algorithm,which we hypothesise, will provide better Centroid Model clustering results, than existing standard algorithms. In the proposed algorithm, the initialization and updation of modes is done by the use of genetic algorithms while the membership values are calculated using the rough set and fuzzy logic.

Keywords: categorical data, fuzzy logic, genetic algorithm, K modes clustering, rough sets

Procedia PDF Downloads 201
36589 Mitigating the Negative Effect of Intrabrand Clustering: The Role of Interbrand Clustering and Firm Size

Authors: Moeen Naseer Butt

Abstract:

Clustering –geographic concentrations of entities– has recently received more attention in marketing research and has been shown to affect multiple outcomes. This study investigates the impact of intrabrand clustering (clustering of same-brand outlets) on an outlet’s quality performance. Further, it assesses the moderating effects of interbrand clustering (clustering of other-brand outlets) and firm size. An examination of approximately 21,000 food service establishments in New York State in 2019 finds that the impact of intrabrand clustering on an outlet’s quality performance is context-dependent. Specifically, intrabrand clustering decreases, whereas interbrand clustering and firm size help increase the outlet’s performance. Additionally, this study finds that the role of firm size is more substantial than interbrand clustering in mitigating the adverse effects of intrabrand clustering on outlet quality performance.

Keywords: intraband clustering, interbrand clustering, firm size, brand competition, outlet performance, quality violations

Procedia PDF Downloads 153
36588 Flowing Online Vehicle GPS Data Clustering Using a New Parallel K-Means Algorithm

Authors: Orhun Vural, Oguz Bayat, Rustu Akay, Osman N. Ucan

Abstract:

This study presents a new parallel approach clustering of GPS data. Evaluation has been made by comparing execution time of various clustering algorithms on GPS data. This paper aims to propose a parallel based on neighborhood K-means algorithm to make it faster. The proposed parallelization approach assumes that each GPS data represents a vehicle and to communicate between vehicles close to each other after vehicles are clustered. This parallelization approach has been examined on different sized continuously changing GPS data and compared with serial K-means algorithm and other serial clustering algorithms. The results demonstrated that proposed parallel K-means algorithm has been shown to work much faster than other clustering algorithms.

Keywords: parallel k-means algorithm, parallel clustering, clustering algorithms, clustering on flowing data

Procedia PDF Downloads 174
36587 Time Series Regression with Meta-Clusters

Authors: Monika Chuchro

Abstract:

This paper presents a preliminary attempt to apply classification of time series using meta-clusters in order to improve the quality of regression models. In this case, clustering was performed as a method to obtain a subgroups of time series data with normal distribution from inflow into waste water treatment plant data which Composed of several groups differing by mean value. Two simple algorithms: K-mean and EM were chosen as a clustering method. The rand index was used to measure the similarity. After simple meta-clustering, regression model was performed for each subgroups. The final model was a sum of subgroups models. The quality of obtained model was compared with the regression model made using the same explanatory variables but with no clustering of data. Results were compared by determination coefficient (R2), measure of prediction accuracy mean absolute percentage error (MAPE) and comparison on linear chart. Preliminary results allows to foresee the potential of the presented technique.

Keywords: clustering, data analysis, data mining, predictive models

Procedia PDF Downloads 427
36586 Probabilistic Graphical Model for the Web

Authors: M. Nekri, A. Khelladi

Abstract:

The world wide web network is a network with a complex topology, the main properties of which are the distribution of degrees in power law, A low clustering coefficient and a weak average distance. Modeling the web as a graph allows locating the information in little time and consequently offering a help in the construction of the research engine. Here, we present a model based on the already existing probabilistic graphs with all the aforesaid characteristics. This work will consist in studying the web in order to know its structuring thus it will enable us to modelize it more easily and propose a possible algorithm for its exploration.

Keywords: clustering coefficient, preferential attachment, small world, web community

Procedia PDF Downloads 228
36585 Multimodal Optimization of Density-Based Clustering Using Collective Animal Behavior Algorithm

Authors: Kristian Bautista, Ruben A. Idoy

Abstract:

A bio-inspired metaheuristic algorithm inspired by the theory of collective animal behavior (CAB) was integrated to density-based clustering modeled as multimodal optimization problem. The algorithm was tested on synthetic, Iris, Glass, Pima and Thyroid data sets in order to measure its effectiveness relative to CDE-based Clustering algorithm. Upon preliminary testing, it was found out that one of the parameter settings used was ineffective in performing clustering when applied to the algorithm prompting the researcher to do an investigation. It was revealed that fine tuning distance δ3 that determines the extent to which a given data point will be clustered helped improve the quality of cluster output. Even though the modification of distance δ3 significantly improved the solution quality and cluster output of the algorithm, results suggest that there is no difference between the population mean of the solutions obtained using the original and modified parameter setting for all data sets. This implies that using either the original or modified parameter setting will not have any effect towards obtaining the best global and local animal positions. Results also suggest that CDE-based clustering algorithm is better than CAB-density clustering algorithm for all data sets. Nevertheless, CAB-density clustering algorithm is still a good clustering algorithm because it has correctly identified the number of classes of some data sets more frequently in a thirty trial run with a much smaller standard deviation, a potential in clustering high dimensional data sets. Thus, the researcher recommends further investigation in the post-processing stage of the algorithm.

Keywords: clustering, metaheuristics, collective animal behavior algorithm, density-based clustering, multimodal optimization

Procedia PDF Downloads 188
36584 Decision Trees Constructing Based on K-Means Clustering Algorithm

Authors: Loai Abdallah, Malik Yousef

Abstract:

A domain space for the data should reflect the actual similarity between objects. Since objects belonging to the same cluster usually share some common traits even though their geometric distance might be relatively large. In general, the Euclidean distance of data points that represented by large number of features is not capturing the actual relation between those points. In this study, we propose a new method to construct a different space that is based on clustering to form a new distance metric. The new distance space is based on ensemble clustering (EC). The EC distance space is defined by tracking the membership of the points over multiple runs of clustering algorithm metric. Over this distance, we train the decision trees classifier (DT-EC). The results obtained by applying DT-EC on 10 datasets confirm our hypotheses that embedding the EC space as a distance metric would improve the performance.

Keywords: ensemble clustering, decision trees, classification, K nearest neighbors

Procedia PDF Downloads 153
36583 A Relative Entropy Regularization Approach for Fuzzy C-Means Clustering Problem

Authors: Ouafa Amira, Jiangshe Zhang

Abstract:

Clustering is an unsupervised machine learning technique; its aim is to extract the data structures, in which similar data objects are grouped in the same cluster, whereas dissimilar objects are grouped in different clusters. Clustering methods are widely utilized in different fields, such as: image processing, computer vision , and pattern recognition, etc. Fuzzy c-means clustering (fcm) is one of the most well known fuzzy clustering methods. It is based on solving an optimization problem, in which a minimization of a given cost function has been studied. This minimization aims to decrease the dissimilarity inside clusters, where the dissimilarity here is measured by the distances between data objects and cluster centers. The degree of belonging of a data point in a cluster is measured by a membership function which is included in the interval [0, 1]. In fcm clustering, the membership degree is constrained with the condition that the sum of a data object’s memberships in all clusters must be equal to one. This constraint can cause several problems, specially when our data objects are included in a noisy space. Regularization approach took a part in fuzzy c-means clustering technique. This process introduces an additional information in order to solve an ill-posed optimization problem. In this study, we focus on regularization by relative entropy approach, where in our optimization problem we aim to minimize the dissimilarity inside clusters. Finding an appropriate membership degree to each data object is our objective, because an appropriate membership degree leads to an accurate clustering result. Our clustering results in synthetic data sets, gaussian based data sets, and real world data sets show that our proposed model achieves a good accuracy.

Keywords: clustering, fuzzy c-means, regularization, relative entropy

Procedia PDF Downloads 233
36582 Finding Bicluster on Gene Expression Data of Lymphoma Based on Singular Value Decomposition and Hierarchical Clustering

Authors: Alhadi Bustaman, Soeganda Formalidin, Titin Siswantining

Abstract:

DNA microarray technology is used to analyze thousand gene expression data simultaneously and a very important task for drug development and test, function annotation, and cancer diagnosis. Various clustering methods have been used for analyzing gene expression data. However, when analyzing very large and heterogeneous collections of gene expression data, conventional clustering methods often cannot produce a satisfactory solution. Biclustering algorithm has been used as an alternative approach to identifying structures from gene expression data. In this paper, we introduce a transform technique based on singular value decomposition to identify normalized matrix of gene expression data followed by Mixed-Clustering algorithm and the Lift algorithm, inspired in the node-deletion and node-addition phases proposed by Cheng and Church based on Agglomerative Hierarchical Clustering (AHC). Experimental study on standard datasets demonstrated the effectiveness of the algorithm in gene expression data.

Keywords: agglomerative hierarchical clustering (AHC), biclustering, gene expression data, lymphoma, singular value decomposition (SVD)

Procedia PDF Downloads 241
36581 Investigation of Clustering Algorithms Used in Wireless Sensor Networks

Authors: Naim Karasekreter, Ugur Fidan, Fatih Basciftci

Abstract:

Wireless sensor networks are networks in which more than one sensor node is organized among themselves. The working principle is based on the transfer of the sensed data over the other nodes in the network to the central station. Wireless sensor networks concentrate on routing algorithms, energy efficiency and clustering algorithms. In the clustering method, the nodes in the network are divided into clusters using different parameters and the most suitable cluster head is selected from among them. The data to be sent to the center is sent per cluster, and the cluster head is transmitted to the center. With this method, the network traffic is reduced and the energy efficiency of the nodes is increased. In this study, clustering algorithms were examined in terms of clustering performances and cluster head selection characteristics to try to identify weak and strong sides. This work is supported by the Project 17.Kariyer.123 of Afyon Kocatepe University BAP Commission.

Keywords: wireless sensor networks (WSN), clustering algorithm, cluster head, clustering

Procedia PDF Downloads 465
36580 Generalization of Clustering Coefficient on Lattice Networks Applied to Criminal Networks

Authors: Christian H. Sanabria-Montaña, Rodrigo Huerta-Quintanilla

Abstract:

A lattice network is a special type of network in which all nodes have the same number of links, and its boundary conditions are periodic. The most basic lattice network is the ring, a one-dimensional network with periodic border conditions. In contrast, the Cartesian product of d rings forms a d-dimensional lattice network. An analytical expression currently exists for the clustering coefficient in this type of network, but the theoretical value is valid only up to certain connectivity value; in other words, the analytical expression is incomplete. Here we obtain analytically the clustering coefficient expression in d-dimensional lattice networks for any link density. Our analytical results show that the clustering coefficient for a lattice network with density of links that tend to 1, leads to the value of the clustering coefficient of a fully connected network. We developed a model on criminology in which the generalized clustering coefficient expression is applied. The model states that delinquents learn the know-how of crime business by sharing knowledge, directly or indirectly, with their friends of the gang. This generalization shed light on the network properties, which is important to develop new models in different fields where network structure plays an important role in the system dynamic, such as criminology, evolutionary game theory, econophysics, among others.

Keywords: clustering coefficient, criminology, generalized, regular network d-dimensional

Procedia PDF Downloads 365
36579 Performance Analysis of Hierarchical Agglomerative Clustering in a Wireless Sensor Network Using Quantitative Data

Authors: Tapan Jain, Davender Singh Saini

Abstract:

Clustering is a useful mechanism in wireless sensor networks which helps to cope with scalability and data transmission problems. The basic aim of our research work is to provide efficient clustering using Hierarchical agglomerative clustering (HAC). If the distance between the sensing nodes is calculated using their location then it’s quantitative HAC. This paper compares the various agglomerative clustering techniques applied in a wireless sensor network using the quantitative data. The simulations are done in MATLAB and the comparisons are made between the different protocols using dendrograms.

Keywords: routing, hierarchical clustering, agglomerative, quantitative, wireless sensor network

Procedia PDF Downloads 553
36578 Identifying Autism Spectrum Disorder Using Optimization-Based Clustering

Authors: Sharifah Mousli, Sona Taheri, Jiayuan He

Abstract:

Autism spectrum disorder (ASD) is a complex developmental condition involving persistent difficulties with social communication, restricted interests, and repetitive behavior. The challenges associated with ASD can interfere with an affected individual’s ability to function in social, academic, and employment settings. Although there is no effective medication known to treat ASD, to our best knowledge, early intervention can significantly improve an affected individual’s overall development. Hence, an accurate diagnosis of ASD at an early phase is essential. The use of machine learning approaches improves and speeds up the diagnosis of ASD. In this paper, we focus on the application of unsupervised clustering methods in ASD as a large volume of ASD data generated through hospitals, therapy centers, and mobile applications has no pre-existing labels. We conduct a comparative analysis using seven clustering approaches such as K-means, agglomerative hierarchical, model-based, fuzzy-C-means, affinity propagation, self organizing maps, linear vector quantisation – as well as the recently developed optimization-based clustering (COMSEP-Clust) approach. We evaluate the performances of the clustering methods extensively on real-world ASD datasets encompassing different age groups: toddlers, children, adolescents, and adults. Our experimental results suggest that the COMSEP-Clust approach outperforms the other seven methods in recognizing ASD with well-separated clusters.

Keywords: autism spectrum disorder, clustering, optimization, unsupervised machine learning

Procedia PDF Downloads 70
36577 Method of Cluster Based Cross-Domain Knowledge Acquisition for Biologically Inspired Design

Authors: Shen Jian, Hu Jie, Ma Jin, Peng Ying Hong, Fang Yi, Liu Wen Hai

Abstract:

Biologically inspired design inspires inventions and new technologies in the field of engineering by mimicking functions, principles, and structures in the biological domain. To deal with the obstacles of cross-domain knowledge acquisition in the existing biologically inspired design process, functional semantic clustering based on functional feature semantic correlation and environmental constraint clustering composition based on environmental characteristic constraining adaptability are proposed. A knowledge cell clustering algorithm and the corresponding prototype system is developed. Finally, the effectiveness of the method is verified by the visual prosthetic device design.

Keywords: knowledge clustering, knowledge acquisition, knowledge based engineering, knowledge cell, biologically inspired design

Procedia PDF Downloads 393
36576 Power Iteration Clustering Based on Deflation Technique on Large Scale Graphs

Authors: Taysir Soliman

Abstract:

One of the current popular clustering techniques is Spectral Clustering (SC) because of its advantages over conventional approaches such as hierarchical clustering, k-means, etc. and other techniques as well. However, one of the disadvantages of SC is the time consuming process because it requires computing the eigenvectors. In the past to overcome this disadvantage, a number of attempts have been proposed such as the Power Iteration Clustering (PIC) technique, which is one of versions from SC; some of PIC advantages are: 1) its scalability and efficiency, 2) finding one pseudo-eigenvectors instead of computing eigenvectors, and 3) linear combination of the eigenvectors in linear time. However, its worst disadvantage is an inter-class collision problem because it used only one pseudo-eigenvectors which is not enough. Previous researchers developed Deflation-based Power Iteration Clustering (DPIC) to overcome problems of PIC technique on inter-class collision with the same efficiency of PIC. In this paper, we developed Parallel DPIC (PDPIC) to improve the time and memory complexity which is run on apache spark framework using sparse matrix. To test the performance of PDPIC, we compared it to SC, ESCG, ESCALG algorithms on four small graph benchmark datasets and nine large graph benchmark datasets, where PDPIC proved higher accuracy and better time consuming than other compared algorithms.

Keywords: spectral clustering, power iteration clustering, deflation-based power iteration clustering, Apache spark, large graph

Procedia PDF Downloads 144
36575 Feature Weighting Comparison Based on Clustering Centers in the Detection of Diabetic Retinopathy

Authors: Kemal Polat

Abstract:

In this paper, three feature weighting methods have been used to improve the classification performance of diabetic retinopathy (DR). To classify the diabetic retinopathy, features extracted from the output of several retinal image processing algorithms, such as image-level, lesion-specific and anatomical components, have been used and fed them into the classifier algorithms. The dataset used in this study has been taken from University of California, Irvine (UCI) machine learning repository. Feature weighting methods including the fuzzy c-means clustering based feature weighting, subtractive clustering based feature weighting, and Gaussian mixture clustering based feature weighting, have been used and compered with each other in the classification of DR. After feature weighting, five different classifier algorithms comprising multi-layer perceptron (MLP), k- nearest neighbor (k-NN), decision tree, support vector machine (SVM), and Naïve Bayes have been used. The hybrid method based on combination of subtractive clustering based feature weighting and decision tree classifier has been obtained the classification accuracy of 100% in the screening of DR. These results have demonstrated that the proposed hybrid scheme is very promising in the medical data set classification.

Keywords: machine learning, data weighting, classification, data mining

Procedia PDF Downloads 294
36574 Improved K-Means Clustering Algorithm Using RHadoop with Combiner

Authors: Ji Eun Shin, Dong Hoon Lim

Abstract:

Data clustering is a common technique used in data analysis and is used in many applications, such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. K-means clustering is a well-known clustering algorithm aiming to cluster a set of data points to a predefined number of clusters. In this paper, we implement K-means algorithm based on MapReduce framework with RHadoop to make the clustering method applicable to large scale data. RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. The main idea is to introduce a combiner as a function of our map output to decrease the amount of data needed to be processed by reducers. The experimental results demonstrated that K-means algorithm using RHadoop can scale well and efficiently process large data sets on commodity hardware. We also showed that our K-means algorithm using RHadoop with combiner was faster than regular algorithm without combiner as the size of data set increases.

Keywords: big data, combiner, K-means clustering, RHadoop

Procedia PDF Downloads 390
36573 Method of Visual Prosthesis Design Based on Biologically Inspired Design

Authors: Shen Jian, Hu Jie, Zhu Guo Niu, Peng Ying Hong

Abstract:

There are two issues exited in the traditional visual prosthesis: lacking systematic method and the low level of humanization. To tackcle those obstacles, a visual prosthesis design method based on biologically inspired design is proposed. Firstly, a constrained FBS knowledge cell model is applied to construct the functional model of visual prosthesis in biological field. Then the clustering results of engineering domain are ob-tained with the use of the cross-domain knowledge cell clustering algorithm. Finally, a prototype system is designed to support the bio-logically inspired design where the conflict is digested by TRIZ and other tools, and the validity of the method is verified by the solution scheme

Keywords: knowledge-based engineering, visual prosthesis, biologically inspired design, biomedical engineering

Procedia PDF Downloads 150
36572 A Neural Network Based Clustering Approach for Imputing Multivariate Values in Big Data

Authors: S. Nickolas, Shobha K.

Abstract:

The treatment of incomplete data is an important step in the data pre-processing. Missing values creates a noisy environment in all applications and it is an unavoidable problem in big data management and analysis. Numerous techniques likes discarding rows with missing values, mean imputation, expectation maximization, neural networks with evolutionary algorithms or optimized techniques and hot deck imputation have been introduced by researchers for handling missing data. Among these, imputation techniques plays a positive role in filling missing values when it is necessary to use all records in the data and not to discard records with missing values. In this paper we propose a novel artificial neural network based clustering algorithm, Adaptive Resonance Theory-2(ART2) for imputation of missing values in mixed attribute data sets. The process of ART2 can recognize learned models fast and be adapted to new objects rapidly. It carries out model-based clustering by using competitive learning and self-steady mechanism in dynamic environment without supervision. The proposed approach not only imputes the missing values but also provides information about handling the outliers.

Keywords: ART2, data imputation, clustering, missing data, neural network, pre-processing

Procedia PDF Downloads 238
36571 Max-Entropy Feed-Forward Clustering Neural Network

Authors: Xiaohan Bookman, Xiaoyan Zhu

Abstract:

The outputs of non-linear feed-forward neural network are positive, which could be treated as probability when they are normalized to one. If we take Entropy-Based Principle into consideration, the outputs for each sample could be represented as the distribution of this sample for different clusters. Entropy-Based Principle is the principle with which we could estimate the unknown distribution under some limited conditions. As this paper defines two processes in Feed-Forward Neural Network, our limited condition is the abstracted features of samples which are worked out in the abstraction process. And the final outputs are the probability distribution for different clusters in the clustering process. As Entropy-Based Principle is considered into the feed-forward neural network, a clustering method is born. We have conducted some experiments on six open UCI data sets, comparing with a few baselines and applied purity as the measurement. The results illustrate that our method outperforms all the other baselines that are most popular clustering methods.

Keywords: feed-forward neural network, clustering, max-entropy principle, probabilistic models

Procedia PDF Downloads 398
36570 Using Closed Frequent Itemsets for Hierarchical Document Clustering

Authors: Cheng-Jhe Lee, Chiun-Chieh Hsu

Abstract:

Due to the rapid development of the Internet and the increased availability of digital documents, the excessive information on the Internet has led to information overflow problem. In order to solve these problems for effective information retrieval, document clustering in text mining becomes a popular research topic. Clustering is the unsupervised classification of data items into groups without the need of training data. Many conventional document clustering methods perform inefficiently for large document collections because they were originally designed for relational database. Therefore they are impractical in real-world document clustering and require special handling for high dimensionality and high volume. We propose the FIHC (Frequent Itemset-based Hierarchical Clustering) method, which is a hierarchical clustering method developed for document clustering, where the intuition of FIHC is that there exist some common words for each cluster. FIHC uses such words to cluster documents and builds hierarchical topic tree. In this paper, we combine FIHC algorithm with ontology to solve the semantic problem and mine the meaning behind the words in documents. Furthermore, we use the closed frequent itemsets instead of only use frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than those of well-known document clustering algorithms.

Keywords: FIHC, documents clustering, ontology, closed frequent itemset

Procedia PDF Downloads 360
36569 3D Mesh Coarsening via Uniform Clustering

Authors: Shuhua Lai, Kairui Chen

Abstract:

In this paper, we present a fast and efficient mesh coarsening algorithm for 3D triangular meshes. Theis approach can be applied to very complex 3D meshes of arbitrary topology and with millions of vertices. The algorithm is based on the clustering of the input mesh elements, which divides the faces of an input mesh into a given number of clusters for clustering purpose by approximating the Centroidal Voronoi Tessellation of the input mesh. Once a clustering is achieved, it provides us an efficient way to construct uniform tessellations, and therefore leads to good coarsening of polygonal meshes. With proliferation of 3D scanners, this coarsening algorithm is particularly useful for reverse engineering applications of 3D models, which in many cases are dense, non-uniform, irregular and arbitrary topology. Examples demonstrating effectiveness of the new algorithm are also included in the paper.

Keywords: coarsening, mesh clustering, shape approximation, mesh simplification

Procedia PDF Downloads 328
36568 A Hybrid Method for Determination of Effective Poles Using Clustering Dominant Pole Algorithm

Authors: Anuj Abraham, N. Pappa, Daniel Honc, Rahul Sharma

Abstract:

In this paper, an analysis of some model order reduction techniques is presented. A new hybrid algorithm for model order reduction of linear time invariant systems is compared with the conventional techniques namely Balanced Truncation, Hankel Norm reduction and Dominant Pole Algorithm (DPA). The proposed hybrid algorithm is known as Clustering Dominant Pole Algorithm (CDPA) is able to compute the full set of dominant poles and its cluster center efficiently. The dominant poles of a transfer function are specific eigenvalues of the state space matrix of the corresponding dynamical system. The effectiveness of this novel technique is shown through the simulation results.

Keywords: balanced truncation, clustering, dominant pole, Hankel norm, model reduction

Procedia PDF Downloads 562