Commenced in January 2007
Frequency: Monthly
Edition: International
Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Authors: Semeh Ben Salem, Sami Naouali, Moetez Sallami

Abstract:

Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The dataset to be clustered can contain either categorical or numeric data, and each type of data has its own clustering algorithm: k-means for numeric datasets and k-modes for categorical datasets. A main problem encountered in data mining applications is clustering categorical data, which are so prevalent in datasets. One way to cluster categorical values is to transform the categorical attributes into numeric measures and apply the k-means algorithm directly instead of k-modes. This paper experiments with such an approach, transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method that transforms the categorical datasets into binary values. The scalability and accuracy of the two methods are evaluated experimentally. The results show that the proposed method outperforms the binary method in all cases.
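
The two transformations compared in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `relative_frequency_encode`, `binary_encode`, and `kmeans`, the toy dataset `rows`, and the deterministic centroid initialisation are all assumptions made for illustration. Each categorical value is replaced either by the relative frequency of its modality within its attribute, or by one binary indicator per modality, and a plain k-means is then run on the resulting numeric vectors.

```python
from collections import Counter


def relative_frequency_encode(rows):
    """Replace each value by the relative frequency of its modality in its column."""
    n = len(rows)
    counts = [Counter(col) for col in zip(*rows)]
    return [[counts[j][v] / n for j, v in enumerate(row)] for row in rows]


def binary_encode(rows):
    """Replace each value by one 0/1 indicator per modality of its column."""
    modalities = [sorted(set(col)) for col in zip(*rows)]
    return [
        [1.0 if v == m else 0.0 for j, v in enumerate(row) for m in modalities[j]]
        for row in rows
    ]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with squared Euclidean distance.

    Assumption for reproducibility: the first k distinct points are used
    as initial centroids instead of a random draw.
    """
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("fewer than k distinct points")

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy dataset with two categorical attributes (colour, size).
rows = [("red", "small"), ("red", "small"), ("red", "large"), ("blue", "small")]

encoded = relative_frequency_encode(rows)  # one numeric column per attribute
binary = binary_encode(rows)               # one numeric column per modality
labels, centroids = kmeans(encoded, k=2)
print(encoded)
print(binary)
print(labels)
```

In this sketch the relative-frequency encoding keeps one column per attribute, whereas the binary encoding needs one column per modality, so its dimensionality grows with the number of distinct values. Two modalities of the same attribute that happen to occur equally often receive the same numeric value under the relative-frequency encoding.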

Keywords: Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1130687

