Incremental Algorithm to Cluster the Categorical Data with Frequency Based Similarity Measure

Authors: S. Aranganayagi, K. Thangavel


Clustering categorical data is more complicated than clustering numerical data because of the special properties of categorical attributes. Scalability and memory constraints are the main challenges in clustering large data sets. This paper presents an incremental algorithm for clustering categorical data. The frequencies of attribute values contribute substantially to grouping similar categorical objects. We propose new similarity measures based on the frequencies of attribute values and their cardinalities. The proposed measures and the algorithm are evaluated on data sets from the UCI data repository. The results show that the proposed method generates better clusters than the existing one.
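To make the general idea concrete, the following is a minimal sketch of single-pass (incremental) clustering driven by attribute-value frequencies. It is not the similarity measure proposed in the paper, which the abstract does not specify: the average relative frequency used in `similarity`, the `threshold` parameter, and the names `Cluster` and `incremental_cluster` are all assumptions made for illustration, and domain cardinalities are not used here.

```python
from collections import Counter


class Cluster:
    """Summary of a cluster: per-attribute counts of the values seen so far."""

    def __init__(self, n_attrs):
        self.size = 0
        self.counts = [Counter() for _ in range(n_attrs)]

    def add(self, obj):
        # Update the summary only; the objects themselves need not be stored.
        self.size += 1
        for counter, value in zip(self.counts, obj):
            counter[value] += 1

    def similarity(self, obj):
        # Assumed measure: average relative frequency, within this cluster,
        # of each attribute value carried by the object (range 0.0 to 1.0).
        total = sum(c[v] / self.size for c, v in zip(self.counts, obj))
        return total / len(obj)


def incremental_cluster(objects, threshold):
    """Assign each object, in one pass, to its most similar cluster.

    A new cluster is started when no existing cluster reaches `threshold`
    (a hypothetical tuning parameter).
    """
    clusters, labels = [], []
    for obj in objects:
        best, best_sim = None, -1.0
        for idx, cluster in enumerate(clusters):
            sim = cluster.similarity(obj)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < threshold:
            clusters.append(Cluster(len(obj)))
            best = len(clusters) - 1
        clusters[best].add(obj)
        labels.append(best)
    return labels, clusters


data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y"), ("b", "y")]
labels, clusters = incremental_cluster(data, threshold=0.5)
print(labels)  # [0, 0, 1, 0, 1]
```

Because each cluster is represented only by its value counts, memory grows with the number of clusters and distinct attribute values rather than with the number of objects, which is the property that makes a one-pass scheme of this kind attractive for large data sets.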

Keywords: Clustering, Domain, Frequency, Incremental, Categorical


