Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure
Authors: S. Aranganayagi, K. Thangavel
Abstract:
K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The dissimilarity measure proposed by Huang is the simple matching (match or mismatch) measure. Because the weights of attribute values contribute substantially to clustering, in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of the frequency of attribute values in the cluster and in the data set. The new weighted measure is evaluated on data sets obtained from the UCI data repository. The results are compared with those of K-Modes and K-representative and show that the new measure generates clusters with high purity.
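As a rough illustration of the ideas named in the abstract, the following minimal sketch shows Huang-style simple matching, the per-attribute mode of a cluster, and one possible frequency-ratio weighting. The abstract does not give the proposed formula, so `weighted_dissimilarity` is a hypothetical stand-in: it assumes a mismatch costs 1 and a match costs 1 minus the ratio of the value's count in the cluster to its count in the data set. All function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def simple_matching(x, mode):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(1 for a, b in zip(x, mode) if a != b)


def cluster_mode(rows):
    """Most frequent value of each attribute among the rows of one cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def weighted_dissimilarity(x, mode, cluster_rows, all_rows):
    """Hypothetical weighting, NOT the paper's formula (assumption).

    A mismatch costs 1. A match costs 1 - (count of the value in the
    cluster) / (count of the value in the whole data set), so a match on
    a value concentrated in this cluster is cheaper than a match on a
    value spread across the data set.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, mode)):
        if a != b:
            total += 1.0
            continue
        in_cluster = sum(1 for r in cluster_rows if r[j] == a)
        in_data = sum(1 for r in all_rows if r[j] == a)
        # Guard against a value absent from the data set.
        total += 1.0 - (in_cluster / in_data if in_data else 0.0)
    return total


# Toy data set: two categorical attributes (colour, size).
all_rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
    ("blue", "small"),
]
cluster_rows = all_rows[:3]
mode = cluster_mode(cluster_rows)

for x in [("red", "small"), ("blue", "small")]:
    print(x, simple_matching(x, mode),
          round(weighted_dissimilarity(x, mode, cluster_rows, all_rows), 3))
```

In this sketch the object `("red", "small")` matches the mode on both attributes, so simple matching gives 0, while the weighted variant still assigns a small cost because "small" also occurs outside the cluster. This is only meant to show how frequency information can refine a 0/1 measure.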
Keywords: Clustering, categorical data, K-Modes, weighted dissimilarity measure
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1070405
References:
[1] Arun K. Pujari, "Data Mining Techniques", Universities Press, 2001.
[2] Daniel Barbara, Julia Couto, Yi Li, "COOLCAT: An entropy-based algorithm for categorical clustering", Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, 582-589.
[3] Dae-Won Kim, Kwang H. Lee, Doheon Lee, "Fuzzy clustering of categorical data using centroids", Pattern Recognition Letters 25, Elsevier, 2004, 1263-1271.
[4] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", IEEE Computer, 1999.
[5] Jiawei Han, Micheline Kamber, "Data Mining Concepts and Techniques", Harcourt India Private Limited, 2001.
[6] Ohn Mar San, Van-Nam Huynh, Yoshiteru Nakamori, "An Alternative Extension of the K-Means Algorithm for Clustering Categorical Data", Int. J. Appl. Math. Comput. Sci., Vol. 14, No. 2, 2004, 241-247.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", Technical report, Accrue Software, 2002.
[8] Periklis Andritsos, "Clustering Categorical Data Based on Information Loss Minimization", EDBT 2004: 123-146.
[9] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", ICDE '99: Proceedings of the 15th International Conference on Data Engineering, 512, IEEE Computer Society, Washington, DC, USA, 1999.
[10] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries", In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, San Diego, CA, USA.
[11] www.ics.uci.edu/~mlearn/MLRepository.html
[12] Zengyou He, Xiaofei Xu, Shengchun Deng, "Squeezer: An Efficient algorithm for clustering categorical data", Journal of Computer Science and Technology, Volume 17 Issue 5, Editorial Universitaria de Buenos Aires, 2002.
[13] Zengyou He, Xiaofei Xu, Shengchun Deng, Bin Dong, "K-Histograms: An Efficient Algorithm for Categorical Data set", www.citebase.org.
[14] Zhexue Huang, "A Fast Clustering Algorithm to Cluster Very Large Categorical Datasets in Data Mining", In Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[15] Zhexue Huang, "Extensions to the K-means algorithm for clustering Large Data sets with categorical value", Data Mining and Knowledge Discovery 2, Kluwer Academic Publishers, 1998, 283-304.