Categorical Clustering By Converting Associated Information

Dongmin Cai; Stephen S-T Yau

Abstract:

The lack of an inherent "natural" dissimilarity measure between objects in a categorical dataset presents special difficulties in clustering analysis. However, each categorical attribute of a given dataset provides natural probabilities and information in the sense of Shannon. In this paper, we propose a novel method that heuristically converts categorical attributes to numerical values by exploiting this associated information. We conduct an experimental study with a real-life categorical dataset. The experiment demonstrates the effectiveness of our approach.
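
The abstract does not spell out the conversion itself, so the following is only a minimal sketch of the general idea, under the assumption that each category value is replaced by its Shannon self-information, -log2 p, with p estimated from the value's empirical frequency within its attribute. The function names `self_information_encoding` and `convert_dataset` and the toy rows are hypothetical and are not taken from the paper; the authors' actual heuristic may differ.

```python
import math
from collections import Counter


def self_information_encoding(column):
    """Map each category of one attribute to -log2 of its empirical frequency.

    Assumption for illustration: the numerical value of a category is its
    Shannon self-information. This is not confirmed to be the paper's rule.
    """
    counts = Counter(column)
    total = len(column)
    return {value: -math.log2(count / total) for value, count in counts.items()}


def convert_dataset(rows):
    """Convert equal-length categorical rows to numerical rows, attribute by attribute."""
    columns = list(zip(*rows))  # transpose: one tuple per attribute
    encodings = [self_information_encoding(col) for col in columns]
    return [[enc[v] for enc, v in zip(encodings, row)] for row in rows]


# Toy data (hypothetical): two categorical attributes, four objects.
rows = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]
numeric_rows = convert_dataset(rows)
for original, converted in zip(rows, numeric_rows):
    print(original, [round(x, 3) for x in converted])
```

Once the attributes are numerical, any standard numerical clustering method can be applied to `numeric_rows`. Note that under this particular assumption, two distinct categories with the same frequency receive the same value, which is one reason a practical conversion may need additional information beyond frequencies alone.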

Keywords: Categorical, Clustering, Converting, Information

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1075769
