A Similarity Measure for Clustering and its Applications

Guadalupe J. Torres; Ram B. Basnet; Andrew H. Sung; Srinivas Mukkamala; Bernardete M. Ribeiro

Abstract:

This paper introduces a measure of similarity between two clusterings of the same dataset, whether produced by two different algorithms or by the same algorithm (K-means, for instance, usually produces different results on the same dataset when run with different initializations). We then apply the measure to calculate the similarity between pairs of clusterings, with special interest in comparing various machine clusterings against a human clustering of the same dataset. The measure can thus be used to identify the best clustering algorithm for a specific problem, where "best" means most similar to the human clustering. Experimental results are presented for a text categorization problem on a Portuguese corpus (using a translation-into-English approach) and for the well-known benchmark IRIS dataset. The significance and other potential applications of the proposed measure are discussed.
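
The abstract does not define the proposed measure, so the sketch below does not reproduce it. As a stand-in, it uses the well-known Rand index (the fraction of item pairs on which two clusterings agree) to illustrate the general idea of scoring machine clusterings against a human one. The function name `rand_index` and the toy label lists `human`, `run_1`, and `run_2` are assumptions made for illustration only.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index: fraction of item pairs treated the same way by both clusterings.

    This is a standard measure used here as a stand-in; it is NOT the
    measure proposed in the paper, which the abstract does not define.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: the clusterings trivially agree
    agree = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair counts as agreement if both clusterings group it together
        # or both keep it apart.
        if same_in_a == same_in_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical toy data: one human clustering and two machine clusterings.
human = [0, 0, 0, 1, 1, 1]
run_1 = [1, 1, 1, 0, 0, 0]  # same partition as `human`, cluster ids swapped
run_2 = [0, 0, 1, 1, 1, 1]  # one item assigned to the other cluster

scores = {"run_1": rand_index(human, run_1), "run_2": rand_index(human, run_2)}
best = max(scores, key=scores.get)  # clustering most similar to the human one
print(scores, best)
```

Because the score depends only on which items are grouped together, it is unaffected by how cluster ids are numbered, which is what makes it possible to compare runs of the same algorithm with different initializations.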

Keywords: Clustering Algorithms, Clustering Applications, Similarity Measures, Text Clustering

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1072529
