A Text Clustering System based on k-means Type Subspace Clustering and Ontology

Liping Jing; Michael K. Ng; Xinhua Yang; Joshua Zhexue Huang

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

A Text Clustering System based on k-means Type Subspace Clustering and Ontology

Authors: Liping Jing, Michael K. Ng, Xinhua Yang, Joshua Zhexue Huang

Abstract:

This paper presents a text clustering system developed based on a k-means type subspace clustering algorithm to cluster large, high dimensional and sparse text data. In this algorithm, a new step is added in the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by the weight values. For understanding and interpretation of clustering results, a few keywords that can best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to their weights calculated by our new algorithm. Then, the candidates are fed to the WordNet to identify the set of noun words and consolidate the synonymy and hyponymy words. Experimental results have shown that the clustering algorithm is superior to the other subspace clustering algorithms, such as PROCLUS and HARP and kmeans type algorithm, e.g., Bisecting-KMeans. Furthermore, the word extraction method is effective in selection of the words to represent the topics of the clusters.

Keywords: Subspace Clustering, Text Mining, Feature Weighting, Cluster Interpretation, Ontology

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1057143

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2465

References:

[1] "Textminer." (Online). Available: http://www.sas.com/technologies/analyticsdatamining/textminer/
[2] "Textanalyst." (Online). Available: http://www.megaputer.com/products/ta/index.php3
[3] "gcluto." (Online). Available: http://www-users.cs.umn.edu/ karypis/cluto/gcluto/
[4] "Refviz." (Online). Available: http://www.refviz.com
[5] W. Fan, L. Wallace, S. Rich, and Z. Zhang, "Tapping into the power of text mining," the Communications of ACM, 2005.
[6] B. Ricardo and R. PBerthier, Modern information retrieval, 1999.
[7] A. Hotho, A. Maedche, and S. Staab, "Ontology-based text document clustering," Seattle, USA, 2001.
[8] L. Jing, M. Ng, J. Xu, and Z. Huang, "Subspace clustering of text documents with feature weighting k-means algorithm," 2005, pp. 802- 812.
[9] C. Fellbaum, "Wordnet: an electronic lexical databases," The MIT Press, 1998.
[10] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, "Fast algorithms for projected clustering," Proc. ACM SIGMOD, pp. 61-72, 1999.
[11] K. Y. Yip, D.W. Cheung, and M. K. Ng, "A practical projected clustering algorithm," IEEE Transactions on knowledge and data engineering, vol. 16, no. 11, pp. 1387-1397, 2004.
[12] "20-newsgroups." (Online). Available: http://kdd.ics.uci.edu/databases/20newsgroups /20newsgroups.html
[13] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering : A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[14] J. Han, M. Kamber, and A. Tung, "Spatial clustering methods in data mining: a survey," Geographic Data Mining and Knowledge Discovery, 2001.
[15] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," SIGKDD Explorations, vol. 6, no. 1, pp. 90-105, 2004.
[16] J. MacQueen, "Some methods for classification and analysis of multivariate observations," 1967, pp. 281-297.
[17] S. Dhillon, J. Fan, and Y. Guan, "Efficient clustering of very large document collections," Data mining for scientific and engineering applications, pp. 357-381, 2001.
[18] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," 2000.
[19] O. Duda, E. Hart, and G. Stork, Pattern classification, 2000.
[20] Y. Zhao and G. Karypis, "Comparison of agglomerative and partitional document clustering algorithms," Technical report 02-014, University of Minnesota, 2002.
[21] C. Aggarwal and P. Yu, "Finding generalized projected clusters in high dimensional spaces," Proc. ACM SIGMOD, pp. 70-81, 2000.
[22] H. Frigui and O. Nasraoui, "Unsupervised lerning of prototypes and attribute weights," Pattern recognition, vol. 37, no. 3, pp. 567-581, 2004.
[23] Y. Chan, K. Ching, K. Ng, and Z. Huang, "An optimization algorithm for clustering using weighted dissimilarity measures," Pattern recognition, vol. 37, no. 5, pp. 943-952, 2004.
[24] B. M.W., S. Dumais, and G. O-Brien, "Using linear algebra for intelligent information retrieval," SIAM Review, vol. 37, pp. 573-595, 1995.
[25] I. Herman, G. Melancon, and M. Marshall, "Graph visualization and navigation in information visualization: a survey," IEEE Transactions on Visualization and Computer Graphics, vol. 6, no. 1, pp. 24-43, 2000.
[26] S. Dhillon and S. Modha, "Concept decompositions for large sparse text data using clustering," Machine learning, vol. 42, no. 1, pp. 143-175, 2001.
[27] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional word clusters vs. words for text categorization," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1183-1208, 2003.
[28] A. McCallum, "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996. (Online). Available: http://www.cs.cmu.edu/mccallum/bow
[29] J. Bezdek, "A convergence theorem for the fuzzy isodata clustering algorithms," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 2, pp. 1-8, 1980.
[30] P. Hague and P. Harris, Sampling and statistics, London, 1993.
[31] S. Selim and M. Ismail, "k-means type algorithms: a generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 1, pp. 81-87, 1984.
[32] W. Li, "Random texts exhibit zipf-s-law-like word frequency distribution," IEEE Transactions on Information Theory, vol. 38, no. 6, pp. 1842-1845, 1992.
[33] Z. Shi and J. Ghosh, "A comparative study of generative models for document clustering," San Francisco, CA, May, 2003.
[34] I. Katsavounidis, C. Kuo, and Z. Zhang, "A new initialization technique for generalized lioyd iteration," IEEE signal proceeding, Letters 1(10), pp. 144-146, 1994.
[35] D. Swanson, "Medical literature as a potential source of new knowledge," Bull Med Libr Assoc., vol. 78, no. 1, pp. 29-37, 1990.