Clustering Unstructured Text Documents Using Fading Function

Authors: Pallav Roxy, Durga Toshniwal

Abstract:

Clustering unstructured text documents is an important problem in the data mining community, with applications such as document archive filtering, document organization, and topic detection and subject tracing. In the real world, some already clustered documents may lose importance over time, while new documents of greater significance may emerge. Most existing work on clustering unstructured text documents overlooks this aspect. This paper addresses the issue by using a Fading Function. The unstructured text documents are clustered, and for each cluster a statistics structure called a Cluster Profile (CP) is maintained. The Cluster Profile incorporates the Fading Function, which keeps track of the time-dependent importance of the cluster. The work proposes a novel algorithm, the Clustering n-ary Merge Algorithm (CnMA), for unstructured text documents, which uses the Cluster Profile and the Fading Function. Experimental results illustrating the effectiveness of the proposed technique are also included.
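
The abstract does not give the form of the Fading Function or the fields of the Cluster Profile, so the following is only a minimal sketch of the general idea, not the authors' CnMA. It assumes an exponential fading function of the form 2^(-decay_rate * elapsed), a common choice in stream clustering, and a hypothetical `ClusterProfile` class that stores a faded document count and faded term weights. All names and parameters here are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


def fading(elapsed: float, decay_rate: float) -> float:
    """Assumed exponential fading weight: 2^(-decay_rate * elapsed)."""
    return 2.0 ** (-decay_rate * elapsed)


@dataclass
class ClusterProfile:
    """Hypothetical per-cluster statistics with time-dependent weights."""
    decay_rate: float
    last_update: float = 0.0
    weight: float = 0.0  # faded number of documents in the cluster
    term_weights: Counter = field(default_factory=Counter)

    def _fade_to(self, now: float) -> None:
        # Scale all stored statistics by the fading factor for the
        # time elapsed since the last update.
        factor = fading(now - self.last_update, self.decay_rate)
        self.weight *= factor
        for term in self.term_weights:
            self.term_weights[term] *= factor
        self.last_update = now

    def add_document(self, terms: list[str], now: float) -> None:
        # Fade the old statistics first, then add the new document
        # with full weight 1.
        self._fade_to(now)
        self.weight += 1.0
        self.term_weights.update(terms)

    def current_weight(self, now: float) -> float:
        # Importance of the cluster at time `now`, without mutating state.
        return self.weight * fading(now - self.last_update, self.decay_rate)
```

Under these assumptions, a cluster that receives no new documents sees its weight halve every 1/decay_rate time units, so stale clusters gradually lose importance while recently updated ones keep theirs.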

Keywords: Clustering, Text Mining, Unstructured Text Documents, Fading Function.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1079444

