Using Suffix Tree Document Representation in Hierarchical Agglomerative Clustering

Daniel I. Morariu; Radu G. Cretulescu; Lucian N. Vintan

Abstract:

In text categorization, the most widely used document representation is the Vector Space Model (VSM), which is based on word frequency vectors. This representation relies only on the words in a document and therefore loses any “word context” information the document contains. In this article we compare this classical representation with the Suffix Tree Document Model (STDM), which represents documents in suffix tree format. For the STDM we propose a new approach to document representation and a new formula for computing the similarity between two documents. Specifically, we propose building the suffix tree for only two documents at a time. This approach is faster, consumes less memory, and uses the entire document representation without resorting to methods for discarding nodes. The proposed similarity formula substantially improves clustering quality. The representation method was validated using Hierarchical Agglomerative Clustering (HAC). In this context we also examine the influence of stemming in the document preprocessing step and highlight the difference between using similarity and dissimilarity measures to find “closer” documents.

Keywords: Text Clustering, Suffix tree document representation, Hierarchical Agglomerative Clustering
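
The pairwise idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state their similarity formula, so a Jaccard-style overlap of trie nodes is used here as a stand-in, and an uncompressed word-level suffix trie (stored as a set of word sequences) replaces a true suffix tree. The names `suffix_trie_nodes`, `pair_similarity`, `hac`, and `max_depth` are hypothetical, and average linkage is an assumed choice.

```python
from itertools import combinations


def suffix_trie_nodes(words, max_depth=None):
    """Return the nodes of a word-level suffix trie as a set of word tuples.

    Each node is identified by the word sequence on the path from the root.
    max_depth=None keeps every suffix in full (no nodes are discarded).
    """
    nodes = set()
    n = len(words)
    for start in range(n):
        end_limit = n if max_depth is None else min(n, start + max_depth)
        for end in range(start + 1, end_limit + 1):
            nodes.add(tuple(words[start:end]))
    return nodes


def pair_similarity(doc_a, doc_b, max_depth=None):
    """Assumed stand-in measure: shared trie nodes over all trie nodes.

    Only the two documents being compared are indexed, mirroring the
    "two documents at a time" idea; this is not the paper's formula.
    """
    a = suffix_trie_nodes(doc_a.lower().split(), max_depth)
    b = suffix_trie_nodes(doc_b.lower().split(), max_depth)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hac(docs, n_clusters, max_depth=None):
    """Average-linkage agglomerative clustering on pairwise similarities."""
    sim = {
        (i, j): pair_similarity(docs[i], docs[j], max_depth)
        for i, j in combinations(range(len(docs)), 2)
    }
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Mean similarity over all cross-cluster document pairs.
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two most similar clusters (x < y by construction).
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


docs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
print(hac(docs, 2))
print(pair_similarity("dog bites man", "man bites dog"))
```

The last call shows what “word context” contributes: the two sentences contain the same words, so a plain word frequency vector cannot tell them apart, while the trie nodes for multi-word sequences differ and lower the score. Note that keeping every suffix in full produces a number of nodes that grows quadratically with document length in this uncompressed form; `max_depth` is offered only as an optional cap for the sketch.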

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1334383
