Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge

Authors: Hui Wang, Jun Liu, Lu Zhang, Chunping Li

Abstract:

Text similarity measurement is fundamental to many text applications, such as document clustering, classification, summarization, and question answering. However, prevailing approaches based on the Vector Space Model (VSM) suffer, to varying degrees, from the limitation of the Bag of Words (BOW) representation, which ignores the semantic relationships among words. Enriching document representations with background knowledge from Wikipedia has proven to be an effective remedy, but most existing methods still exhibit similar BOW flaws in the new vector space. In this paper, we propose a novel text similarity measure that goes beyond VSM and can find semantic affinity between documents. Specifically, it is a unified graph model that exploits Wikipedia as background knowledge and integrates document representation and similarity computation. Experimental results on two different datasets show that our approach significantly improves on VSM-based methods in both text clustering and classification.
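
The BOW limitation described above can be illustrated with a minimal sketch. It is not the graph model proposed in the paper; it only shows plain term-count cosine similarity, and the helper names (`bow_vector`, `cosine_similarity`) and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def bow_vector(text):
    """Lowercase, split on whitespace, and count terms (a plain bag of words)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents (not taken from the paper's datasets).
doc_a = "physician prescribes medicine"
doc_b = "doctor recommends drugs"       # related meaning, no shared words
doc_c = "physician prescribes drugs"    # shares two surface terms with doc_a

sim_ab = cosine_similarity(bow_vector(doc_a), bow_vector(doc_b))
sim_ac = cosine_similarity(bow_vector(doc_a), bow_vector(doc_c))

print(f"sim(a, b) = {sim_ab:.3f}")  # 0.000: semantically related, yet orthogonal
print(f"sim(a, c) = {sim_ac:.3f}")  # 0.667: similarity driven by term overlap only
```

Because the score depends only on shared surface terms, two documents that express the same idea in different words are treated as unrelated, which is the gap that background knowledge is meant to close.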

Keywords: text classification, text similarity, text clustering, Wikipedia

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1083541

