Advanced Information Extraction with n-gram based LSI

Authors: Ahmet Güven, Ö. Özgür Bozkurt, Oya Kalıpsız


The number of documents being created is growing at an accelerating pace, yet most of them cover already known topics and few introduce new concepts. This has opened a new era in information retrieval (IR), one with its own requirements: digging into topics and concepts to find subtopics or relations between topics. Until now, IR research has focused on retrieving documents about a general topic or clustering documents under generic subjects. These conventional approaches, however, cannot go deep into the content of documents, which makes it difficult for people to reach the documents they are looking for. We therefore need new ways of mining document sets in which the critical point is knowing as much as possible about the contents of the documents. As a solution, we propose enhancing LSI, one of the proven IR techniques, by supporting its vector space with n-gram forms of words. The positive results we obtained are shown in two application areas of the IR domain: querying a document database and clustering the documents in that database.

Keywords: Document clustering, Information Extraction, Information Retrieval, LSI, n-gram.
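
The idea of supporting the LSI vector space with n-gram forms of words can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the n-grams are taken to be character trigrams of each word, features are raw counts with no weighting, and the truncated SVD behind LSI is approximated by power iteration on the document Gram matrix so that the example needs only the Python standard library. All function names (`char_ngrams`, `featurize`, `top_eigenpairs`, `build_lsi`, `rank`) are hypothetical.

```python
import math
import random
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with '_' to mark word boundaries."""
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def featurize(text, n=3):
    """Count whole words plus their character n-grams (assumed feature set)."""
    counts = Counter()
    for word in text.lower().split():
        counts[("word", word)] += 1
        for gram in char_ngrams(word, n):
            counts[("gram", gram)] += 1
    return counts


def top_eigenpairs(gram, k, iters=200, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration + deflation."""
    rng = random.Random(seed)
    size = len(gram)
    mat = [row[:] for row in gram]
    pairs = []
    for _ in range(k):
        vec = [rng.gauss(0, 1) for _ in range(size)]
        value = 0.0
        for _ in range(iters):
            nxt = [sum(mat[i][j] * vec[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
            value = norm
        if value < 1e-12:
            break  # remaining spectrum is numerically zero
        pairs.append((value, vec))
        # Deflate so the next pass converges to the next eigenvector.
        for i in range(size):
            for j in range(size):
                mat[i][j] -= value * vec[i] * vec[j]
    return pairs


def build_lsi(docs, k=2, n=3):
    """Return the top-k left singular vectors of the term-document matrix A."""
    doc_feats = [featurize(d, n) for d in docs]
    # Gram matrix G = A^T A; its eigenpairs give A's singular values and V.
    gram = [[float(sum(a[f] * b[f] for f in a if f in b)) for b in doc_feats]
            for a in doc_feats]
    basis = []
    for value, vec in top_eigenpairs(gram, k):
        sigma = math.sqrt(value)
        u = {}  # left singular vector u = A v / sigma, stored sparsely
        for weight, feats in zip(vec, doc_feats):
            for f, c in feats.items():
                u[f] = u.get(f, 0.0) + c * weight / sigma
        basis.append(u)
    return basis, doc_feats


def project(feats, basis):
    """Coordinates of a feature vector in the latent space (U_k^T x)."""
    return [sum(c * u.get(f, 0.0) for f, c in feats.items()) for u in basis]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, k=2, n=3):
    """Rank documents by latent-space cosine similarity to the query."""
    basis, doc_feats = build_lsi(docs, k, n)
    q = project(featurize(query, n), basis)
    scores = [(cosine(q, project(f, basis)), i) for i, f in enumerate(doc_feats)]
    return sorted(scores, reverse=True)
```

Calling `rank("clustering documents", docs)` on a list of strings returns `(score, document_index)` pairs, best match first. Under the trigram assumption, inflected forms such as "cluster" and "clustering" share features like `_cl` and `ter`, so they contribute to the same latent dimensions even without a stemmer. The same latent coordinates returned by `project` could also serve as input to a clustering algorithm.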


