Commenced in January 2007
Frequency: Monthly
Edition: International
Concept Indexing using Ontology and Supervised Machine Learning

Authors: Rossitza M. Setchi, Qiao Tang


Ontologies are currently the only widely accepted paradigm for managing sharable and reusable knowledge in a way that allows its automatic interpretation. They are created collaboratively across the Web and used to index, search and annotate documents. The vast majority of ontology-based approaches, however, focus on indexing texts at document level. With recent advances in ontological engineering, it has become clear that information indexing can benefit greatly from general-purpose ontologies, which aid the indexing of documents at word level. This paper presents a concept indexing algorithm that adds ontology information to words and phrases and allows full text to be searched, browsed and analyzed at different levels of abstraction. The algorithm uses a general-purpose ontology, OntoRo, and an ontologically tagged corpus, OntoCorp, both developed for this research. OntoRo and OntoCorp are used in a two-stage supervised machine learning process aimed at generating ontology tagging rules. The first experimental tests show a tagging accuracy of 78.91%, which is encouraging for the further improvement of the algorithm.

Keywords: Concepts, indexing, machine learning, ontology, tagging.
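
As an illustration of what indexing at word level with learned tagging rules can look like, the following is a minimal sketch and not the algorithm described in the paper. It assumes a toy tagged corpus of (word, concept) pairs, learns the simplest possible rule (each word takes its most frequent concept), and builds an index from concepts to token positions. The concept labels, the data and the function names `learn_baseline_rules` and `build_concept_index` are hypothetical; they are not taken from OntoRo or OntoCorp.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for an ontologically tagged corpus.
# The concept labels are invented for illustration only.
TAGGED_CORPUS = [
    ("bank", "finance"), ("bank", "finance"), ("bank", "river"),
    ("loan", "finance"), ("stream", "river"), ("money", "finance"),
]


def learn_baseline_rules(tagged_pairs):
    """Map each word to the concept it is most often tagged with."""
    counts = defaultdict(Counter)
    for word, concept in tagged_pairs:
        counts[word.lower()][concept] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def build_concept_index(tokens, rules):
    """Return a mapping from concept to the token positions tagged with it."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        concept = rules.get(token.lower())
        if concept is not None:  # words without a rule are left untagged
            index[concept].append(position)
    return dict(index)


rules = learn_baseline_rules(TAGGED_CORPUS)
tokens = "The bank approved the loan near the stream".split()
index = build_concept_index(tokens, rules)
print(index)
```

A query for the hypothetical concept `finance` would then retrieve the positions of both "bank" and "loan", which is the kind of search at a higher level of abstraction that the abstract refers to. A frequency-only rule ignores context, so it cannot tell the two senses of "bank" apart.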



