Information Retrieval: A Comparative Study of Textual Indexing Using an Oriented Object Database (db4o) and the Inverted File
Authors: Mohammed Erritali
Abstract:
The growth in the volume of text data, such as the books and articles that libraries have accumulated over centuries, has made it necessary to establish effective mechanisms for locating them. Early techniques such as abstracting, indexing, and the use of classification categories marked the birth of a new field of research called "Information Retrieval". Information Retrieval (IR) can be defined as the task of defining models and systems that facilitate access to a set of documents in electronic form (a corpus), so that a user can find the relevant ones, that is, the content that matches the user's information need. Most information retrieval models index a corpus with a specific data structure called the "inverted file" or "inverted index". For each term in the corpus, the inverted file records information such as the identifiers of the documents that contain the term, the frequency of the term in those documents, and the positions of its occurrences. In this paper we use an object-oriented database (db4o) instead of the inverted file: rather than searching for a term in the inverted file, we search for it in the db4o database. The purpose of this work is a comparative study to determine whether object-oriented databases can compete with the inverted index in terms of access speed and resource consumption on a large volume of data.
Keywords: Information Retrieval, indexing, object-oriented database (db4o), inverted file.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1337851
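As an illustration of the inverted file described in the abstract, the following minimal Python sketch stores, for each term, the identifiers of the documents that contain it together with the positions of its occurrences, from which the per-document frequency follows. It is not the paper's implementation: the function names (tokenize, build_inverted_index, lookup), the whitespace tokenizer, and the three-document corpus are all assumptions made for the example, and no db4o calls are shown.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(corpus):
    """Build term -> {doc_id: [positions]} from a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for position, term in enumerate(tokenize(text)):
            # Record every occurrence so frequency is len(positions).
            index[term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, term):
    """Return (doc_id, frequency, positions) for each document containing term."""
    postings = index.get(term.lower(), {})
    return [(doc_id, len(pos), pos) for doc_id, pos in sorted(postings.items())]


# Hypothetical toy corpus, for illustration only.
corpus = {
    1: "information retrieval uses an inverted file",
    2: "an object database can store the same information",
    3: "the inverted file maps each term to its documents",
}
index = build_inverted_index(corpus)
print(lookup(index, "inverted"))
```

In an object-database variant of this idea, each term's postings would presumably be stored as a persistent object and retrieved by a query on the term, which is the kind of access the comparison in the paper measures against a direct lookup like the one above.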