Language and Retrieval Accuracy

Ahmed Abdelali; Jim Cowie; Hamdy S. Soliman

Abstract:

One of the major challenges in Information Retrieval is handling the massive amount of information available to Internet users. Existing ranking techniques and strategies that govern the retrieval process fall short of the expected accuracy: relevant documents are often buried deep in the list of results returned by the search engine. To improve retrieval accuracy, we examine the effect of language on the retrieval process. We then propose a solution that biases relevance toward the individual user when ranking retrieved data. The results demonstrate that using indices based on variations of the same language enhances the accuracy of search engines for individual users.

Keywords: Information Search and Retrieval, Language Variants, Search Engine, Retrieval Accuracy.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1076172

