Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems

Tengku Mohd T. Sembok

Abstract:

Document retrieval in Information Retrieval Systems (IRS) is generally about understanding the information in the documents concerned. The better a system understands the contents of documents, the more effective its retrieval outcomes will be. Understanding document contents, however, is a very complex task. Conventional IRS apply algorithms that can only approximate the meaning of document contents through a keyword approach using the vector space model. Keywords may be unstemmed or stemmed. Stemming and conflating keywords during retrieval is a step towards applying semantic technology in IRS. Word stemming is a process in morphological analysis, the stage of natural language processing that precedes syntactic and semantic analysis. We have developed stemming algorithms for Malay and Arabic and incorporated them into our experimental systems in order to measure retrieval effectiveness. The results show that retrieval effectiveness increases when stemming is used in the systems.
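To make the idea of conflation in a vector space model concrete, the following is a minimal sketch and not the stemming algorithm developed in this work. The affix lists, the `MIN_STEM` guard, the function names, and the sample Malay phrases are all assumptions chosen for illustration. It shows how a query term and a document term that share a root fail to match as raw keywords but do match once both are reduced to a common stem.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; NOT the rule set from the paper.
PREFIXES = ("di", "pe", "me", "ber")
SUFFIXES = ("kan", "an", "i")
MIN_STEM = 4  # assumed guard so short words are not over-stemmed


def toy_stem(word):
    """Strip at most one prefix and one suffix, keeping at least MIN_STEM letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


def vectorize(text, stem=False):
    """Build a term-frequency vector, optionally conflating terms by stem."""
    tokens = text.lower().split()
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = "makanan"          # "food"
document = "nasi dimakan"  # "rice is eaten"

unstemmed_score = cosine(vectorize(query), vectorize(document))
stemmed_score = cosine(vectorize(query, stem=True), vectorize(document, stem=True))
print(unstemmed_score, stemmed_score)
```

With raw keywords the query and the document share no term, so their similarity is zero. After stemming, both "makanan" and "dimakan" reduce to "makan" and the document becomes retrievable. A practical stemmer would need a far richer rule set, and typically a root-word dictionary, to avoid the over-stemming and under-stemming errors that this toy version makes on real text.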

Keywords: Information Retrieval, Natural Language Processing, Artificial Intelligence.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1074898
