Automatic Building an Extensive Arabic FA Terms Dictionary

Authors: El-Sayed Atlam, Masao Fuketa, Kazuhiro Morita, Jun-ichi Aoe

Abstract:

Field Association (FA) terms are a limited set of discriminating terms that identify document fields; they are effective in document classification, similar file retrieval, and passage retrieval. However, there is no effective method for automatically extracting relevant Arabic FA terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and extending FA terms to other languages such as Arabic could strengthen further research. This paper presents a new method to extract Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news, from which the method selected an average of 2,825 FA terms (single and compound) per field. In the experimental results, recall and precision are 84% and 79%, respectively. The method therefore selects a large number of relevant Arabic FA terms at high precision and recall.
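
As a rough illustration of the corpora-comparison idea and of how recall and precision are computed, the following minimal Python sketch scores each candidate term by how strongly its relative frequency concentrates in one field. It is not the authors' actual measure: the concentration score, the thresholds `min_share` and `min_freq`, the function names, and the transliterated toy tokens are all assumptions made for illustration, and the POS pattern filtering described in the abstract is assumed to have been applied upstream.

```python
from collections import Counter


def field_shares(term, field_counts):
    """Share of a term's size-normalised frequency that falls in each field."""
    rel = {}
    for field, counts in field_counts.items():
        size = sum(counts.values())
        # Relative frequency, so corpora of different sizes are comparable.
        rel[field] = counts[term] / size if size else 0.0
    total = sum(rel.values())
    return {f: (r / total if total else 0.0) for f, r in rel.items()}


def select_fa_candidates(field_counts, min_share=0.8, min_freq=2):
    """Return {field: set of terms} concentrated in that field (hypothetical rule)."""
    selected = {field: set() for field in field_counts}
    vocabulary = set().union(*field_counts.values())
    for term in vocabulary:
        shares = field_shares(term, field_counts)
        best = max(shares, key=shares.get)
        if shares[best] >= min_share and field_counts[best][term] >= min_freq:
            selected[best].add(term)
    return selected


def precision_recall(extracted, relevant):
    """Precision and recall of an extracted term set against a reference set."""
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy, transliterated counts (assumed data, not taken from the paper).
toy_counts = {
    "sports": Counter({"hadaf": 8, "laeib": 5, "fi": 10}),
    "politics": Counter({"barlaman": 6, "fi": 10, "hadaf": 1}),
}
candidates = select_fa_candidates(toy_counts)
```

With these toy counts, the function word `fi` is spread across both fields and is rejected, while the remaining terms are assigned to the field where they concentrate. `precision_recall` can then compare any extracted set with a manually judged reference set.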

Keywords: Arabic Field Association Terms, information extraction, document classification, information retrieval.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1080856

