Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model

Selvam M; Natarajan. A M; Thangarajan R

Abstract:

Parsing is important in linguistics and natural language processing for understanding the syntax and semantics of a natural language. Parsing natural language text is challenging because of problems such as ambiguity and inefficiency, and because the interpretation of text depends on context. A probabilistic component is essential for resolving ambiguity in both syntax and semantics, thereby increasing the accuracy and efficiency of the parser. The Tamil language has inherent features that make parsing still more challenging. To address these problems, a lexicalized and statistical approach is applied to parsing with the aid of a language model. Statistical models focus mainly on the semantics of the language and suit large-vocabulary tasks, whereas structural methods focus on syntax and suit small-vocabulary tasks. A trigram-based statistical language model for Tamil with a medium vocabulary of 5000 words has been built. Although statistical parsing performs better through trigram probabilities and a large vocabulary, it has disadvantages: it focuses on semantics rather than syntax, and it lacks support for free word order and long-distance relationships. To overcome these disadvantages, a structural component is incorporated into the statistical language model, which leads to a hybrid language model. This paper attempts to build a phrase-structured hybrid language model that resolves the disadvantages mentioned above. In developing the hybrid language model, a new part-of-speech tag set for Tamil with more than 500 tags has been created to give wider coverage. A phrase-structured Treebank has been developed from 326 Tamil sentences covering more than 5000 words. The hybrid language model has been trained on this phrase-structured Treebank using the immediate-head parsing technique. A lexicalized and statistical parser that employs this hybrid language model and the immediate-head parsing technique gives better results than the pure grammar-based and trigram-based models.
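
To make the trigram baseline mentioned in the abstract concrete, the following is a minimal sketch of an unsmoothed maximum-likelihood trigram estimate. It is an illustration only and not the authors' implementation: the function names `train_trigram` and `trigram_prob`, the `<s>`/`</s>` padding symbols, and the toy corpus of placeholder tokens are all assumptions introduced here. The example also hints at the free-word-order limitation the abstract describes, since a reordering that never occurred in training receives probability zero.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, bi = Counter(), Counter()
    for tokens in sentences:
        # Pad so the first real word also has a two-word context.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            bi[context] += 1
            tri[context + (padded[i],)] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    context_count = bi[(w1, w2)]
    if context_count == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return tri[(w1, w2, w3)] / context_count

# Toy corpus of placeholder tokens (hypothetical, not data from the paper).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["b", "a", "c"]]
tri, bi = train_trigram(corpus)

seen = trigram_prob(tri, bi, "a", "b", "c")       # order observed in training
reordered = trigram_prob(tri, bi, "c", "b", "a")  # same words, unseen order
```

Because the estimate depends only on a fixed two-word window, it carries no notion of phrase structure; this is the gap that a structural component, such as the phrase-structured model described above, is meant to fill.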

Keywords: Hybrid Language Model, Immediate Head Parsing, Lexicalized and Statistical Parsing, Natural Language Processing, Parts of Speech, Probabilistic Context Free Grammar, Tamil Language, Tree Bank.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1328310

