Named Entity Recognition using Support Vector Machine: A Language Independent Approach
Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now considered fundamental to many Natural Language Processing (NLP) tasks, such as information retrieval, machine translation, information extraction, and question answering. This paper reports on the development of an NER system for Bengali and Hindi using Support Vector Machine (SVM). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its application to Indian languages (ILs) is very new. The system uses the contextual information of the words along with a variety of features that are helpful in predicting the four named entity (NE) classes: Person name, Location name, Organization name, and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi, tagged with the twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. These lexical patterns have been used as features of the SVM in order to improve system performance. The NER system has been tested with gold standard test sets of 35K and 60K tokens for Bengali and Hindi, respectively. Evaluation results have demonstrated recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show an improvement in the f-score of 5.13% with the use of context patterns. Statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
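The reported f-scores can be checked against the reported recall and precision values. The short sketch below assumes that the f-score is the balanced harmonic mean of recall and precision (F = 2PR / (P + R)); the abstract does not state the weighting, so this is an assumption, and the function name `f_score` is introduced here only for illustration.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean) of recall and precision, both in percent.

    Assumption: equal weighting of recall and precision (beta = 1).
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision values as reported in the abstract (percent).
reported = {
    "Bengali": (88.61, 80.12),
    "Hindi": (80.23, 74.34),
}

for language, (recall, precision) in reported.items():
    print(f"{language}: F = {f_score(recall, precision):.2f}%")
```

Under this assumption the computed values round to 84.15% for Bengali and 77.17% for Hindi, which agrees with the figures given above.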
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1057979