Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: 'Reddit'

Yasmeen Bassas; Sandra Kuebler; Allen Riddell

Abstract:

Native Language Identification (NLI) is a growing subfield of Natural Language Processing (NLP). The task is to predict an author’s native language from text written in a second language. In this paper, we compare the performance of two types of features, content-based and content-independent, when models are evaluated on a corpus different from the one they were trained on. The models are trained on one corpus (TOEFL) and then evaluated on an external social media corpus (Reddit). Three classifiers are used: a baseline, a linear SVM, and Logistic Regression. Results show that content-based features are more accurate and more robust than content-independent ones, both within corpus and across corpora.

Keywords: NLI, NLP, content-based features, content-independent features, social media corpus, ML.
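
The cross-corpus setup described in the abstract (train on one corpus, evaluate on another, for each feature type and classifier) can be sketched with scikit-learn as follows. This is only an illustrative sketch, not the authors' implementation. The abstract does not specify the feature definitions, so word n-grams stand in for content-based features and a short function-word list stands in for content-independent features. The majority-class baseline, the hyperparameters, the two-letter labels, and the toy sentences are likewise assumptions made for the example.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumption: a tiny function-word list as the content-independent feature set.
FUNCTION_WORDS = ["the", "a", "an", "of", "in", "on", "to",
                  "and", "but", "that", "is", "are", "with", "for"]

# Factories, so every (feature, classifier) pair gets fresh, unfitted objects.
FEATURES = {
    # Assumption: word uni- and bigrams as content-based features.
    "content-based": lambda: TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    # Fixed vocabulary; the token pattern also keeps one-letter words such as "a".
    "content-independent": lambda: CountVectorizer(
        vocabulary=FUNCTION_WORDS, token_pattern=r"(?u)\b\w+\b"),
}

CLASSIFIERS = {
    # Assumption: the baseline always predicts the most frequent training label.
    "baseline": lambda: DummyClassifier(strategy="most_frequent"),
    "linear_svm": lambda: LinearSVC(),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
}


def cross_corpus_eval(train_texts, train_labels, test_texts, test_labels):
    """Fit each feature/classifier pair on one corpus and score it on another."""
    scores = {}
    for feat_name, make_features in FEATURES.items():
        for clf_name, make_clf in CLASSIFIERS.items():
            model = make_pipeline(make_features(), make_clf())
            model.fit(train_texts, train_labels)      # training corpus only
            predictions = model.predict(test_texts)   # external corpus
            scores[(feat_name, clf_name)] = accuracy_score(test_labels, predictions)
    return scores


# Hypothetical toy data standing in for the training (essay) corpus.
TRAIN_TEXTS = [
    "I think that the students are learning a lot in the university",
    "The travel with a group is better for the people and for the guide",
    "In my opinion the advertising is making the products look good",
    "Young people enjoy life more but they have to study and to work",
    "It is important to understand ideas and not only to learn facts",
    "Successful people try new things and take risks in the business",
]
TRAIN_LABELS = ["L1_A", "L1_A", "L1_A", "L1_B", "L1_B", "L1_B"]

# Hypothetical toy data standing in for the external (social media) corpus.
TEST_TEXTS = [
    "honestly the game is better with a group of friends",
    "people try to learn new things but it is not easy",
]
TEST_LABELS = ["L1_A", "L1_B"]

if __name__ == "__main__":
    results = cross_corpus_eval(TRAIN_TEXTS, TRAIN_LABELS, TEST_TEXTS, TEST_LABELS)
    for (feat_name, clf_name), acc in sorted(results.items()):
        print(f"{feat_name:20s} {clf_name:20s} accuracy={acc:.2f}")
```

Passing the same corpus split for both training and testing (with a held-out portion) would give the within-corpus condition; swapping in a different test corpus gives the cross-corpus condition. Accuracy on toy data of this size is not meaningful and is shown only to illustrate the evaluation loop.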

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.7563501
