	%0 Journal Article
	%A Yasmeen Bassas and Sandra Kuebler and Allen Riddell
	%D 2023
	%J International Journal of Cognitive and Language Sciences
	%B World Academy of Science, Engineering and Technology
	%I Open Science Index 193, 2023
	%T Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: 'Reddit'
	%U https://publications.waset.org/pdf/10012918
	%V 193
	%X Native Language Identification (NLI) is a growing subfield of Natural Language Processing (NLP). The task is to predict an author’s native language from their writing in a second language. In this paper, we investigate how two types of features, content-based and content-independent, perform when they are evaluated on a different corpus (social media data from Reddit). The predefined models are trained on one corpus (TOEFL) and then evaluated on an external corpus (Reddit). Three classifiers are used in this task: a baseline, a linear SVM, and Logistic Regression. Results show that content-based features are more accurate and more robust than content-independent ones, both within corpus and across corpora.
	%P 53 - 57