Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development
Authors: L. Kamandulytė-Merfeldienė
Abstract:
The paper addresses the main methodological issues of the Corpus of Spoken Lithuanian, whose development began in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus comprises three main stages: data collection, transcription of the recorded data, and grammatical annotation. Data collection was based on the principles of balance and naturalness. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to a steady increase in studies of spontaneous communication; papers based on it have dealt with the distribution of parts of speech, the use of different grammatical forms, variation of inflectional paradigms, the distribution of fillers, the syntactic functions of adjectives, and the mean length of utterances.
Keywords: CHILDES, Corpus of Spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1129916
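The abstract mentions CHAT-format transcripts with grammatical annotation, and studies of part-of-speech distribution and mean length of utterances. As a rough illustration only, the sketch below shows how such measures might be computed from a CHAT-style transcript in Python. The sample transcript, the speaker codes, the tag names (`pro`, `v`, `adv`) and the function names are all assumptions made for this example; they are not taken from the corpus or its actual tagset, and this is not the authors' tooling.

```python
from collections import Counter

# Hypothetical CHAT-style fragment: "*XXX:" lines hold the utterance,
# "%mor:" lines hold the grammatical annotation as "pos|stem" tokens.
# The tags used here are illustrative, not the corpus's real tagset.
SAMPLE = (
    "@Begin\n"
    "*AAA:\taš einu namo .\n"
    "%mor:\tpro|aš v|eiti adv|namo .\n"
    "*BBB:\tgerai .\n"
    "%mor:\tadv|gerai .\n"
    "@End\n"
)


def mor_tiers(chat_text):
    """Yield the token list of every %mor tier in the transcript.

    Simplification: continuation lines (tiers wrapped over several
    lines) are not handled in this sketch.
    """
    for line in chat_text.splitlines():
        if line.startswith("%mor:"):
            yield line[len("%mor:"):].split()


def pos_counts(chat_text):
    """Count part-of-speech tags over all annotated word forms."""
    counts = Counter()
    for tokens in mor_tiers(chat_text):
        for token in tokens:
            if "|" in token:  # skips punctuation such as "."
                # Keep the main category, dropping any ":subcategory" part.
                pos = token.split("|", 1)[0].split(":", 1)[0]
                counts[pos] += 1
    return counts


def mlu_words(chat_text):
    """Mean length of utterance, measured in annotated word forms."""
    lengths = [sum("|" in t for t in tokens) for tokens in mor_tiers(chat_text)]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    print(pos_counts(SAMPLE))
    print(mlu_words(SAMPLE))
```

Counting from the annotation tier rather than the raw utterance line keeps punctuation and unannotated material out of both measures, which is one plausible design choice among several.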