Search results for: spoken corpus
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 582

582 Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development

Authors: L. Kamandulytė-Merfeldienė

Abstract:

The paper deals with the main methodological issues of the Corpus of Spoken Lithuanian, whose development began in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, transcribing the recorded data, and grammatical annotation. Data collection was based on the principles of balance and naturalness. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to a steady increase in studies on spontaneous communication, and various papers have dealt with the distribution of parts of speech, the use of different grammatical forms, variation of inflectional paradigms, the distribution of fillers, syntactic functions of adjectives, and the mean length of utterance.

Keywords: CHILDES, corpus of spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian

Procedia PDF Downloads 204
581 A Self-Built Corpus-Based Study of Four-Word Lexical Bundles in Native English Teachers’ EFL Classroom Discourse in Northeast China: The Significance of Stance

Authors: Fang Tan

Abstract:

This research focuses on the appropriate use of lexical bundles in spoken discourse, particularly in English as a Foreign Language (EFL) classrooms in Northeast China. While previous studies have mainly examined lexical bundles in written discourse, there is a need to investigate their usage in spoken discourse due to the limited availability of spoken discourse corpora. English teachers’ use of lexical bundles is crucial for effective teaching and communication in the EFL classroom. The aim of this study is to investigate the functions of four-word lexical bundles in native English teachers’ EFL oral English classes in Northeast China. Specifically, the research focuses on the usage of stance bundles, which were found to be the most significant type of bundle in the analyzed corpus. By comparing the self-built university spoken English classroom discourse corpus with the other self-built university English for General Purposes (EGP) corpus, the study aims to highlight the difference in bundle usage between native and non-native teachers in EFL classrooms. The research takes a corpus-based approach. The observed corpus consists of more than 300,000 tokens collected over the past five years. The reference corpus is composed of over 800,000 tokens collected over 12 years. All the primary data collection involved transcribing and annotating spoken English classes taught by native English teachers. The analysis procedures included identifying and categorizing four-word lexical bundles, with specific emphasis on stance bundles. Frequency counts and comparisons with the Chinese English teachers’ corpus were conducted to identify patterns and differences in bundle usage. The research addresses the following questions: 1) What are the functions of four-word lexical bundles in native English teachers’ EFL oral English classes?
2) How do stance bundles differ in usage between native and non-native English teachers’ classes? 3) What implications can be drawn for English teachers’ professional development based on the findings? In conclusion, this study provides valuable insights into the usage of four-word lexical bundles, particularly stance bundles, in native English teachers’ EFL oral English classes in Northeast China. The research highlights the difference in bundle usage between native and non-native English teachers’ classes and provides implications for English teachers’ professional development. The findings contribute to the understanding of lexical bundle usage in EFL classroom discourse and have theoretical importance for language teaching methodologies. The self-built university English classroom discourse corpus used in this research is a valuable resource for future studies in this field.
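
The identification step described above (finding recurrent four-word sequences and counting them) can be sketched roughly as follows. This is a minimal illustration, not the author's procedure: the function name four_word_bundles, the cut-offs min_freq and min_texts, the simple regular-expression tokenizer, and the sample utterances are all assumptions made for the example. Functional categories such as stance would still have to be assigned by hand.

```python
import re
from collections import Counter, defaultdict


def four_word_bundles(transcripts, min_freq=2, min_texts=2):
    """Return four-word sequences that recur across transcripts.

    min_freq and min_texts are illustrative cut-offs (assumptions),
    not the thresholds used in the study.
    """
    counts = Counter()
    spread = defaultdict(set)  # bundle -> ids of transcripts containing it
    for text_id, text in enumerate(transcripts):
        # Very rough tokenizer: lowercase alphabetic words and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - 3):
            bundle = " ".join(tokens[i:i + 4])
            counts[bundle] += 1
            spread[bundle].add(text_id)
    # Keep bundles that are frequent enough and occur in enough transcripts.
    return {
        bundle: n
        for bundle, n in counts.items()
        if n >= min_freq and len(spread[bundle]) >= min_texts
    }


# Invented classroom utterances, for illustration only.
sample = [
    "I want you to read this.",
    "Now I want you to listen.",
    "I think it is fine.",
]
print(four_word_bundles(sample))
```

Requiring a bundle to occur in several transcripts, and not only several times overall, is a common way to keep one speaker's habits from dominating the list; in a real corpus the frequency cut-off would normally be normalized per million tokens.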

Keywords: EFL classroom discourse, four-word lexical bundles, stance, implication

Procedia PDF Downloads 29
580 Tagging a Corpus of Media Interviews with Diplomats: Challenges and Solutions

Authors: Roberta Facchinetti, Sara Corrizzato, Silvia Cavalieri

Abstract:

Increasing interconnection between data digitalization and linguistic investigation has given rise to unprecedented potentialities and challenges for corpus linguists, who need to master IT tools for data analysis and text processing, as well as to develop techniques for efficient and reliable annotation in specific mark-up languages that encode documents in a format that is both human and machine-readable. In the present paper, the challenges emerging from the compilation of a linguistic corpus will be taken into consideration, focusing on the English language in particular. To do so, the case study of the InterDiplo corpus will be illustrated. The corpus, currently under development at the University of Verona (Italy), represents a novelty in terms both of the data included and of the tag set used for its annotation. The corpus covers media interviews and debates with diplomats and international operators conversing in English with journalists who do not share the same lingua-cultural background as their interviewees. To date, this appears to be the first tagged corpus of international institutional spoken discourse and will be an important database not only for linguists interested in corpus analysis but also for experts operating in international relations. In the present paper, special attention will be dedicated to the structural mark-up, parts of speech annotation, and tagging of discursive traits, which are the innovative parts of the project and the result of a thorough study to find the solution best suited to the analytical needs of the data. Several aspects will be addressed, with special attention to the tagging of the speakers’ identity, the communicative events, and anthropophagic. Prominence will be given to the annotation of question/answer exchanges to investigate the interlocutors’ choices and how such choices impact communication. 
Indeed, the automated identification of questions, in relation to the expected answers, helps to understand how interviewers elicit information as well as how interviewees provide their answers to fulfill their respective communicative aims. A detailed description of the aforementioned elements will be given using the InterDiplo-Covid19 pilot corpus. The results of our preliminary analysis will highlight the viable solutions found in the construction of the corpus in terms of XML conversion, metadata definition, tagging system, and discursive-pragmatic annotation to be included via Oxygen.

Keywords: spoken corpus, diplomats’ interviews, tagging system, discursive-pragmatic annotation, English linguistics

Procedia PDF Downloads 151
579 Number Variation of the Personal Pronoun We in American Spoken English

Authors: Qiong Hu, Ming Yue

Abstract:

Language variation signals the newest usage of a language community, which might become the developmental trend of that language. The personal pronoun we is prescribed as a plural pronoun in grammar, but its number value is more flexible in actual use. Based on the self-built Friends corpus, the present research explores the number value of the first-person pronoun we in present-day American spoken English. Taking the subjectivity of we into consideration, this paper used ‘we + PCU (perception-cognition-utterance) verbs’ collocations and ‘we + plural categories’ as the parameters. Results from corpus data and manual annotation show that: 1) the overall frequency of we has been increasing; 2) we has been increasingly used with other plural categories, indicating a weakening of its plural reference; and 3) we has been increasingly used with PCU verbs of strong subjectivity, indicating a strengthening of its singular reference. All these seem to support our hypothesis that we is undergoing a process of further grammaticalization towards a singular reference, though further evidence is needed to confirm this bold prediction.

Keywords: number, PCU verbs, personal pronoun we

Procedia PDF Downloads 199
578 The Use of Corpora in Improving Modal Verb Treatment in English as Foreign Language Textbooks

Authors: Lexi Li, Vanessa H. K. Pang

Abstract:

This study aims to demonstrate how native and learner corpora can be used to enhance modal verb treatment in EFL textbooks in mainland China. It contributes to a corpus-informed and learner-centered design of grammar presentation in EFL textbooks that enhances the authenticity and appropriateness of textbook language for target learners. The linguistic focus is will, would, can, could, may, might, shall, should, must. The native corpus is the spoken component of BNC2014 (hereafter BNCS2014). The spoken part is chosen because the pedagogical purpose of the textbooks is communication-oriented. Using the standard query option of CQPweb, 5% of each of the nine modals was sampled from BNCS2014. The learner corpus is the POS-tagged Ten-thousand English Compositions of Chinese Learners (TECCL). All the essays under the 'secondary school' section were selected. A series of five secondary coursebooks comprise the textbook corpus. All the data in both the learner and the textbook corpora are retrieved through the concordance functions of WordSmith Tools (version 5.0). Data analysis was divided into two parts. The first part compared the patterns of modal verbs in the textbook corpus and BNCS2014 with respect to distributional features, semantic functions, and co-occurring constructions to examine whether the textbooks reflect the authentic use of English. Secondly, the learner corpus was analyzed in terms of the use (distributional features, semantic functions, and co-occurring constructions) and the misuse (syntactic errors, e.g., she can sings*.) of the nine modal verbs to uncover potential difficulties that confront learners. The analysis of distribution indicates several discrepancies between the textbook corpus and BNCS2014. The four most frequent modal verbs in BNCS2014 are can, would, will, could, while can, will, should, could are the top four in the textbooks. Most strikingly, there is an unusually high proportion of can (41.1%) in the textbooks. 
The results on different meanings show that will, would and must are the most problematic. For example, for will, the textbooks contain 20% more occurrences of 'volition' and 20% fewer of 'prediction' than BNCS2014. Regarding co-occurring structures, the textbooks over-represent the structure 'modal +do' across the nine modal verbs. Another major finding is that the structure 'modal +have done', which frequently co-occurs with could, would, should, and must, is underused in textbooks. Besides, these four modal verbs are the most difficult for learners, as the error analysis shows. This study demonstrates how the synergy of native and learner corpora can be harnessed to improve EFL textbook presentation of modal verbs so that textbooks can provide not only authentic language used in natural discourse but also an appropriate design tailored to the needs of target learners.

Keywords: English as Foreign Language, EFL textbooks, learner corpus, modal verbs, native corpus

Procedia PDF Downloads 110
577 The Value of Computerized Corpora in EFL Textbook Design: The Case of Modal Verbs

Authors: Lexi Li

Abstract:

This study aims to contribute to the field of how computer technology can be exploited to enhance EFL textbook design. Specifically, the study demonstrates how computerized native and learner corpora can be used to enhance modal verb treatment in EFL textbooks. The linguistic focus is will, would, can, could, may, might, shall, should, must. The native corpus is the spoken component of BNC2014 (hereafter BNCS2014). The spoken part is chosen because the pedagogical purpose of the textbooks is communication-oriented. Using the standard query option of CQPweb, 5% of each of the nine modals was sampled from BNCS2014. The learner corpus is the POS-tagged Ten-thousand English Compositions of Chinese Learners (TECCL). All the essays under the “secondary school” section were selected. A series of five secondary coursebooks comprise the textbook corpus. All the data in both the learner and the textbook corpora are retrieved through the concordance functions of WordSmith Tools (version 5.0). Data analysis was divided into two parts. The first part compared the patterns of modal verbs in the textbook corpus and BNCS2014 with respect to distributional features, semantic functions, and co-occurring constructions to examine whether the textbooks reflect the authentic use of English. Secondly, the learner corpus was compared with the textbook corpus in terms of the use (distributional features, semantic functions, and co-occurring constructions) in order to examine the degree of influence of the textbook on learners’ use of modal verbs. Moreover, the learner corpus was analyzed for the misuse (syntactic errors, e.g., she can sings*.) of the nine modal verbs to uncover potential difficulties that confront learners. The results indicate discrepancies between the textbook presentation of modal verbs and authentic modal use in natural discourse in terms of distributions of frequencies, semantic functions, and co-occurring structures. 
Furthermore, there are consistent patterns of use between the learner corpus and the textbook corpus with respect to the three above-mentioned aspects, except for could, will and must, partially confirming the correlation between the frequency effects and L2 grammar acquisition. Further analysis reveals that the exceptions are caused by both positive and negative L1 transfer, indicating that the frequency effects can be intercepted by L1 interference. Besides, error analysis revealed that could, would, should and must are the most difficult for Chinese learners due to both inter-linguistic and intra-linguistic interference. The discrepancies between the textbook corpus and the native corpus point to a need to adjust the presentation of modal verbs in the textbooks in terms of frequencies, different meanings, and verb-phrase structures. Along with the adjustment of modal verb treatment based on authentic use, it is important for textbook writers to take into consideration L1 interference as well as learners’ difficulties in their use of modal verbs. The present study is a methodological showcase of the combination of both native and learner corpora in the enhancement of EFL textbook language authenticity and appropriateness for learners.

Keywords: EFL textbooks, learner corpus, modal verbs, native corpus

Procedia PDF Downloads 90
576 A Corpus-Based Analysis of Japanese Learners' English Modal Auxiliary Verb Usage in Writing

Authors: S. Nakayama

Abstract:

For non-native English speakers, using English modal auxiliary verbs appropriately can be among the most challenging tasks. This research sought to identify differences in modal verb usage between Japanese non-native English speakers (JNNSs) and native speakers (NSs) from two different perspectives: frequency of use and distribution of verb phrase structures (VPSs) where modal verbs occur. This study can contribute to the identification of JNNSs' interlanguage with regard to modal verbs; the main aim is to make suggestions for the improvement of teaching materials as well as to help language teachers teach modal verbs in a way that is helpful for learners. To address the primary question in this study, usage of nine central modals (‘can’, ‘could’, ‘may’, ‘might’, ‘shall’, ‘should’, ‘will’, ‘would’, and ‘must’) by JNNSs was compared with that by NSs in the International Corpus Network of Asian Learners of English (ICNALE). This corpus is one of the largest freely-available corpora focusing on Asian English learners’ language use. The ICNALE corpus consists of four modules: ‘Spoken Monologue’, ‘Spoken Dialogue’, ‘Written Essays’, and ‘Edited Essays’. Among these, this research adopted the ‘Written Essays’ module only, which is a set of 200-300 word essays and contains approximately 1.3 million words in total. Frequency analysis revealed gaps as well as similarities in frequency order. Specifically, both JNNSs and NSs used ‘can’ most frequently, followed by ‘should’ and ‘will’; however, the order of all the other modals except for ‘shall’ differed between the two groups. A log-likelihood test uncovered JNNSs’ overuse of ‘can’ and ‘must’ as well as their underuse of ‘will’ and ‘would’. VPS analysis revealed that JNNSs used modal verbs in a relatively narrow range of VPSs as compared to NSs. 
Results showed that JNNSs used most of the modals with bare infinitives or the passive voice only whereas NSs used the modals in a wide range of VPSs including the progressive construction and the perfect aspect, both of which were the structures where JNNSs rarely used the modals. Results of frequency analysis suggest that language teachers or teaching materials should explain other modality items so that learners can avoid relying heavily on certain modals and have a wide range of lexical items to reflect their feelings more accurately. Besides, the underused modals should be more stressed in the classroom because they are members of epistemic modals, which allow us to not only interject our views into propositions but also build a relationship with readers. As for VPSs, teaching materials should present more examples of the modals occurring in a wide range of VPSs to help learners to be able to express their opinions from a variety of viewpoints.
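
The log-likelihood comparison mentioned above can be illustrated with a short sketch. The abstract does not say which formulation was used; the version below is the two-corpus log-likelihood (G2) statistic widely used in corpus linguistics, and the function names and all counts in the usage lines are invented for illustration only.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) for one word.

    freq_a / freq_b: occurrences of the word in corpus A / corpus B.
    size_a / size_b: total number of words in corpus A / corpus B.
    """
    total = freq_a + freq_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def direction(freq_a, size_a, freq_b, size_b):
    """Report whether corpus A overuses or underuses the word relative to B."""
    rate_a, rate_b = freq_a / size_a, freq_b / size_b
    if rate_a > rate_b:
        return "overuse"
    if rate_a < rate_b:
        return "underuse"
    return "equal"


# Hypothetical counts: a modal in a learner sub-corpus vs. a native sub-corpus.
g2 = log_likelihood(300, 100_000, 150, 90_000)
print(round(g2, 2), direction(300, 100_000, 150, 90_000))
```

With one degree of freedom, a G2 value above 3.84 is conventionally read as significant at p < 0.05. The statistic itself only says that the two frequencies differ, so the direction of the difference is read from the relative frequencies.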

Keywords: corpus linguistics, Japanese learners of English, modal auxiliary verbs, International Corpus Network of Asian Learners of English

Procedia PDF Downloads 102
575 The Omani Learner of English Corpus: Source and Tools

Authors: Anood Al-Shibli

Abstract:

Designing a learner corpus is not an easy task because learner language involves many variables that might affect the results of any study based on learners’ language production (spoken and written). It is also essential to design a learner corpus systematically, especially when it is intended to serve as a reference for language research. Therefore, the design of the Omani Learner Corpus (OLEC) has followed many explicit and systematic criteria. These criteria can be regarded as the foundation for designing any learner corpus to be exploited effectively in language use and language learning studies. In addition, OLEC is a manually error-annotated corpus. Error annotation in learner corpora is essential; however, it is time-consuming and prone to errors. Consequently, a navigating tool was designed to help the annotators insert error codes in order to make the error-annotation process more efficient and consistent. To ensure accuracy, an error-annotation procedure is followed to annotate OLEC, and some preliminary findings are noted. One of the main results of this procedure is the creation of an error-annotation system based on Omani learners’ English language production. Because OLEC is still in its first stages, the preliminary findings relate to only one level of proficiency and one error type, namely verb-related errors. It is found that Omani learners in OLEC tend to make more errors in verb formation, followed by problems in verb agreement. Comparing the results to other error-based studies indicates that the Omani learners tend to make basic verb errors, which can be found at lower levels of proficiency. To this end, it is essential to acknowledge that examining learners’ errors can give insights into language acquisition and language learning, and that most errors do not happen randomly but occur systematically among language learners.

Keywords: error-annotation system, error-annotation manual, learner corpora, verbs related errors

Procedia PDF Downloads 104
574 Aspects of Diglossia in Arabic Language Learning

Authors: Adil Ishag

Abstract:

Diglossia emerges in a situation where two distinct varieties of a language are used side by side within a certain community. In this case, one is considered a high or standard variety and the other a low or colloquial variety. Arabic is an extreme example of a highly diglossic language. This diglossia is due to the fact that Arabic is one of the most widely spoken languages, spread over 22 countries on two continents as a mother tongue, and it is also widely spoken in many other Islamic countries as a second language or simply as the language of the Quran. The geographical variation between the countries where the language is spoken, on the one hand, and the duality of Classical Arabic and the daily spoken dialects in the Arab world, on the other hand, make the Arabic language one of the most diglossic languages. This paper investigates this phenomenon and its relation to learning Arabic as a first and second language.

Keywords: Arabic language, diglossia, first and second language, language learning

Procedia PDF Downloads 523
573 A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: Aqa-Webcorp

Authors: Wided Bakari, Patrice Bellot, Mahmoud Neji

Abstract:

With the development of electronic media and the heterogeneity of Arabic data on the Web, the idea of building a clean corpus for certain applications of natural language processing, including machine translation, information retrieval, and question answering, becomes more and more pressing. In this manuscript, we seek to create and develop our own corpus of question-text pairs. This corpus will then provide a better base for our experimentation step. Thus, we try to model its construction by a method for Arabic that recovers texts from the web that could prove to be answers to our factual questions. To do this, we had to develop a java script that can extract a list of html pages from a given query. We then clean these pages so as to obtain a database of texts and a corpus of question-text pairs. In addition, we give preliminary results of our proposed method. Some investigations into the construction of Arabic corpora are also presented in this document.

Keywords: Arabic, web, corpus, search engine, URL, question, corpus building, script, Google, html, txt

Procedia PDF Downloads 291
572 Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: ’Reddit’

Authors: Yasmeen Bassas, Sandra Kuebler, Allen Riddell

Abstract:

Native language identification is one of the growing subfields in natural language processing (NLP). The task of native language identification (NLI) is mainly concerned with predicting the native language of an author from their writing in a second language. In this paper, we investigate the performance of two types of features, content-based features vs. content independent features, when they are evaluated on a different corpus (using social media data, “Reddit”). In this NLI task, the predefined models are trained on one corpus (TOEFL), and then the trained models are evaluated on different data using an external corpus (Reddit). Three classifiers are used in this task: the baseline, linear SVM, and logistic regression. Results show that content-based features are more accurate and robust than content independent ones when tested both within the corpus and across corpora.

Keywords: NLI, NLP, content-based features, content independent features, social media corpus, ML

Procedia PDF Downloads 95
571 Corpus-Based Description of Core English Nouns of Pakistani English, an EFL Learner Perspective at Secondary Level

Authors: Abrar Hussain Qureshi

Abstract:

Vocabulary has been highlighted as a key indicator in any foreign language learning program, especially English as a foreign language (EFL). It is often considered an important tool in the foreign language curriculum, and its deficiency impedes successful communication in the target language. Knowledge of the lexicon is very significant in acquiring communicative competence and performance. Nouns constitute a considerable bulk of English vocabulary. Indeed, they are the bones of the English language and the main semantic carriers in spoken and written discourse. As nouns dominate the bulk of the English lexicon, their role becomes all the more important. The present research is a systematic effort in this regard to work out a list of highly frequent Pakistani English nouns for EFL learners at the secondary level. It will encourage learner autonomy and save learners' time. The corpus used for the research has been developed locally from leading English newspapers of Pakistan. WordSmith Tools has been used to process the research data and to retrieve a word list of frequent Pakistani English nouns. The retrieved list of core Pakistani English nouns is expected to be useful for English language learners at the secondary level as it covers a wide range of speech events.

Keywords: corpus, EFL, frequency list, nouns

Procedia PDF Downloads 66
570 Semantic Preference across Research Articles: A Corpus-Based Study of Adjectives in English

Authors: Valdênia Carvalho e Almeida

Abstract:

The goal of the present study is to investigate the semantic preference of the most frequent adjectives in research articles through a corpus-based analysis of texts published in journals in Applied Linguistics (AL). The corpus used in this study contains texts published in the period from 2014 to 2018 in three journals: Language Learning and Technology, English for Academic Purposes, and TESOL Quarterly, totaling more than one million words. A corpus-based analysis was carried out on the corpus to identify the most frequent adjectives that co-occurred in the three journals. By observing the concordance lines of the adjectives and analyzing the words they associated with, the semantic preferences of each adjective were determined. Later, the AL corpus analysis was compared to the investigation of the same adjectives in a corpus of Chemistry. This second part of the study aimed to identify possible differences and similarities between the two corpora in relation to the use of the adjectives in research articles from both areas. The results show that there are some preferences which seem to be closely related not only to the academic genre of the texts but also to the specific domain of the discipline and, to a lesser extent, to the context of research in each journal. This research illustrates a possible contribution of Corpus Linguistics to exploring the concept of semantic preference in more detail, considering the complex nature of the phenomenon.

Keywords: applied linguistics, corpus linguistics, chemistry, research article, semantic preference

Procedia PDF Downloads 146
569 Specialized Translation Teaching Strategies: A Corpus-Based Approach

Authors: Yingying Ding

Abstract:

This study presents a methodology of specialized translation with the objective of helping teachers to improve their strategies in teaching translation. In order to acquire the skills to translate specialized texts, students need to become familiar with the semantic and syntactic features of source texts and target texts. The aim of our study is to use a corpus-based approach in the teaching of specialized translation between Chinese and Italian. This study proposes to construct a specialized Chinese-Italian comparable corpus that consists of 50 economic contracts from the domain of food. With the help of AntConc, we propose to compile a comparable corpus for translation teaching purposes. This paper attempts to provide insight into how teachers could benefit from a comparable corpus in the teaching of specialized translation from Italian into Chinese and, through some examples of passive sentences, how students could learn to apply different strategies for translating voice appropriately.

Keywords: contrastive studies, specialised translation, corpus-based approach, teaching

Procedia PDF Downloads 333
568 Studying Language of Immediacy and Language of Distance from a Corpus Linguistic Perspective: A Pilot Study of Evaluation Markers in French Television Weather Reports

Authors: Vince Liégeois

Abstract:

Language of immediacy and distance: Within their discourse theory, Koch & Oesterreicher establish a distinction between a language of immediacy and a language of distance. The former refers to those discourses which are oriented more towards a spoken norm, whereas the latter entails discourses oriented towards a written norm, regardless of whether they are realised phonically or graphically. This means that an utterance can be realised phonically but oriented more towards the written language norm (e.g., a scientific presentation or eulogy) or realised graphically but oriented towards a spoken norm (e.g., a scribble or chat messages). Research desiderata: The methodological approach from Koch & Oesterreicher has often been criticised for not providing a corpus-linguistic methodology, which makes it difficult to work with quantitative data or address large text collections within this research paradigm. Consequently, the Koch & Oesterreicher approach has difficulties gaining ground in those research areas which rely more on corpus linguistic research models, like text linguistics and LSP-research. A combinatory approach: Accordingly, we want to establish a combinatory approach with corpus-based linguistic methodology. To this end, we propose to (i) include data about the context of an utterance (e.g., monologicity/dialogicity, familiarity with the speaker) – which were called “conditions of communication” in the original work of Koch & Oesterreicher – and (ii) correlate the linguistic phenomenon at the centre of the inquiry (e.g., evaluation markers) to a group of linguistic phenomena deemed typical for either distance- or immediacy-language. Based on these two parameters, linguistic phenomena and texts could then be mapped on an immediacy-distance continuum. 
Pilot study: To illustrate the benefits of this approach, we will conduct a pilot study on evaluation phenomena in French television weather reports, a form of domain-sensitive discourse which has often been cited as an example of a “text genre”. Within this text genre, we will look at so-called “evaluation markers,” e.g., fixed strings like “bad weather,” “stifling hot,” and “no luck today!”. These evaluation markers help to communicate the coming weather situation to the lay audience but have not yet been studied within the Koch & Oesterreicher research paradigm. Accordingly, we want to figure out whether said evaluation markers are more typical of those weather reports which tend more towards immediacy or those which tend more towards distance. To this end, we collected a corpus with different kinds of television weather reports, e.g., as part of the news broadcast, including dialogue. The evaluation markers themselves will be studied according to the explained methodology, by correlating them to (i) metadata about the context and (ii) linguistic phenomena characterising immediacy-language: repetition, deixis (personal, spatial, and temporal), a freer choice of tense, and right-/left-dislocation. Results: Our results indicate that evaluation markers are more dominantly present in those weather reports inclining towards immediacy-language. Based on the methodology established above, we have gained more insight into the working of evaluation markers in the domain-sensitive text genre of (television) weather reports. For future research, it will be interesting to determine whether said evaluation markers are also typical of immediacy-oriented language in other domain-sensitive discourses.

Keywords: corpus-based linguistics, evaluation markers, language of immediacy and distance, weather reports

Procedia PDF Downloads 173
567 Corporate Cautionary Statement: A Genre of Professional Communication

Authors: Chie Urawa

Abstract:

Cautionary statements or disclaimers in corporate annual reports need to be carefully designed because clear cautionary statements may protect a company in the case of legal disputes but may also undermine positive impressions. This study compares the language of cautionary statements using two corpora, Sony’s cautionary statement corpus (S-corpus) and Panasonic’s cautionary statement corpus (P-corpus), illustrating the differences and similarities in relation to the use of meaningful cautionary statements and critically analyzing why practitioners write them the way they do. The findings describe the distinct differences between the two companies in the presentation of the risk factors and in the way they make the statements. The word ‘ability’ is used more for legal protection in the S-corpus, whereas the word ‘possibility’ is used more to convey a better impression in the P-corpus. The main similarities are identified in the use of lexical words and pronouns, and in the almost identical wording over eight years. The findings show how the companies make the statements unique to themselves in the presentation of risk factors, and reveal the characteristics of a specific genre of professional communication. Important implications of this study are that a more comprehensive approach can be applied in other contexts and can be used by companies to reflect upon their cautionary statements.

Keywords: cautionary statements, corporate annual reports, corpus, risk factors

Procedia PDF Downloads 131
566 Cataphora in English and Chinese Conversation: A Corpus-based Contrastive Study

Authors: Jun Gao

Abstract:

This paper combines the corpus-based and contrastive approaches, seeking to provide a systematic account of cataphora in English and Chinese natural conversations. Based on spoken corpus data, the first part of the paper examines a range of characteristics of cataphora in the two languages, including frequency of occurrence, patterns, and syntactic features. On the basis of this exploration, cataphora in the two languages are contrasted in a structured way. The analysis shows that English and Chinese share a similar distribution of cataphora in natural conversations in terms of frequency of occurrence, with repeat identification cataphora higher than first mention cataphora and intra-sentential cataphora much higher than inter-sentential cataphora. In terms of patterns, three types are identified in English, i.e. P+N, Ø+N, and it+Clause, while in Chinese, two types are identified, i.e., P+N and Ø+N. English and Chinese are similar in terms of syntactic features, i.e., cataphor and postcedent in the intra-sentential cataphora mainly occur in the initial subject position of the same clause, with postcedent immediately followed or delayed, and cataphor and postcedent are mostly in adjacent sentences in inter-sentential cataphora. In the second part of the paper, the motivations of cataphora are investigated. It is found that cataphora is primarily motivated by the speaker and hearer’s different knowledge states with regard to the referent. Other factors are also involved, such as interference, word search, and the tension between the principles of Economy and Clarity.

Keywords: cataphora, contrastive study, motivation, pattern, syntactic features

Procedia PDF Downloads 51
565 A Corpus-Based Study on the Styles of Three Translators

Authors: Wang Yunhong

Abstract:

The present paper is concerned with the different styles of three translators in translating the classical Chinese novel Shuihu Zhuan. Based on a parallel corpus, it adopts a target-oriented approach to examine whether the three translations reveal stylistic differences and shifts, and if so, which ones. The findings show that the three translators demonstrate different styles in their word choices and sentence preferences, which implies that identifying recurrent textual patterns may be a basic step in investigating the style of a translator.

Keywords: corpus, lexical choices, sentence characteristics, style

Procedia PDF Downloads 231
564 A Web-Based Self-Learning Grammar for Spoken Language Understanding

Authors: S. Biondi, V. Catania, R. Di Natale, A. R. Intilisano, D. Panno

Abstract:

One of the major goals of Spoken Dialog Systems (SDS) is to understand what the user utters. In the SDS domain, the Spoken Language Understanding (SLU) module classifies user utterances by means of predefined conceptual knowledge. The SLU module is able to recognize only the meanings previously included in its knowledge base. Due to the vastness of that knowledge, storing the information is a very expensive process. Updating and managing the knowledge base are time-consuming and error-prone processes because of the rapidly growing number of entities like proper nouns and domain-specific nouns. This paper proposes a solution to the problem of Named Entity Recognition (NER) applied to an SDS domain. The proposed solution attempts to automatically recognize the meaning associated with an utterance by using the PANKOW (Pattern based Annotation through Knowledge On the Web) method at runtime. The proposed method extracts information from the Web to extend the SLU knowledge module and reduces the development effort. In particular, the Google Search Engine is used to extract information from the Facebook social network.

Keywords: spoken dialog system, spoken language understanding, web semantic, name entity recognition

Procedia PDF Downloads 304
563 Spoken Rhetoric in Arabic Heritage

Authors: Ihab Al-Mokrani

Abstract:

The Arabic heritage has two types of spoken rhetoric. The first type is what al-Jaahiz calls “the rhetoric of the sign,” which means body language, together with the rhetoric of silence, which is of no less importance than the rhetoric of the sign, the speaker’s appearance and movements, etc. The second type is the spoken performance of utterances, which carries the arts of written rhetoric such as metaphor, simile, metonymy, etc. Rationale of the study: First: in spite of the factual existence of rhetorical phenomena in the Arabic heritage, there has been no contemporary study handling spoken rhetoric in the Arabic heritage. Second: Arabic civilization is originally a spoken one. Comparing the Arabic culture and civilization, on the one hand, with the Greek, Roman, or Pharaonic cultures and civilizations, on the other, shows that the latter started and flourished in writing, while the former started among illiterate people who had no interest in writing until recently. That difference on the part of the Arabic culture and civilization created a rhetoric different from the rhetoric of the other cultures and civilizations. Third: the spoken nature of the Arabic civilization influenced Arabic rhetoric in the sense that specific rhetorical arts were introduced to match that spoken nature. One of these arts is the art of concision, which compensates for the absence of writing as a means of preserving the text. In addition, this explains why many definitions of Arabic rhetoric defined rhetoric as the art of concision. It also explains the fact that the literary genres known in the Arabic culture, such as poetry, anecdotes, and stories, were limited by the narrow space available, while the literary genres in the Greek culture, such as epics and drama, were of wide scope. 
This does not contradict the fact that some Arabic poems exceed 100 lines, as Arabic poetry was based on the organic unity of the line, which means that every line could stand alone with a full meaning that does not depend on the rest of the poem; and that last aspect has never occurred in any culture other than the Arabic culture.

Keywords: Arabic rhetoric, spoken rhetoric, Arabic heritage, culture

Procedia PDF Downloads 739
562 Anti-Language in Jordanian Spoken Arabic: A Sociolinguistic Perspective

Authors: Ahmad Mohammad Al-Harahsheh

Abstract:

Anti-language reflects anti-society; it is a restricted spoken code used among a group of interlocutors because of anti-society. This study aims to shed light on the sociolinguistic characteristics of the anti-language used by prisoners in Jordan. The participants were 15 male Jordanian prisoners who had recently been released. The data were written down, transliterated, and analyzed on the basis of sociolinguistics and discourse analysis. This study draws on the sociolinguistic theory of language codes as its theoretical framework. The study concludes that anti-language is a male language and is used for secrecy, reflecting the prisoners' tendency to protect themselves from the police; it is a verbal competition, contest, and display. In addition, it is employed to express obnoxious ideas and acts by using more pleasant or blurred words and expressions. Also, the anti-language used by prisoners has six linguistic characteristics in JSA (Jordanian Spoken Arabic): relexicalization, neologism, rhyme formation, semantic change, derivation, and metaphorical expressions.

Keywords: anti-language, Jordanian Spoken Arabic, sociolinguistics, prisoners

Procedia PDF Downloads 330
561 A Corpus-Assisted Discourse Analysis of Adjectival Collocation of the Word 'Education' in the American Context

Authors: Ngan Nguyen

Abstract:

The study analyses adjectives collocating with the word ‘education’ in the American data of the Corpus of Global Web-based English, using a combination of corpus linguistic and discourse analytical methods to examine not only language patterns but also the social and political ideologies around the topic. Significant conclusions are drawn: (1) a large number of adjectival collocates of the word ‘education’ have been identified and classified into four categories representing four different aspects of education: level, quality, forms, and types of education; (2) in combination with the first three categories, ‘education’ carries the meaning of the act and process of teaching and learning, while with the last category it has the meaning of a particular kind of teaching or training; (3) higher education is the topic that attracts the most concern from the American public; (4) the five most significant ideologies discovered in the corpus are: higher education is associated with financial affairs, higher education is an industry, the government's monetary policy on higher education, people require greater accessibility to higher education, and people value higher education. The study contributes to the study of word meaning through corpus analysis and to the field of discourse analysis.

Keywords: adjectival collocation, American context, corpus linguistics, discourse analysis, education

Procedia PDF Downloads 297
560 Saudi Twitter Corpus for Sentiment Analysis

Authors: Adel Assiri, Ahmed Emam, Hmood Al-Dossari

Abstract:

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have directly applied SA to Arabic due to the lack of a publicly available dataset for this language. This paper partially bridges this gap by focusing on one of the Arabic dialects, the Saudi dialect. It presents an annotated dataset of 4700 for Saudi dialect sentiment analysis (K = 0.807). Our next step is to extend this corpus and to create a large-scale lexicon for the Saudi dialect from the corpus.
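The abstract reports K = 0.807 without naming the statistic. If it denotes Cohen's kappa between two annotators (an assumption, not stated in the source), the agreement figure could be computed along the lines of the minimal sketch below. The function name and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (positive / negative / neutral).
annotator_1 = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu"]
annotator_2 = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, so a value near 0.8 indicates agreement well above what the label distribution alone would produce.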

Keywords: Arabic, sentiment analysis, Twitter, annotation

Procedia PDF Downloads 587
559 The Istrian Istrovenetian-Croatian Bilingual Corpus

Authors: Nada Poropat Jeletic, Gordana Hrzica

Abstract:

Bilingual conversational corpora represent a meaningful and the most comprehensive data source for investigating genuine contact phenomena in non-monitored bilingual speech production. They can be particularly useful for bilingual research since some features of bilingual interaction can hardly be accessed with more traditional methodologies (e.g., elicitation tasks). The method of language sampling provides the resources for describing language interaction in a bilingual community and/or in bilingual situations (e.g., code-switching, amount of each language used, number of languages used, etc.). To capture these phenomena in genuine communication situations, such sampling should be as close as possible to spontaneous communication. Bilingual spoken corpus design is methodologically demanding. Therefore, this paper aims at describing the methodological challenges that apply to the design of the conversational Istrian Istrovenetian-Croatian Bilingual Corpus. Croatian is the first official language of the officially bilingual (Croatian-Italian) Istria County, while Istrovenetian is a diatopic subvariety of Venetian, a long-standing lingua franca in the Istrian peninsula, the mother tongue of the members of the Italian National Community in Istria, and the primary code of informal everyday communication among the Istrian Italophone population. Within the CLARIN infrastructure, TalkBank is being used, as it provides relevant procedures for designing and analyzing bilingual corpora. Furthermore, public availability allows for easy replication of studies and cumulative progress as a research community builds up around the corpus, while the tools developed within the field of corpus linguistics enable easy retrieval and analysis of information. The method of language sampling employed is kept at the level of spontaneous communication in order to maximise the naturalness of the collected conversational data. 
All speakers have provided written informed consent in which they agree to be recorded at a random point within the period of one month after signing the consent. Participants are administered a background questionnaire providing information about their socioeconomic status and about language exposure and usage in the participants' social networks. Recorded data are being transcribed, phonologically adapted within a standard-sized orthographic form, coded, and segmented (speech streams are being segmented into communication units based on syntactic criteria), and are being marked following the CHAT transcription system and its associated CLAN suite of programmes within the TalkBank toolkit. The corpus consists of transcribed sound recordings of 36 bilingual speakers, while the target is to publish the whole corpus by the end of 2020 by sampling spontaneous conversations among approximately 100 speakers from all the bilingual areas of Istria to ensure representativeness (the participants are being recruited across three generations of native bilingual speakers in all the bilingual areas of the peninsula). Conversational corpora are still rare in TalkBank, so the Corpus will contribute to BilingBank as a highly relevant and scientifically reliable resource for an internationally established and active research community. Research on communities with societal bilingualism will contribute to the growing body of research on bilingualism and multilingualism, especially regarding topics of language dominance, language attrition and loss, interference, code-switching, etc.

Keywords: conversational corpora, bilingual corpora, code-switching, language sampling, corpus design methodology

Procedia PDF Downloads 105
558 Corpus Linguistic Methods in a Theoretical Study of Quran Verb Tense and Aspect in Translations from Arabic to English

Authors: Jawharah Alasmari

Abstract:

In the inflectional morphology of verbs, tense and aspect indicate the time of an action (past, present, or future) and whether it is completed or not. The usage and meaning of tense and aspect differ in Arabic and English; therefore, there is no simple one-to-one mapping from an Arabic inflected verb form to an English one, and an appropriate English translation depends on a range of features, including the immediate and wider context of use. The Quranic Arabic Corpus includes seven alternative expertly crafted English translations of each Arabic verse, which provides a test dataset for the study of appropriate Arabic-to-English translations of verb tense and aspect. We applied corpus linguistics methods in a theoretical study of exemplary verbs to elicit candidate verbal contexts which influence the choice of English inflection for each verse.

Keywords: Corpus linguistics methods, Arabic verb, tense and aspect, English translations

Procedia PDF Downloads 352
557 Combining Corpus Linguistics and Critical Discourse Analysis to Study Power Relations in Hindi Newspapers

Authors: Vandana Mishra, Niladri Sekhar Dash, Jayshree Charkraborty

Abstract:

The present paper focuses on the application of corpus linguistics techniques for critical discourse analysis (CDA) of Hindi newspapers. While corpus linguistics is the study of language as expressed in corpora (samples) of 'real world' text, CDA is an interdisciplinary approach to the study of discourse that views language as a form of social practice. CDA has mainly been pursued from a qualitative perspective. However, recent studies have begun combining corpus linguistics with CDA in analyzing large volumes of text for the study of existing power relations in society. The corpus under study is also sizable (1 million words of Hindi newspaper texts), and its analysis requires an alternative analytical procedure. We have therefore combined the quantitative approach, i.e., the use of corpus techniques, with CDA’s traditional qualitative analysis. In this context, we have focused on keyword analysis, sorting concordance lines of the selected keywords, and calculating collocates of the keywords. We have made use of WordSmith Tools for all these analyses. The analysis starts with identifying the keywords in the political news corpus when compared with the main news corpus. The keywords are extracted from the corpus based on their keyness, calculated through statistical tests like the chi-squared test and the log-likelihood test on the frequent words of the corpus. Some of the top occurring keywords are मोदी (Modi), भाजपा (BJP), कांग्रेस (Congress), सरकार (Government) and पार्टी (Political party). This is followed by the concordance analysis of these keywords, which generates thousands of lines, of which we select a few and examine them based on our objective. We have also calculated the collocates of the keywords based on their Mutual Information (MI) score. Both concordance and collocation help to identify lexical patterns in the political texts. 
Finally, all these quantitative results derived from the corpus techniques will be subjectively interpreted in accordance with CDA theory to examine the ways in which political news discourse produces social and political inequality, power abuse, or domination.
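The keyness and collocation statistics named in this abstract can be illustrated with a minimal sketch. It assumes the commonly used two-cell form of the log-likelihood statistic and the basic pointwise MI formula without a span adjustment; it is not WordSmith Tools' exact computation, and all function names and counts below are hypothetical.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of a word: target corpus vs. reference corpus."""
    total = freq_target + freq_ref
    # Frequencies expected if the word were equally likely in both corpora.
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def mutual_information(cooc, freq_node, freq_collocate, corpus_size):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(cooc / expected)


# Hypothetical counts: a word occurring 50 times in a 1,000-token political
# sub-corpus and 10 times in a 1,000-token reference corpus.
keyness = log_likelihood(50, 1_000, 10, 1_000)
# Hypothetical counts: node and collocate each occur 100 times in a
# 10,000-token corpus and co-occur 16 times.
mi = mutual_information(16, 100, 100, 10_000)
```

The log-likelihood value does not by itself show the direction of the difference; comparing the relative frequencies in the two corpora indicates whether the word is over- or under-represented in the target corpus.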

Keywords: critical discourse analysis, corpus linguistics, Hindi newspapers, power relations

Procedia PDF Downloads 179
556 A Corpus-Based Discourse Analysis of the Disappearance of MH370 in Malaysia and United Kingdom Newspapers: A Pilot Study

Authors: Theng Theng Ong

Abstract:

This pilot study adopts a corpus-based discourse analysis to explore the construction of the Malaysia Airlines MH370 tragedy in selected Malaysian and United Kingdom (UK) newspapers. Fairclough’s three-dimensional model is adopted in the study to support the corpus-based analysis. The analysis aims to determine the ways in which the Malaysia Airlines MH370 tragedy is linguistically defined and constructed in terms of keywords and collocation. The study also seeks to identify the types of discourse that are presented in the news articles. In addition, the differences or similarities in terms of keywords, topics, or issues covered by the selected Malaysian and UK news media are examined.

Keywords: corpus, CDA, newspapers, airline tragedies

Procedia PDF Downloads 260
555 The Effect of Problem-Based Mobile-Assisted Tasks on Spoken Intelligibility of English as a Foreign Language Learners

Authors: Loghman Ansarian, Teoh Mei Lin

Abstract:

In an attempt to increase oral proficiency of Iranian EFL learners, the researchers compared the effect of problem-based mobile-assisted language learning with the conventional language learning approach (Communicative Language Teaching) in Iran. The experimental group (n=37) went through PBL instruction and the control group (n=33) went through conventional instruction. The results of quantitative data analysis after 26 sessions of treatment revealed that PBL could positively affect participants' knowledge of grammar, vocabulary, spoken fluency, and pronunciation; however, in terms of task achievement, no significant effect was found. This study can have pedagogical implications for language teachers, and material developers.

Keywords: problem-based learning, spoken intelligibility, Iranian EFL context, cognitive learning

Procedia PDF Downloads 144
554 A Corpus-Based Analysis of "MeToo" Discourse in South Korea: Coverage Representation in Korean Newspapers

Authors: Sun-Hee Lee, Amanda Kraley

Abstract:

The “MeToo” movement is a social movement against sexual abuse and harassment. Though the hashtag went viral in 2017 following different cultural flashpoints in different countries, the initial response was quiet in South Korea. This radically changed in January 2018, when a high-ranking senior prosecutor, Seo Ji-hyun, gave a televised interview discussing being sexually assaulted by a colleague. Acknowledging public anger, particularly among women, on the long-existing problems of sexual harassment and abuse, the South Korean media have focused on several high-profile cases. Analyzing the media representation of these cases is a window into the evolving South Korean discourse around “MeToo.” This study presents a linguistic analysis of “MeToo” discourse in South Korea by utilizing a corpus-based approach. The term corpus (pl. corpora) is used to refer to electronic language data, that is, any collection of recorded instances of spoken or written language. To conduct this language analysis, a “MeToo” corpus has been collected by extracting newspaper articles containing the keyword “MeToo” from BIGKinds, a big data analysis service, and Nexis Uni, an online academic database search engine. The corpus analysis explores how Korean media represent accusers and the accused, victims and perpetrators. The extracted data includes 5,885 articles from four broadsheet newspapers (Chosun, JoongAng, Hangyore, and Kyunghyang) and 88 articles from two Korea-based English newspapers (Korea Times and Korea Herald) between January 2017 and November 2020. The analysis includes basic data analysis with respect to keyword frequency and network analysis and adds refined examinations of select corpus samples through naming strategies, semantic relations, and pragmatic properties. 
Along with the exponential increase in the number of articles containing the keyword “MeToo,” from 104 articles in 2017 to 3,546 articles in 2018, the network and keyword analysis highlights ‘US,’ ‘Harvey Weinstein,’ and ‘Hollywood’ as keywords for 2017, with articles in 2018 highlighting ‘Seo Ji-Hyun,’ ‘politics,’ ‘President Moon,’ ‘An Ui-Jeong,’ ‘Lee Yoon-taek’ (the names of perpetrators), and ‘(Korean) society.’ This outcome demonstrates the shift of media focus from international affairs to domestic cases. Another crucial finding is that the word ‘defamation’ is widely distributed in the “MeToo” corpus. This relates to the South Korean legal system, in which a person who defames another by publicly alleging information detrimental to their reputation, whether factual or fabricated, is punishable by law (Article 307 of the Criminal Act of Korea). If the defamation occurs on the internet, it is subject to aggravated punishment under the Act on Promotion of Information and Communications Network Utilization and Information Protection. These laws, in particular, have been used against accusers who have publicly come forward in the wake of “MeToo” in South Korea, adding an extra dimension of risk. This corpus analysis of “MeToo” newspaper articles contributes to the analysis of the media representation of the “MeToo” movement and sheds light on the shifting landscape of gender relations in the public sphere in South Korea.

Keywords: corpus linguistics, MeToo, newspapers, South Korea

Procedia PDF Downloads 177
553 Passive Voice in SLA: Armenian Learners’ Case Study

Authors: Emma Nemishalyan

Abstract:

It is believed that learners’ mother tongue (L1 hereafter) has a huge impact on their acquisition of a second language (L2 hereafter). This hypothesis has met with both support and criticism. Based on research results from a wide range of learner corpora (Chinese, Japanese, and Spanish, among others), the hypothesis has been either supported or rejected. However, no such study has been conducted on Armenian learners. The aim of this paper is to examine the hypothesis on an Armenian learners’ corpus in terms of the use of the passive voice. To this end, the method of Contrastive Interlanguage Analysis (hereafter CIA) has been applied to a native speakers’ corpus (the Louvain Corpus of Native English Essays (LOCNESS)) and an Armenian learners’ corpus compiled by the author in compliance with the International Corpus of Learner English (ICLE) guidelines. CIA compares the interlanguage (the language produced by learners) with the language produced by native speakers. With the help of this method, it is possible not only to highlight the mistakes that learners make, but also to identify underuse or overuse. The choice of the grammar issue (the passive voice) is motivated by the fact that Armenian and English are typologically very different, as they belong to different branches. Moreover, the passive voice is considered to be one of the most problematic grammar topics for learners of English to acquire. Based on this difference, we hypothesized that Armenian learners would either overuse or underuse some types of the passive voice. With the help of the LancsBox software, we have identified the frequency rates of passive voice usage in LOCNESS and in the Armenian learners’ corpus to understand whether the latter has the same usage pattern of the passive voice as the native speakers. Secondly, we have identified the types of the passive voice used by the Armenian learners, trying to trace the reasons to their mother tongue. 
The results of the study showed that Armenian learners underused the passive voice in contrast to native speakers. Furthermore, the hypothesis that learners’ L1 has an impact on learners’ L2 acquisition and production was supported.
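Because a learner corpus and a native corpus usually differ in size, a CIA comparison of the kind described here rests on normalised rather than raw frequencies. The sketch below shows one simple way to express over- or underuse as a ratio of per-10,000-word rates; the function names and all counts are hypothetical and are not figures from the study.

```python
def per_10k(count, corpus_size):
    """Frequency normalised to occurrences per 10,000 tokens."""
    return count * 10_000 / corpus_size


def usage_ratio(learner_count, learner_size, native_count, native_size):
    """Learner rate divided by native rate: below 1 is underuse, above 1 overuse."""
    return per_10k(learner_count, learner_size) / per_10k(native_count, native_size)


# Hypothetical counts of passive constructions and corpus sizes in tokens.
ratio = usage_ratio(420, 200_000, 1_300, 320_000)
label = "underuse" if ratio < 1 else "overuse" if ratio > 1 else "same rate"
```

A ratio alone does not show whether the difference is statistically reliable; a significance test such as log-likelihood is typically reported alongside it.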

Keywords: corpus linguistics, applied linguistics, second language acquisition, corpus compilation

Procedia PDF Downloads 48