Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 466

Search results for: corpus linguistics

466 Semantic Preference across Research Articles: A Corpus-Based Study of Adjectives in English

Authors: Valdênia Carvalho e Almeida

Abstract:

The goal of the present study is to investigate the semantic preference of the most frequent adjectives in research articles through a corpus-based analysis of texts published in journals in Applied Linguistics (AL). The corpus used in this study contains texts published in the period from 2014 to 2018 in the three journals: Language Learning and Technology; English for Academic Purposes, and TESOL Quaterly, totaling more than one million words. A corpus-based analysis was carried out on the corpus to identify the most frequent adjectives that co-occurred in the three journals. By observing the concordance lines of the adjectives and analyzing the words they associated with, the semantic preferences of each adjective were determined. Later, the AL corpus analysis was compared to the investigation of the same adjectives in a corpus of Chemistry. This second part of the study aimed to identify possible differences and similarities between the two corpora in relation to the use of the adjectives in research articles from both areas. The results show that there are some preferences which seem to be closely related not only to the academic genre of the texts but also to the specific domain of the discipline and, to a lesser extent, to the context of research in each journal. This research illustrates a possible contribution of Corpus Linguistics to explore the concept of semantic preference in more detail, considering the complex nature of the phenomenon.

Keywords: applied linguistics, corpus linguistics, chemistry, research article, semantic preference

Procedia PDF Downloads 92
465 Combining Corpus Linguistics and Critical Discourse Analysis to Study Power Relations in Hindi Newspapers

Authors: Vandana Mishra, Niladri Sekhar Dash, Jayshree Charkraborty

Abstract:

This present paper focuses on the application of corpus linguistics techniques for critical discourse analysis (CDA) of Hindi newspapers. While Corpus linguistics is the study of language as expressed in corpora (samples) of 'real world' text, CDA is an interdisciplinary approach to the study of discourse that views language as a form of social practice. CDA has mainly been studied from a qualitative perspective. However, we can say that recent studies have begun combining corpus linguistics with CDA in analyzing large volumes of text for the study of existing power relations in society. The corpus under our study is also of a sizable amount (1 million words of Hindi newspaper texts) and its analysis requires an alternative analytical procedure. So, we have combined both the quantitative approach i.e. the use of corpus techniques with CDA’s traditional qualitative analysis. In this context, we have focused on the Keyword Analysis Sorting Concordance Lines of the selected Keywords and calculating collocates of the keywords. We have made use of the Wordsmith Tool for all these analysis. The analysis starts with identifying the keywords in the political news corpus when compared with the main news corpus. The keywords are extracted from the corpus based on their keyness calculated through statistical tests like chi-squared test and log-likelihood test on the frequent words of the corpus. Some of the top occurring keywords are मोदी (Modi), भाजपा (BJP), कांग्रेस (Congress), सरकार (Government) and पार्टी (Political party). This is followed by the concordance analysis of these keywords which generates thousands of lines but we have to select few lines and examine them based on our objective. We have also calculated the collocates of the keywords based on their Mutual Information (MI) score. Both concordance and collocation help to identify lexical patterns in the political texts. Finally, all these quantitative results derived from the corpus techniques will be subjectively interpreted in accordance to the CDA’s theory to examine the ways in which political news discourse produces social and political inequality, power abuse or domination.

Keywords: critical discourse analysis, corpus linguistics, Hindi newspapers, power relations

Procedia PDF Downloads 117
464 Corpus Linguistic Methods in a Theoretical Study of Quran Verb Tense and Aspect in Translations from Arabic to English

Authors: Jawharah Alasmari

Abstract:

In inflectional morphology of verb, tense and aspect indicate action’s time either past/present or future and their period whether completed or not. The usage and meaning of tense and aspect differ in Arabic and English, therefore is no simple one -to- one mapping from an Arabic verb inflected form an appropriate English translation depends on a range of features, including immediate and wider context of use. The Quranic Arabic Corpus includes seven alternative expertly crafted English translations of each Arabic verses, which provides a test dataset for the study of appropriate Arabic to English translations of verb tense and aspect. We applied Corpus Linguistics Methods in a theoretical study of exemplary verbs, to elicit candidate verbal contexts which influence the choice of English inflection for each verse.

Keywords: Corpus linguistics methods, Arabic verb, tense and aspect, English translations

Procedia PDF Downloads 305
463 Compilation and Statistical Analysis of an Arabic-English Legal Corpus in Sketch Engine

Authors: C. Brierley, H. El-Farahaty, A. Farhan

Abstract:

The Leeds Parallel Corpus of Arabic-English Constitutions is a parallel corpus for the Arabic legal domain. Analysis of legal language via Corpus Linguistics techniques is an important development. In legal proceedings, a corpus-based approach to disambiguating meaning is set to replace the dictionary as an interpretative tool, and legal scholarship in the States is now attuned to the potential for Text Analytics over vast quantities of text-based legal material, following the business and medical industries. This trend is reflected in Europe: the interdisciplinary research group in Computer Assisted Legal Linguistics mines big data collections of legal and non-legal texts to analyse: legal interpretations; legal discourse; the comprehensibility of legal texts; conflict resolution; and linguistic human rights. This paper focuses on ‘dignity’ as an important aspect of the overarching concept of human rights in current constitutions across the Arab world. We have compiled a parallel, Arabic-English raw text corpus (169,861 Arabic words and 205,893 English words) from reputable websites such as the World Intellectual Property Organisation and CONSTITUTE, and uploaded and queried our corpus in Sketch Engine. Our most challenging task was sentence-level alignment of Arabic-English data. This entailed manual intervention to ensure correspondence on a one-to-many basis since Arabic sentences differ from English in length and punctuation. We have searched for morphological variants of ‘dignity’ (رامة ك, karāma) in the Arabic data and inspected their English translation equivalents. The term occurs most frequently in the Sudanese constitution (10 instances), and not at all in the constitution of Palestine. Its most frequent collocate, determined via the logDice statistic in Sketch Engine, is ‘human’ as in ‘human dignity’.

Keywords: Arabic constitution, corpus-based legal linguistics, human rights, parallel Arabic-English legal corpora

Procedia PDF Downloads 84
462 Words of Peace in the Speeches of the Egyptian President, Abdulfattah El-Sisi: A Corpus-Based Study

Authors: Mohamed S. Negm, Waleed S. Mandour

Abstract:

The present study aims primarily at investigating words of peace (lexemes of peace) in the formal speeches of the Egyptian president Abdulfattah El-Sisi in a two-year span of time, from 2018 to 2019. This paper attempts to shed light not only on the contextual use of the antonyms, war and peace, but also it underpins quantitative analysis through the current methods of corpus linguistics. As such, the researchers have deployed a corpus-based approach in collecting, encoding, and processing 30 presidential speeches over the stated period (23,411 words and 25,541 tokens in total). Further, semantic fields and collocational networkzs are identified and compared statistically. Results have shown a significant propensity of adopting peace, including its relevant collocation network, textually and therefore, ideationally, at the expense of war concept which in most cases surfaces euphemistically through the noun conflict. The president has not justified the action of war with an honorable cause or a valid reason. Such results, so far, have indicated a positive sociopolitical mindset the Egyptian president possesses and moreover, reveal national and international fair dealing on arising issues.

Keywords: CADS, collocation network, corpus linguistics, critical discourse analysis

Procedia PDF Downloads 72
461 Corpus Linguistics as a Tool for Translation Studies Analysis: A Bilingual Parallel Corpus of Students’ Translations

Authors: Juan-Pedro Rica-Peromingo

Abstract:

Nowadays, corpus linguistics has become a key research methodology for Translation Studies, which broadens the scope of cross-linguistic studies. In the case of the study presented here, the approach used focuses on learners with little or no experience to study, at an early stage, general mistakes and errors, the correct or incorrect use of translation strategies, and to improve the translational competence of the students. Led by Sylviane Granger and Marie-Aude Lefer of the Centre for English Corpus Linguistics of the University of Louvain, the MUST corpus (MUltilingual Student Translation Corpus) is an international project which brings together partners from Europe and worldwide universities and connects Learner Corpus Research (LCR) and Translation Studies (TS). It aims to build a corpus of translations carried out by students including both direct (L2 > L1) an indirect (L1 > L2) translations, from a great variety of text types, genres, and registers in a wide variety of languages: audiovisual translations (including dubbing, subtitling for hearing population and for deaf population), scientific, humanistic, literary, economic and legal translation texts. This paper focuses on the work carried out by the Spanish team from the Complutense University (UCMA), which is part of the MUST project, and it describes the specific features of the corpus built by its members. All the texts used by UCMA are either direct or indirect translations between English and Spanish. Students’ profiles comprise translation trainees, foreign language students with a major in English, engineers studying EFL and MA students, all of them with different English levels (from B1 to C1); for some of the students, this would be their first experience with translation. The MUST corpus is searchable via Hypal4MUST, a web-based interface developed by Adam Obrusnik from Masaryk University (Czech Republic), which includes a translation-oriented annotation system (TAS). A distinctive feature of the interface is that it allows source texts and target texts to be aligned, so we can be able to observe and compare in detail both language structures and study translation strategies used by students. The initial data obtained point out the kind of difficulties encountered by the students and reveal the most frequent strategies implemented by the learners according to their level of English, their translation experience and the text genres. We have also found common errors in the graduate and postgraduate university students’ translations: transfer errors, lexical errors, grammatical errors, text-specific translation errors, and cultural-related errors have been identified. Analyzing all these parameters will provide more material to bring better solutions to improve the quality of teaching and the translations produced by the students.

Keywords: corpus studies, students’ corpus, the MUST corpus, translation studies

Procedia PDF Downloads 56
460 Modeling False Statements in Texts

Authors: Francielle A. Vargas, Thiago A. S. Pardo

Abstract:

According to the standard philosophical definition, lying is saying something that you believe to be false with the intent to deceive. For deception detection, the FBI trains its agents in a technique named statement analysis, which attempts to detect deception based on parts of speech (i.e., linguistics style). This method is employed in interrogations, where the suspects are first asked to make a written statement. In this poster, we model false statements using linguistics style. In order to achieve this, we methodically analyze linguistic features in a corpus of fake news in the Portuguese language. The results show that they present substantial lexical, syntactic and semantic variations, as well as punctuation and emotion distinctions.

Keywords: deception detection, linguistics style, computational linguistics, natural language processing

Procedia PDF Downloads 107
459 A Corpus-Assisted Discourse Analysis of Adjectival Collocation of the Word 'Education' in the American Context

Authors: Ngan Nguyen

Abstract:

The study analyses adjectives collocating with the word ‘education’ in the American language of the Corpus of Global Web-based English using a combination of corpus linguistic and discourse analytical methods to examine not only language patterns but also social political ideologies around the topic. Significant conclusions are deduced: (1) there are a large number of adjectival collocates of the word education which have been identified and classified into four categories representing four different aspects of education: level, quality, forms and types of education; (2) education, as in combination with three first categories, carries the meaning as the act and process of teaching and learning while with the last category having the meaning of a particular kind of teaching or training; (3) higher education is the topic that gains most concerns from the American public; (4) five most significant ideologies are discovered from the corpus: higher education associates with financial affairs, higher education is an industry, monetary policy of the government on higher education, people require greater accessibility to higher education and people value higher education. The study contributes to the field of developing meanings of words through corpus analysis and the field of discourse analysis.

Keywords: adjectival collocation, American context, corpus linguistics, discourse analysis, education

Procedia PDF Downloads 238
458 The Automatisation of Dictionary-Based Annotation in a Parallel Corpus of Old English

Authors: Ana Elvira Ojanguren Lopez, Javier Martin Arista

Abstract:

The aims of this paper are to present the automatisation procedure adopted in the implementation of a parallel corpus of Old English, as well as, to assess the progress of automatisation with respect to tagging, annotation, and lemmatisation. The corpus consists of an aligned parallel text with word-for-word comparison Old English-English that provides the Old English segment with inflectional form tagging (gloss, lemma, category, and inflection) and lemma annotation (spelling, meaning, inflectional class, paradigm, word-formation and secondary sources). This parallel corpus is intended to fill a gap in the field of Old English, in which no parallel and/or lemmatised corpora are available, while the average amount of corpus annotation is low. With this background, this presentation has two main parts. The first part, which focuses on tagging and annotation, selects the layouts and fields of lexical databases that are relevant for these tasks. Most information used for the annotation of the corpus can be retrieved from the lexical and morphological database Nerthus and the database of secondary sources Freya. These are the sources of linguistic and metalinguistic information that will be used for the annotation of the lemmas of the corpus, including morphological and semantic aspects as well as the references to the secondary sources that deal with the lemmas in question. Although substantially adapted and re-interpreted, the lemmatised part of these databases draws on the standard dictionaries of Old English, including The Student's Dictionary of Anglo-Saxon, An Anglo-Saxon Dictionary, and A Concise Anglo-Saxon Dictionary. The second part of this paper deals with lemmatisation. It presents the lemmatiser Norna, which has been implemented on Filemaker software. It is based on a concordance and an index to the Dictionary of Old English Corpus, which comprises around three thousand texts and three million words. In its present state, the lemmatiser Norna can assign lemma to around 80% of textual forms on an automatic basis, by searching the index and the concordance for prefixes, stems and inflectional endings. The conclusions of this presentation insist on the limits of the automatisation of dictionary-based annotation in a parallel corpus. While the tagging and annotation are largely automatic even at the present stage, the automatisation of alignment is pending for future research. Lemmatisation and morphological tagging are expected to be fully automatic in the near future, once the database of secondary sources Freya and the lemmatiser Norna have been completed.

Keywords: corpus linguistics, historical linguistics, old English, parallel corpus

Procedia PDF Downloads 110
457 A Comparison of the First Language Vocabulary Used by Indonesian Year 4 Students and the Vocabulary Taught to Them in English Language Textbooks

Authors: Fitria Ningsih

Abstract:

This study concerns on the process of making corpus obtained from Indonesian year 4 students’ free writing compared to the vocabulary taught in English language textbooks. 369 students’ sample writings from 19 public elementary schools in Malang, East Java, Indonesia and 5 selected English textbooks were analyzed through corpus in linguistics method using AdTAT -the Adelaide Text Analysis Tool- program. The findings produced wordlists of the top 100 words most frequently used by students and the top 100 words given in English textbooks. There was a 45% match between the two lists. Furthermore, the classifications of the top 100 most frequent words from the two corpora based on part of speech found that both the Indonesian and English languages employed a similar use of nouns, verbs, adjectives, and prepositions. Moreover, to see the contextualizing the vocabulary of learning materials towards the students’ need, a depth-analysis dealing with the content and the cultural views from the vocabulary taught in the textbooks was discussed through the criteria developed from the checklist. Lastly, further suggestions are addressed to language teachers to understand the students’ background such as recognizing the basic words students acquire before teaching them new vocabulary in order to achieve successful learning of the target language.

Keywords: corpus, frequency, English, Indonesian, linguistics, textbooks, vocabulary, wordlists, writing

Procedia PDF Downloads 110
456 Statistical Comparison of Machine and Manual Translation: A Corpus-Based Study of Gone with the Wind

Authors: Yanmeng Liu

Abstract:

This article analyzes and compares the linguistic differences between machine translation and manual translation, through a case study of the book Gone with the Wind. As an important carrier of human feeling and thinking, the literature translation poses a huge difficulty for machine translation, and it is supposed to expose distinct translation features apart from manual translation. In order to display linguistic features objectively, tentative uses of computerized and statistical evidence to the systematic investigation of large scale translation corpora by using quantitative methods have been deployed. This study compiles bilingual corpus with four versions of Chinese translations of the book Gone with the Wind, namely, Piao by Chunhai Fan, Piao by Huairen Huang, translations by Google Translation and Baidu Translation. After processing the corpus with the software of Stanford Segmenter, Stanford Postagger, and AntConc, etc., the study analyzes linguistic data and answers the following questions: 1. How does the machine translation differ from manual translation linguistically? 2. Why do these deviances happen? This paper combines translation study with the knowledge of corpus linguistics, and concretes divergent linguistic dimensions in translated text analysis, in order to present linguistic deviances in manual and machine translation. Consequently, this study provides a more accurate and more fine-grained understanding of machine translation products, and it also proposes several suggestions for machine translation development in the future.

Keywords: corpus-based analysis, linguistic deviances, machine translation, statistical evidence

Procedia PDF Downloads 54
455 A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: Aqa-Webcorp

Authors: Wided Bakari, Patrce Bellot, Mahmoud Neji

Abstract:

With the development of electronic media and the heterogeneity of Arabic data on the Web, the idea of building a clean corpus for certain applications of natural language processing, including machine translation, information retrieval, question answer, become more and more pressing. In this manuscript, we seek to create and develop our own corpus of pair’s questions-texts. This constitution then will provide a better base for our experimentation step. Thus, we try to model this constitution by a method for Arabic insofar as it recovers texts from the web that could prove to be answers to our factual questions. To do this, we had to develop a java script that can extract from a given query a list of html pages. Then clean these pages to the extent of having a database of texts and a corpus of pair’s question-texts. In addition, we give preliminary results of our proposal method. Some investigations for the construction of Arabic corpus are also presented in this document.

Keywords: Arabic, web, corpus, search engine, URL, question, corpus building, script, Google, html, txt

Procedia PDF Downloads 245
454 Using Corpora in Semantic Studies of English Adjectives

Authors: Oxana Lukoshus

Abstract:

The methods of corpus linguistics, a well-established field of research, are being increasingly applied in cognitive linguistics. Corpora data are especially useful for different quantitative studies of grammatical and other aspects of language. The main objective of this paper is to demonstrate how present-day corpora can be applied in semantic studies in general and in semantic studies of adjectives in particular. Polysemantic adjectives have been the subject of numerous studies. But most of them have been carried out on dictionaries. Undoubtedly, dictionaries are viewed as one of the basic data sources, but only at the initial steps of a research. The author usually starts with the analysis of the lexicographic data after which s/he comes up with a hypothesis. In the research conducted three polysemantic synonyms true, loyal, faithful have been analyzed in terms of differences and similarities in their semantic structure. A corpus-based approach in the study of the above-mentioned adjectives involves the following. After the analysis of the dictionary data there was the reference to the following corpora to study the distributional patterns of the words under study – the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). These corpora are continually updated and contain thousands of examples of the words under research which make them a useful and convenient data source. For the purpose of this study there were no special needs regarding genre, mode or time of the texts included in the corpora. Out of the range of possibilities offered by corpus-analysis software (e.g. word lists, statistics of word frequencies, etc.), the most useful tool for the semantic analysis was the extracting a list of co-occurrence for the given search words. Searching by lemmas, e.g. true, true to, and grouping the results by lemmas have proved to be the most efficient corpora feature for the adjectives under the study. Following the search process, the corpora provided a list of co-occurrences, which were then to be analyzed and classified. Not every co-occurrence was relevant for the analysis. For example, the phrases like An enormous sense of responsibility to protect the minds and hearts of the faithful from incursions by the state was perceived to be the basic duty of the church leaders or ‘True,’ said Phoebe, ‘but I'd probably get to be a Union Official immediately were left out as in the first example the faithful is a substantivized adjective and in the second example true is used alone with no other parts of speech. The subsequent analysis of the corpora data gave the grounds for the distribution groups of the adjectives under the study which were then investigated with the help of a semantic experiment. To sum it up, the corpora-based approach has proved to be a powerful, reliable and convenient tool to get the data for the further semantic study.

Keywords: corpora, corpus-based approach, polysemantic adjectives, semantic studies

Procedia PDF Downloads 242
453 Tagging a corpus of Media Interviews with Diplomats: Challenges and Solutions

Authors: Roberta Facchinetti, Sara Corrizzato, Silvia Cavalieri

Abstract:

Increasing interconnection between data digitalization and linguistic investigation has given rise to unprecedented potentialities and challenges for corpus linguists, who need to master IT tools for data analysis and text processing, as well as to develop techniques for efficient and reliable annotation in specific mark-up languages that encode documents in a format that is both human and machine-readable. In the present paper, the challenges emerging from the compilation of a linguistic corpus will be taken into consideration, focusing on the English language in particular. To do so, the case study of the InterDiplo corpus will be illustrated. The corpus, currently under development at the University of Verona (Italy), represents a novelty in terms both of the data included and of the tag set used for its annotation. The corpus covers media interviews and debates with diplomats and international operators conversing in English with journalists who do not share the same lingua-cultural background as their interviewees. To date, this appears to be the first tagged corpus of international institutional spoken discourse and will be an important database not only for linguists interested in corpus analysis but also for experts operating in international relations. In the present paper, special attention will be dedicated to the structural mark-up, parts of speech annotation, and tagging of discursive traits, that are the innovational parts of the project being the result of a thorough study to find the best solution to suit the analytical needs of the data. Several aspects will be addressed, with special attention to the tagging of the speakers’ identity, the communicative events, and anthropophagic. Prominence will be given to the annotation of question/answer exchanges to investigate the interlocutors’ choices and how such choices impact communication. Indeed, the automated identification of questions, in relation to the expected answers, is functional to understand how interviewers elicit information as well as how interviewees provide their answers to fulfill their respective communicative aims. A detailed description of the aforementioned elements will be given using the InterDiplo-Covid19 pilot corpus. The data yielded by our preliminary analysis of the data will highlight the viable solutions found in the construction of the corpus in terms of XML conversion, metadata definition, tagging system, and discursive-pragmatic annotation to be included via Oxygen.

Keywords: spoken corpus, diplomats’ interviews, tagging system, discursive-pragmatic annotation, english linguistics

Procedia PDF Downloads 78
452 Specialized Translation Teaching Strategies: A Corpus-Based Approach

Authors: Yingying Ding

Abstract:

This study presents a methodology of specialized translation with the objective of helping teachers to improve the strategies in teaching translation. In order to allow students to acquire skills to translate specialized texts, they need to become familiar with the semantic and syntactic features of source texts and target texts. The aim of our study is to use a corpus-based approach in the teaching of specialized translation between Chinese and Italian. This study proposes to construct a specialized Chinese - Italian comparable corpus that consists of 50 economic contracts from the domain of food. With the help of AntConc, we propose to compile a comparable corpus in for translation teaching purposes. This paper attempts to provide insight into how teachers could benefit from comparable corpus in the teaching of specialized translation from Italian into Chinese and through some examples of passive sentences how students could learn to apply different strategies for translating appropriately the voice.

Keywords: contrastive studies, specialised translation, corpus-based approach, teaching

Procedia PDF Downloads 282
451 Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development

Authors: L. Kamandulytė-Merfeldienė

Abstract:

The paper deals with the main issues of methodology of the Corpus of Spoken Lithuanian which was started to be developed in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, the transcription of the recorded data, and the grammatical annotation. Collecting the data was based on the principles of balance and naturality. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to the constant increase in studies on spontaneous communication, and various papers have dealt with a distribution of parts of speech, use of different grammatical forms, variation of inflectional paradigms, distribution of fillers, syntactic functions of adjectives, the mean length of utterances.

Keywords: CHILDES, corpus of spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian

Procedia PDF Downloads 159
450 Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: 'Reddit'

Authors: Yasmeen Bassas, Sandra Kuebler, Allen Riddell

Abstract:

Native Language Identification is one of the growing subfields in Natural Language Processing (NLP). The task of Native Language Identification (NLI) is mainly concerned with predicting the native language of an author’s writing in a second language. In this paper, we investigate the performance of two types of features; content-based features vs. content independent features when they are evaluated on a different corpus (using social media data “Reddit”). In this NLI task, the predefined models are trained on one corpus (TOEFL) and then the trained models are evaluated on a different data using an external corpus (Reddit). Several features that have been proven useful for NLI tasks are used in this work (such as word n-grams, character n-grams, POS n-grams and function words). Three classifiers are used in this task; the baseline, linear SVM, and Logistic Regression. Two experiments are conducted in this paper. The first one is to explore the performance of the content-based features versus the content independent ones within one domain using TOEFL corpus. The second experiment is to examine how the trained models perform in a different data using an external corpus called Reddit. The aim is to find out which features (content-based vs content independent features) are more accurate when tested on a different corpus. Results show that content-based features are more accurate and robust than content independent ones when tested within corpus and across corpus.

Keywords: Native Language Identification, Social Media Data: Reddit, NLP, Content-based features, Content independent features

Procedia PDF Downloads 34
449 Corporate Cautionary Statement: A Genre of Professional Communication

Authors: Chie Urawa

Abstract:

Cautionary statements or disclaimers in corporate annual reports need to be carefully designed because clear cautionary statements may protect a company in the case of legal disputes and may undermine positive impressions. This study compares the language of cautionary statements using two corpora, Sony’s cautionary statement corpus (S-corpus) and Panasonic’s cautionary statement corpus (P-corpus), illustrating the differences and similarities in relation to the use of meaningful cautionary statements and critically analyzing why practitioners use the way. The findings describe the distinct differences between the two companies in the presentation of the risk factors and the way how they make the statements. The word ability is used more for legal protection in S-corpus whereas the word possibility is used more to convey a better impression in P-corpus. The main similarities are identified in the use of lexical words and pronouns, and almost the same wordings for eight years. The findings show how they make the statements unique to the company in the presentation of risk factors, and the characteristics of specific genre of professional communication. Important implications of this study are that more comprehensive approach can be applied in other contexts, and be used by companies to reflect upon their cautionary statements.

Keywords: cautionary statements, corporate annual reports, corpus, risk factors

Procedia PDF Downloads 94
448 Verb Bias in Mandarin: The Corpus Based Study of Children

Authors: Jou-An Chung

Abstract:

The purpose of this study is to investigate the verb bias of the Mandarin verbs in children’s reading materials and provide the criteria for categorization. Verb bias varies cross-linguistically. As Mandarin and English are typological different, this study hopes to shed light on Mandarin verb bias with the use of corpus and provide thorough and detailed criteria for analysis. Moreover, this study focuses on children’s reading materials since it is a significant issue in understanding children’s sentence processing. Therefore, investigating verb bias of Mandarin verbs in children’s reading materials is also an important issue and can provide further insights into children’s sentence processing. The small corpus is built up for this study. The corpus consists of the collection of school textbooks and Mandarin Daily News for children. The files are then segmented and POS tagged by JiebaR (Chinese segmentation with R). For the ease of analysis, the one-word character verbs and intransitive verbs are excluded beforehand. The total of 20 high frequency verbs are hand-coded and are further categorized into one of the three types, namely DO type, SC type and other category. If the frequency of taking Other Type exceeds the threshold of 25%, the verb is excluded from the study. The results show that 10 verbs are direct object bias verbs, and six verbs are sentential complement bias verbs. The paired T-test was done to assure the statistical significance (p = 0.0001062 for DO bias verb, p=0.001149 for SC bias verb). The result has shown that in children’s reading materials, the DO biased verbs are used more than the SC bias verbs since the simplest structure of sentences is easier for children’s sentence comprehension or processing. In sum, this study not only discussed verb bias in child's reading materials but also provided basic coding criteria for verb bias analysis in Mandarin and underscored the role of context. Sentences are easier for children’s sentence comprehension or processing. In sum, this study not only discussed verb bias in child corpus, but also provided basic coding criteria for verb bias analysis in Mandarin and underscored the role of context.

Keywords: corpus linguistics, verb bias, child language, psycholinguistics

Procedia PDF Downloads 172
447 A Corpus-Based Study on the Styles of Three Translators

Authors: Wang Yunhong

Abstract:

The present paper is preoccupied with the different styles of three translators in their translating a Chinese classical novel Shuihu Zhuan. Based on a parallel corpus, it adopts a target-oriented approach to look into whether and what stylistic differences and shifts the three translations have revealed. The findings show that the three translators demonstrate different styles concerning their word choices and sentence preferences, which implies that identification of recurrent textual patterns may be a basic step for investigating the style of a translator.

Keywords: corpus, lexical choices, sentence characteristics, style

Procedia PDF Downloads 184
446 A Syntactic Approach to Applied and Socio-Linguistics in Arabic Language in Modern Communications

Authors: Adeyemo Abduljeeel Taiwo

Abstract:

This research is an attempt that creates a conducive atmosphere of a phonological and morphological compendium of Arabic language in Modern Standard Arabic (MSA) for modern day communications. The research is carried out with the chief aim of grammatical analysis of the two broad fields of Arabic linguistics namely: Applied and Socio-Linguistics. It draws a pictorial record of Applied and Socio-Linguistics in Arabic phonology and morphology. Thematically, it postulates and contemplates to a large degree, the theory of concord in contemporary modern Arabic language acquisition. It utilizes an analytical method while it portrays Arabic as a Semitic language that promotes linguistics and syntax among the scholars of the fields.

Keywords: Arabic language, applied linguistics, socio-linguistics, modern communications

Procedia PDF Downloads 232
445 Saudi Twitter Corpus for Sentiment Analysis

Authors: Adel Assiri, Ahmed Emam, Hmood Al-Dossari

Abstract:

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have yet to directly apply SA to Arabic due to lack of a publicly available dataset for this language. This paper partially bridges this gap due to its focus on one of the Arabic dialects which is the Saudi dialect. This paper presents annotated data set of 4700 for Saudi dialect sentiment analysis with (K= 0.807). Our next work is to extend this corpus and creation a large-scale lexicon for Saudi dialect from the corpus.

Keywords: Arabic, sentiment analysis, Twitter, annotation

Procedia PDF Downloads 434
444 Crossing the Interdisciplinary Border: A Multidimensional Linguistics Analysis of a Legislative Discourse

Authors: Manvender Kaur Sarjit Singh

Abstract:

There is a crucial mismatch between classroom written language tasks and real world written language requirements. Realizing the importance of reducing the gap between the professional needs of the legal practitioners and the higher learning institutions that offer the legislative education in Malaysia, it is deemed necessary to develop a framework that integrates real-life written communication with the teaching of content-based legislative discourse to future legal practitioners. By highlighting the actual needs of the legal practitioners in the country, the present teaching practices will be enhanced and aligned with the actual needs of the learners thus realizing the vision and aspirations of the Malaysian Education Blueprint 2013-2025 and Legal Profession Qualifying Board. The need to focus future education according to the actual needs of the learners can be realized by developing a teaching framework which is designed within the prospective requirements of its real-life context. This paper presents the steps taken to develop a specific teaching framework that fulfills the fundamental real-life context of the prospective legal practitioners. The teaching framework was developed based on real-life written communication from the legal profession in Malaysia, using the specific genre analysis approach which integrates a corpus-based approach and a structural linguistics analysis. This approach was adopted due to its fundamental nature of intensive exploration of the real-life written communication according to the established strategies used. The findings showed the use of specific moves and parts-of-speech by the legal practitioners, in order to prepare the selected genre. The teaching framework is hoped to enhance the teachings of content-based law courses offered at present in the higher learning institutions in Malaysia.

Keywords: linguistics analysis, corpus analysis, genre analysis, legislative discourse

Procedia PDF Downloads 306
443 Studying Language of Immediacy and Language of Distance from a Corpus Linguistic Perspective: A Pilot Study of Evaluation Markers in French Television Weather Reports

Authors: Vince Liégeois

Abstract:

Language of immediacy and distance: Within their discourse theory, Koch & Oesterreicher establish a distinction between a language of immediacy and a language of distance. The former refers to those discourses which are oriented more towards a spoken norm, whereas the latter entails discourses oriented towards a written norm, regardless of whether they are realised phonically or graphically. This means that an utterance can be realised phonically but oriented more towards the written language norm (e.g., a scientific presentation or eulogy) or realised graphically but oriented towards a spoken norm (e.g., a scribble or chat messages). Research desiderata: The methodological approach from Koch & Oesterreicher has often been criticised for not providing a corpus-linguistic methodology, which makes it difficult to work with quantitative data or address large text collections within this research paradigm. Consequently, the Koch & Oesterreicher approach has difficulties gaining ground in those research areas which rely more on corpus linguistic research models, like text linguistics and LSP-research. A combinatory approach: Accordingly, we want to establish a combinatory approach with corpus-based linguistic methodology. To this end, we propose to (i) include data about the context of an utterance (e.g., monologicity/dialogicity, familiarity with the speaker) – which were called “conditions of communication” in the original work of Koch & Oesterreicher – and (ii) correlate the linguistic phenomenon at the centre of the inquiry (e.g., evaluation markers) to a group of linguistic phenomena deemed typical for either distance- or immediacy-language. Based on these two parameters, linguistic phenomena and texts could then be mapped on an immediacy-distance continuum. Pilot study: To illustrate the benefits of this approach, we will conduct a pilot study on evaluation phenomena in French television weather reports, a form of domain-sensitive discourse which has often been cited as an example of a “text genre”. Within this text genre, we will look at so-called “evaluation markers,” e.g., fixed strings like bad weather, stifling hot, and “no luck today!”. These evaluation markers help to communicate the coming weather situation towards the lay audience but have not yet been studied within the Koch & Oesterreicher research paradigm. Accordingly, we want to figure out whether said evaluation markers are more typical for those weather reports which tend more towards immediacy or those which tend more towards distance. To this aim, we collected a corpus with different kinds of television weather reports,e.g., as part of the news broadcast, including dialogue. The evaluation markers themselves will be studied according to the explained methodology, by correlating them to (i) metadata about the context and (ii) linguistic phenomena characterising immediacy-language: repetition, deixis (personal, spatial, and temporal), a freer choice of tense and right- /left-dislocation. Results: Our results indicate that evaluation markers are more dominantly present in those weather reports inclining towards immediacy-language. Based on the methodology established above, we have gained more insight into the working of evaluation markers in the domain-sensitive text genre of (television) weather reports. For future research, it will be interesting to determine whether said evaluation markers are also typical for immediacy-language-oriented in other domain-sensitive discourses.

Keywords: corpus-based linguistics, evaluation markers, language of immediacy and distance, weather reports

Procedia PDF Downloads 51
442 Historical Development of Negative Emotive Intensifiers in Hungarian

Authors: Martina Katalin Szabó, Bernadett Lipóczi, Csenge Guba, István Uveges

Abstract:

In this study, an exhaustive analysis was carried out about the historical development of negative emotive intensifiers in the Hungarian language via NLP methods. Intensifiers are linguistic elements which modify or reinforce a variable character in the lexical unit they apply to. Therefore, intensifiers appear with other lexical items, such as adverbs, adjectives, verbs, infrequently with nouns. Due to the complexity of this phenomenon (set of sociolinguistic, semantic, and historical aspects), there are many lexical items which can operate as intensifiers. The group of intensifiers are admittedly one of the most rapidly changing elements in the language. From a linguistic point of view, particularly interesting are a special group of intensifiers, the so-called negative emotive intensifiers, that, on their own, without context, have semantic content that can be associated with negative emotion, but in particular cases, they may function as intensifiers (e.g.borzasztóanjó ’awfully good’, which means ’excellent’). Despite their special semantic features, negative emotive intensifiers are scarcely examined in literature based on large Historical corpora via NLP methods. In order to become better acquainted with trends over time concerning the intensifiers, The exhaustively analysed a specific historical corpus, namely the Magyar TörténetiSzövegtár (Hungarian Historical Corpus). This corpus (containing 3 millions text words) is a collection of texts of various genres and styles, produced between 1772 and 2010. Since the corpus consists of raw texts and does not contain any additional information about the language features of the data (such as stemming or morphological analysis), a large amount of manual work was required to process the data. Thus, based on a lexicon of negative emotive intensifiers compiled in a previous phase of the research, every occurrence of each intensifier was queried, and the results were stored in a separate data frame. Then, basic linguistic processing (POS-tagging, lemmatization etc.) was carried out automatically with the ‘magyarlanc’ NLP-toolkit. Finally, the frequency and collocation features of all the negative emotive words were automatically analyzed in the corpus. Outcomes of the research revealed in detail how these words have proceeded through grammaticalization over time, i.e., they change from lexical elements to grammatical ones, and they slowly go through a delexicalization process (their negative content diminishes over time). What is more, it was also pointed out which negative emotive intensifiers are at the same stage in this process in the same time period. Giving a closer look to the different domains of the analysed corpus, it also became certain that during this process, the pragmatic role’s importance increases: the newer use expresses the speaker's subjective, evaluative opinion at a certain level.

Keywords: historical corpus analysis, historical linguistics, negative emotive intensifiers, semantic changes over time

Procedia PDF Downloads 111
441 A Corpus-Based Discourse Analysis of the Disappearance of MH370 in Malaysia and United Kingdom Newspapers: A Pilot Study

Authors: Theng Theng Ong

Abstract:

This pilot study adopts a corpus-based discourse analysis to explore the construction of Malaysia airline tragedy MH370 in the selected Malaysian and United Kingdom (UK) newspapers. Fairclough’s three-dimensional model is adopted in the study to support the corpus-based analysis. The analysis aims to determine the ways in which Malaysian Airline tragedy MH370 is linguistically defined and constructed in terms of keywords and collocation. The study also seeks to identify the types of discourse that are presented in the news articles. In addition, the differences or similarities in terms of keywords, topics or issues covered by the selected Malaysian and UK news media are examined.

Keywords: corpus, CDA, newspapers, airline tragedies

Procedia PDF Downloads 213
440 Adjectives in Academic Discourse: A Comparative Study of Research Articles

Authors: Beata Grymska

Abstract:

The research studies on academic discourse focus in general on lexical bundles, epistemic modality markers, or interactions between writers and readers. Following the research into the written forms of the academic community, this study concentrates on adjectives in research articles. The study investigates the distribution of adjectives in research articles in two academic disciplines: linguistics and medicine. It is corpus-based in design and consists of 100 linguistic and 100 medical research articles all written in English. The aim of the study is to compare the distribution of adjectives between the two corpora and four main parts of articles: IMRD (Introduction, Methods, Results, and Discussion). The second aim is to see if the two corpora share common core adjectives, e.g., different, important, specific, and if there are discipline-specific adjectives. The further part of the paper elaborates on adjectives use in the corpora together with examples. The results indicate that the two corpora do not differ in the distribution of adjectives to a great extent. The occurrences of the most frequently used adjectives depend on the academic discipline of the research articles. The concluding part reflects upon the role of adjectives in academic discourse and also presents how corpora can be helpful in composing academic texts.

Keywords: academic discourse, academic texts, adjectives, corpus analysis, research articles

Procedia PDF Downloads 96
439 A Corpus-Based Analysis of Japanese Learners' English Modal Auxiliary Verb Usage in Writing

Authors: S. Nakayama

Abstract:

For non-native English speakers, using English modal auxiliary verbs appropriately can be among the most challenging tasks. This research sought to identify differences in modal verb usage between Japanese non-native English speakers (JNNS) and native speakers (NS) from two different perspectives: frequency of use and distribution of verb phrase structures (VPS) where modal verbs occur. This study can contribute to the identification of JNNSs' interlanguage with regard to modal verbs; the main aim is to make a suggestion for the improvement of teaching materials as well as to help language teachers to be able to teach modal verbs in a way that is helpful for learners. To address the primary question in this study, usage of nine central modals (‘can’, ‘could’, ‘may’, ‘might’, ‘shall’, ‘should’, ‘will’, ‘would’, and ‘must’) by JNNS was compared with that by NSs in the International Corpus Network of Asian Learners of English (ICNALE). This corpus is one of the largest freely-available corpora focusing on Asian English learners’ language use. The ICNALE corpus consists of four modules: ‘Spoken Monologue’, ‘Spoken Dialogue’, ‘Written Essays’, and ‘Edited Essays’. Among these, this research adopted the ‘Written Essays’ module only, which is the set of 200-300 word essays and contains approximately 1.3 million words in total. Frequency analysis revealed gaps as well as similarities in frequency order. Specifically, both JNNSs and NSs used ‘can’ with the most frequency, followed by ‘should’ and ‘will’; however, usage of all the other modals except for ‘shall’ was not identical to each other. A log-likelihood test uncovered JNNSs’ overuse of ‘can’ and ‘must’ as well as their underuse of ‘will’ and ‘would’. VPS analysis revealed that JNNSs used modal verbs in a relatively narrow range of VPSs as compared to NSs. Results showed that JNNSs used most of the modals with bare infinitives or the passive voice only whereas NSs used the modals in a wide range of VPSs including the progressive construction and the perfect aspect, both of which were the structures where JNNSs rarely used the modals. Results of frequency analysis suggest that language teachers or teaching materials should explain other modality items so that learners can avoid relying heavily on certain modals and have a wide range of lexical items to reflect their feelings more accurately. Besides, the underused modals should be more stressed in the classroom because they are members of epistemic modals, which allow us to not only interject our views into propositions but also build a relationship with readers. As for VPSs, teaching materials should present more examples of the modals occurring in a wide range of VPSs to help learners to be able to express their opinions from a variety of viewpoints.

Keywords: corpus linguistics, Japanese learners of English, modal auxiliary verbs, International Corpus Network of Asian Learners of English

Procedia PDF Downloads 46
438 The Repetition of New Words and Information in Mandarin-Speaking Children: A Corpus-Based Study

Authors: Jian-Jun Gao

Abstract:

Repetition is used for a variety of functions in conversation. When young children first learn to speak, they often repeat words from the adult’s recent utterance with the learning and social function. The objective of this study was to ascertain whether the repetitions are equivalent in indicating attention to new words and the initial repeat of information in conversation. Based on the observation of naturally occurring language use in Taiwan Corpus of Child Mandarin (TCCM), the results in this study provided empirical support to the previous findings that children are more likely to repeat new words they are offered than to repeat new information. When children get older, there would be a drop in the repetition of both new words and new information.

Keywords: acquisition, corpus, mandarin, new words, new information, repetition

Procedia PDF Downloads 60
437 Men Act, Women Are Acted Upon: Morphosyntactic Framing of the Sexual Intercourse in Online Pornography Titles

Authors: Aleksandra Tomic

Abstract:

According to reliable sources, 4% of all websites is devoted to pornographic material, yet these estimates are often reported to be much higher. The largest internet pornography streaming website reports 21.2 billion visits in 2015 only. Considering the ubiquity of online pornography and the frequency of use, it is necessary to examine its potential influence on the construal of the sexual act and the roles of participants. Apart from the verbal and physical interactions in the pornographic movies themselves, the language in the titles of movies has the power to frame the sexual intercourse. In this study, Critical Discourse Analysis and corpus linguistics approaches will be used to examine the way the sexual intercourse and the roles of the participants are ideologically construed and perpetuated in the Internet pornography discourse. To this end, the study will explore the association between the specific morphosyntactic aspects of the references to performers of both genders, the person and the thematic role, and the gender of referred performer in the corpus of online pornographic movie titles. Distinctive collexeme analysis will be conducted to uncover possible associations between for gender of the performer denoted by the linguistic expression, and the person and thematic role assigned to it in the titles of online pornography movies. Initial results of the chi-square procedure performed on a sample of 295 online pornography movie titles on the largest pornography streaming website ‘Pornhub’ yielded significant results. The use of the three person categories was not equally distributed between genders, X2 (2, N = 106) = 32.52, p < 0.001, with female performers being referred to in the third person in 71.7% of the instances, and speaking in the first person 20.8% of the time, whereas male performers spoke in the first person 68% of the time, and were referred to in the third person in 17% of the instances. Moreover, there was a gender disparity in the assignment of thematic roles, with linguistic expressions for women being assigned the Patient role and men the Agent role in 58.8% of the cases, whereas the roles were reversed in 41.2% of the instances, X2 (1, N = 262) = 8.07633, p < 0.005. The results are discussed in terms of the ideologies surrounding female and male sexuality in the pornography discourse. Potential patterns of power imbalance, objectification, and discrimination are highlighted. Finally, the evidence from psycholinguistic studies on the influence of the language structure on event construal is related to the results of the study.

Keywords: corpus linguistics, gender studies, pornography, thematic roles

Procedia PDF Downloads 98