Search results for: corpus building
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 4242

Search results for: corpus building

4242 A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: Aqa-Webcorp

Authors: Wided Bakari, Patrce Bellot, Mahmoud Neji

Abstract:

With the development of electronic media and the heterogeneity of Arabic data on the Web, the idea of building a clean corpus for certain applications of natural language processing, including machine translation, information retrieval, question answer, become more and more pressing. In this manuscript, we seek to create and develop our own corpus of pair’s questions-texts. This constitution then will provide a better base for our experimentation step. Thus, we try to model this constitution by a method for Arabic insofar as it recovers texts from the web that could prove to be answers to our factual questions. To do this, we had to develop a java script that can extract from a given query a list of html pages. Then clean these pages to the extent of having a database of texts and a corpus of pair’s question-texts. In addition, we give preliminary results of our proposal method. Some investigations for the construction of Arabic corpus are also presented in this document.

Keywords: Arabic, web, corpus, search engine, URL, question, corpus building, script, Google, html, txt

Procedia PDF Downloads 291
4241 Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: ’Reddit’

Authors: Yasmeen Bassas, Sandra Kuebler, Allen Riddell

Abstract:

Native language identification is one of the growing subfields in natural language processing (NLP). The task of native language identification (NLI) is mainly concerned with predicting the native language of an author’s writing in a second language. In this paper, we investigate the performance of two types of features; content-based features vs. content independent features, when they are evaluated on a different corpus (using social media data “Reddit”). In this NLI task, the predefined models are trained on one corpus (TOEFL), and then the trained models are evaluated on different data using an external corpus (Reddit). Three classifiers are used in this task; the baseline, linear SVM, and logistic regression. Results show that content-based features are more accurate and robust than content independent ones when tested within the corpus and across corpus.

Keywords: NLI, NLP, content-based features, content independent features, social media corpus, ML

Procedia PDF Downloads 95
4240 Semantic Preference across Research Articles: A Corpus-Based Study of Adjectives in English

Authors: Valdênia Carvalho e Almeida

Abstract:

The goal of the present study is to investigate the semantic preference of the most frequent adjectives in research articles through a corpus-based analysis of texts published in journals in Applied Linguistics (AL). The corpus used in this study contains texts published in the period from 2014 to 2018 in the three journals: Language Learning and Technology; English for Academic Purposes, and TESOL Quaterly, totaling more than one million words. A corpus-based analysis was carried out on the corpus to identify the most frequent adjectives that co-occurred in the three journals. By observing the concordance lines of the adjectives and analyzing the words they associated with, the semantic preferences of each adjective were determined. Later, the AL corpus analysis was compared to the investigation of the same adjectives in a corpus of Chemistry. This second part of the study aimed to identify possible differences and similarities between the two corpora in relation to the use of the adjectives in research articles from both areas. The results show that there are some preferences which seem to be closely related not only to the academic genre of the texts but also to the specific domain of the discipline and, to a lesser extent, to the context of research in each journal. This research illustrates a possible contribution of Corpus Linguistics to explore the concept of semantic preference in more detail, considering the complex nature of the phenomenon.

Keywords: applied linguistics, corpus linguistics, chemistry, research article, semantic preference

Procedia PDF Downloads 146
4239 Specialized Translation Teaching Strategies: A Corpus-Based Approach

Authors: Yingying Ding

Abstract:

This study presents a methodology of specialized translation with the objective of helping teachers to improve the strategies in teaching translation. In order to allow students to acquire skills to translate specialized texts, they need to become familiar with the semantic and syntactic features of source texts and target texts. The aim of our study is to use a corpus-based approach in the teaching of specialized translation between Chinese and Italian. This study proposes to construct a specialized Chinese - Italian comparable corpus that consists of 50 economic contracts from the domain of food. With the help of AntConc, we propose to compile a comparable corpus in for translation teaching purposes. This paper attempts to provide insight into how teachers could benefit from comparable corpus in the teaching of specialized translation from Italian into Chinese and through some examples of passive sentences how students could learn to apply different strategies for translating appropriately the voice.

Keywords: contrastive studies, specialised translation, corpus-based approach, teaching

Procedia PDF Downloads 333
4238 Metaphors Investigation between President Xi Jinping of China and Trump of Us on the Corpus-Based Approach

Authors: Jie Zheng, Ruifeng Luo

Abstract:

The United States is the world’s most developed economy with the strongest military power. China is the fastest growing country with growing comprehensive strength and its economic strength is second only to the US. However, the conflict between them is getting serious in recent years. President’s address is the representative of a nation’s ideology. The paper has built up a small sized corpus of President Xi Jinping and Trump’s speech in Davos to investigate their respective use and types of metaphors and calculate the respective percentage of each type of metaphor. The result shows President Xi Jinping employs more metaphors than Trump. The metaphors of Xi includes “building” metaphor, “plant” metaphor, “journey” metaphor, “ship” metaphor, “traffic” metaphor, “nation is a person” metaphor, “show” metaphor, etc while Trump’s comprises “war” metaphor, “building” metaphor, “journey” metaphor, “traffic” metaphor, “tax” metaphor, “book” metaphor, etc. After investigating metaphor use differences, the paper makes an analysis of the underlying ideology between the two nations. China is willing to strengthen ties with all the countries all over the world and has built a platform of development for them and itself to go to the destination of social well being while the US pays much concern to itself, emphasizing its first leading position and is also willing to help its alliances to development. The paper’s comparison of the ideology difference between the two countries will help them get a better understanding and reduce the conflict to some extent.

Keywords: metaphor; corpus; ideology; conflict

Procedia PDF Downloads 116
4237 Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development

Authors: L. Kamandulytė-Merfeldienė

Abstract:

The paper deals with the main issues of methodology of the Corpus of Spoken Lithuanian which was started to be developed in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, the transcription of the recorded data, and the grammatical annotation. Collecting the data was based on the principles of balance and naturality. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to the constant increase in studies on spontaneous communication, and various papers have dealt with a distribution of parts of speech, use of different grammatical forms, variation of inflectional paradigms, distribution of fillers, syntactic functions of adjectives, the mean length of utterances.

Keywords: CHILDES, corpus of spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian

Procedia PDF Downloads 204
4236 Corporate Cautionary Statement: A Genre of Professional Communication

Authors: Chie Urawa

Abstract:

Cautionary statements or disclaimers in corporate annual reports need to be carefully designed because clear cautionary statements may protect a company in the case of legal disputes and may undermine positive impressions. This study compares the language of cautionary statements using two corpora, Sony’s cautionary statement corpus (S-corpus) and Panasonic’s cautionary statement corpus (P-corpus), illustrating the differences and similarities in relation to the use of meaningful cautionary statements and critically analyzing why practitioners use the way. The findings describe the distinct differences between the two companies in the presentation of the risk factors and the way how they make the statements. The word ability is used more for legal protection in S-corpus whereas the word possibility is used more to convey a better impression in P-corpus. The main similarities are identified in the use of lexical words and pronouns, and almost the same wordings for eight years. The findings show how they make the statements unique to the company in the presentation of risk factors, and the characteristics of specific genre of professional communication. Important implications of this study are that more comprehensive approach can be applied in other contexts, and be used by companies to reflect upon their cautionary statements.

Keywords: cautionary statements, corporate annual reports, corpus, risk factors

Procedia PDF Downloads 131
4235 A Corpus-Based Study on the Styles of Three Translators

Authors: Wang Yunhong

Abstract:

The present paper is preoccupied with the different styles of three translators in their translating a Chinese classical novel Shuihu Zhuan. Based on a parallel corpus, it adopts a target-oriented approach to look into whether and what stylistic differences and shifts the three translations have revealed. The findings show that the three translators demonstrate different styles concerning their word choices and sentence preferences, which implies that identification of recurrent textual patterns may be a basic step for investigating the style of a translator.

Keywords: corpus, lexical choices, sentence characteristics, style

Procedia PDF Downloads 231
4234 A Corpus-Assisted Discourse Analysis of Adjectival Collocation of the Word 'Education' in the American Context

Authors: Ngan Nguyen

Abstract:

The study analyses adjectives collocating with the word ‘education’ in the American language of the Corpus of Global Web-based English using a combination of corpus linguistic and discourse analytical methods to examine not only language patterns but also social political ideologies around the topic. Significant conclusions are deduced: (1) there are a large number of adjectival collocates of the word education which have been identified and classified into four categories representing four different aspects of education: level, quality, forms and types of education; (2) education, as in combination with three first categories, carries the meaning as the act and process of teaching and learning while with the last category having the meaning of a particular kind of teaching or training; (3) higher education is the topic that gains most concerns from the American public; (4) five most significant ideologies are discovered from the corpus: higher education associates with financial affairs, higher education is an industry, monetary policy of the government on higher education, people require greater accessibility to higher education and people value higher education. The study contributes to the field of developing meanings of words through corpus analysis and the field of discourse analysis.

Keywords: adjectival collocation, American context, corpus linguistics, discourse analysis, education

Procedia PDF Downloads 297
4233 The Analysis of Deceptive and Truthful Speech: A Computational Linguistic Based Method

Authors: Seham El Kareh, Miramar Etman

Abstract:

Recently, detecting liars and extracting features which distinguish them from truth-tellers have been the focus of a wide range of disciplines. To the author’s best knowledge, most of the work has been done on facial expressions and body gestures but only few works have been done on the language used by both liars and truth-tellers. This paper sheds light on four axes. The first axis copes with building an audio corpus for deceptive and truthful speech for Egyptian Arabic speakers. The second axis focuses on examining the human perception of lies and proving our need for computational linguistic-based methods to extract features which characterize truthful and deceptive speech. The third axis is concerned with building a linguistic analysis program that could extract from the corpus the inter- and intra-linguistic cues for deceptive and truthful speech. The program built here is based on selected categories from the Linguistic Inquiry and Word Count program. Our results demonstrated that Egyptian Arabic speakers on one hand preferred to use first-person pronouns and present tense compared to the past tense when lying and their lies lacked of second-person pronouns, and on the other hand, when telling the truth, they preferred to use the verbs related to motion and the nouns related to time. The results also showed that there is a need for bigger data to prove the significance of words related to emotions and numbers.

Keywords: Egyptian Arabic corpus, computational analysis, deceptive features, forensic linguistics, human perception, truthful features

Procedia PDF Downloads 176
4232 Saudi Twitter Corpus for Sentiment Analysis

Authors: Adel Assiri, Ahmed Emam, Hmood Al-Dossari

Abstract:

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have yet to directly apply SA to Arabic due to lack of a publicly available dataset for this language. This paper partially bridges this gap due to its focus on one of the Arabic dialects which is the Saudi dialect. This paper presents annotated data set of 4700 for Saudi dialect sentiment analysis with (K= 0.807). Our next work is to extend this corpus and creation a large-scale lexicon for Saudi dialect from the corpus.

Keywords: Arabic, sentiment analysis, Twitter, annotation

Procedia PDF Downloads 587
4231 Corpus Linguistic Methods in a Theoretical Study of Quran Verb Tense and Aspect in Translations from Arabic to English

Authors: Jawharah Alasmari

Abstract:

In inflectional morphology of verb, tense and aspect indicate action’s time either past/present or future and their period whether completed or not. The usage and meaning of tense and aspect differ in Arabic and English, therefore is no simple one -to- one mapping from an Arabic verb inflected form an appropriate English translation depends on a range of features, including immediate and wider context of use. The Quranic Arabic Corpus includes seven alternative expertly crafted English translations of each Arabic verses, which provides a test dataset for the study of appropriate Arabic to English translations of verb tense and aspect. We applied Corpus Linguistics Methods in a theoretical study of exemplary verbs, to elicit candidate verbal contexts which influence the choice of English inflection for each verse.

Keywords: Corpus linguistics methods, Arabic verb, tense and aspect, English translations

Procedia PDF Downloads 352
4230 Combining Corpus Linguistics and Critical Discourse Analysis to Study Power Relations in Hindi Newspapers

Authors: Vandana Mishra, Niladri Sekhar Dash, Jayshree Charkraborty

Abstract:

This present paper focuses on the application of corpus linguistics techniques for critical discourse analysis (CDA) of Hindi newspapers. While Corpus linguistics is the study of language as expressed in corpora (samples) of 'real world' text, CDA is an interdisciplinary approach to the study of discourse that views language as a form of social practice. CDA has mainly been studied from a qualitative perspective. However, we can say that recent studies have begun combining corpus linguistics with CDA in analyzing large volumes of text for the study of existing power relations in society. The corpus under our study is also of a sizable amount (1 million words of Hindi newspaper texts) and its analysis requires an alternative analytical procedure. So, we have combined both the quantitative approach i.e. the use of corpus techniques with CDA’s traditional qualitative analysis. In this context, we have focused on the Keyword Analysis Sorting Concordance Lines of the selected Keywords and calculating collocates of the keywords. We have made use of the Wordsmith Tool for all these analysis. The analysis starts with identifying the keywords in the political news corpus when compared with the main news corpus. The keywords are extracted from the corpus based on their keyness calculated through statistical tests like chi-squared test and log-likelihood test on the frequent words of the corpus. Some of the top occurring keywords are मोदी (Modi), भाजपा (BJP), कांग्रेस (Congress), सरकार (Government) and पार्टी (Political party). This is followed by the concordance analysis of these keywords which generates thousands of lines but we have to select few lines and examine them based on our objective. We have also calculated the collocates of the keywords based on their Mutual Information (MI) score. Both concordance and collocation help to identify lexical patterns in the political texts. Finally, all these quantitative results derived from the corpus techniques will be subjectively interpreted in accordance to the CDA’s theory to examine the ways in which political news discourse produces social and political inequality, power abuse or domination.

Keywords: critical discourse analysis, corpus linguistics, Hindi newspapers, power relations

Procedia PDF Downloads 179
4229 A Corpus-Based Discourse Analysis of the Disappearance of MH370 in Malaysia and United Kingdom Newspapers: A Pilot Study

Authors: Theng Theng Ong

Abstract:

This pilot study adopts a corpus-based discourse analysis to explore the construction of Malaysia airline tragedy MH370 in the selected Malaysian and United Kingdom (UK) newspapers. Fairclough’s three-dimensional model is adopted in the study to support the corpus-based analysis. The analysis aims to determine the ways in which Malaysian Airline tragedy MH370 is linguistically defined and constructed in terms of keywords and collocation. The study also seeks to identify the types of discourse that are presented in the news articles. In addition, the differences or similarities in terms of keywords, topics or issues covered by the selected Malaysian and UK news media are examined.

Keywords: corpus, CDA, newspapers, airline tragedies

Procedia PDF Downloads 260
4228 Automatic Tagging and Accuracy in Assamese Text Data

Authors: Chayanika Hazarika Bordoloi

Abstract:

This paper is an attempt to work on a highly inflectional language called Assamese. This is also one of the national languages of India and very little has been achieved in terms of computational research. Building a language processing tool for a natural language is not very smooth as the standard and language representation change at various levels. This paper presents inflectional suffixes of Assamese verbs and how the statistical tools, along with linguistic features, can improve the tagging accuracy. Conditional random fields (CRF tool) was used to automatically tag and train the text data; however, accuracy was improved after linguistic featured were fed into the training data. Assamese is a highly inflectional language; hence, it is challenging to standardizing its morphology. Inflectional suffixes are used as a feature of the text data. In order to analyze the inflections of Assamese word forms, a list of suffixes is prepared. This list comprises suffixes, comprising of all possible suffixes that various categories can take is prepared. Assamese words can be classified into inflected classes (noun, pronoun, adjective and verb) and un-inflected classes (adverb and particle). The corpus used for this morphological analysis has huge tokens. The corpus is a mixed corpus and it has given satisfactory accuracy. The accuracy rate of the tagger has gradually improved with the modified training data.

Keywords: CRF, morphology, tagging, tagset

Procedia PDF Downloads 164
4227 Passive Voice in SLA: Armenian Learners’ Case Study

Authors: Emma Nemishalyan

Abstract:

It is believed that learners’ mother tongue (L1 hereafter) has a huge impact on their second language acquisition (L2 hereafter). This hypothesis has been exposed to both positive and negative criticism. Based on research results of a wide range of learners’ corpora (Chinese, Japanese, Spanish among others) the hypothesis has either been proved or disproved. However, no such study has been conducted on the Armenian learners. The aim of this paper is to understand the implication of the hypothesis on the Armenian learners’ corpus in terms of the use of the passive voice. To this end, the method of Contrastive Interlanguage Analysis (hereafter CIA) has been used on native speakers’ corpus (Louvain Corpus of Native English Essays (LOCNESS)) and Armenian learners’ corpus which has been compiled by me in compliance with International Corpus of Learner English (ICLE) guidelines. CIA compares the interlanguage (the language produced by learners) with the one produced by native speakers. With the help of this method, it is possible not only to highlight the mistakes that learners make, but also to underline the under or overuses. The choice of the grammar issue (passive voice) is conditioned by the fact that typologically Armenian and English are drastically different as they belong to different branches. Moreover, the passive voice is considered to be one of the most problematic grammar topics to be acquired by learners of the English language. Based on this difference, we hypothesized that Armenian learners would either overuse or underuse some types of the passive voice. With the help of Lancsbox software, we have identified the frequency rates of passive voice usage in LOCNESS and Armenian learners’ corpus to understand whether the latter have the same usage pattern of the passive voice as the native speakers. Secondly, we have identified the types of the passive voice used by the Armenian leaners trying to track down the reasons in their mother tongue. The results of the study showed that Armenian learners underused the passive voices in contrast to native speakers. Furthermore, the hypothesis that learners’ L1 has an impact on learners’ L2 acquisition and production was proved.

Keywords: corpus linguistics, applied linguistics, second language acquisition, corpus compilation

Procedia PDF Downloads 48
4226 The Repetition of New Words and Information in Mandarin-Speaking Children: A Corpus-Based Study

Authors: Jian-Jun Gao

Abstract:

Repetition is used for a variety of functions in conversation. When young children first learn to speak, they often repeat words from the adult’s recent utterance with the learning and social function. The objective of this study was to ascertain whether the repetitions are equivalent in indicating attention to new words and the initial repeat of information in conversation. Based on the observation of naturally occurring language use in Taiwan Corpus of Child Mandarin (TCCM), the results in this study provided empirical support to the previous findings that children are more likely to repeat new words they are offered than to repeat new information. When children get older, there would be a drop in the repetition of both new words and new information.

Keywords: acquisition, corpus, mandarin, new words, new information, repetition

Procedia PDF Downloads 110
4225 Chinese Students’ Use of Corpus Tools in an English for Academic Purposes Writing Course: Influence on Learning Behaviour, Performance Outcomes and Perceptions

Authors: Jingwen Ou

Abstract:

Writing for academic purposes in a second or foreign language poses a significant challenge for non-native speakers, particularly at the tertiary level, where English academic writing for L2 students is often hindered by difficulties in academic discourse, including vocabulary, academic register, and organization. The past two decades have witnessed a rising popularity in the application of the data-driven learning (DDL) approach in EAP writing instruction. In light of such a trend, this study aims to enhance the integration of DDL into English for academic purposes (EAP) writing classrooms by investigating the perception of Chinese college students regarding the use of corpus tools for improving EAP writing. Additionally, the research explores their corpus consultation behaviors during training to provide insights into corpus-assisted EAP instruction for DDL practitioners. Given the uprising popularity of DDL, this research aims to investigate Chinese university students’ use of corpus tools with three main foci: 1) the influence of corpus tools on learning behaviours, 2) the influence of corpus tools on students’ academic writing performance outcomes, and 3) students’ perceptions and potential perceptional changes towards the use of such tools. Three corpus tools, CQPWeb, Sketch Engine, and LancsBox X, are selected for investigation due to the scarcity of empirical research on patterns of learners’ engagement with a combination of multiple corpora. The research adopts a pre-test / post-test design for the evaluation of students’ academic writing performance before and after the intervention. Twenty participants will be divided into two groups: an intervention and a non-intervention group. Three corpus training workshops will be delivered at the beginning, middle, and end of a semester. An online survey and three separate focus group interviews are designed to investigate students’ perceptions of the use of corpus tools for improving academic writing skills, particularly the rhetorical functions in different essay sections. Insights from students’ consultation sessions indicated difficulties with DDL practice, including insufficiency of time to complete all tasks, struggle with technical set-up, unfamiliarity with the DDL approach and difficulty with some advanced corpus functions. Findings from the main study aim to provide pedagogical insights and training resources for EAP practitioners and learners.

Keywords: corpus linguistics, data-driven learning, English for academic purposes, tertiary education in China

Procedia PDF Downloads 20
4224 Corpus-Based Model of Key Concepts Selection for the Master English Language Course "Government Relations"

Authors: Elena Pozdnyakova

Abstract:

“Government Relations” is a field of knowledge presently taught at the majority of universities around the globe. English as the default language can become the language of teaching since the issues discussed are both global and national in character. However for this field of knowledge key concepts and their word representations in English don’t often coincide with those in other languages. International master’s degree students abroad as well as students, taught the course in English at their national universities, are exposed to difficulties, connected with correct conceptualizing of terminology of GR in British and American academic traditions. The study was carried out during the GR English language course elaboration (pilot research: 2013 -2015) at Moscow State Institute of Foreign Relations (University), Russian Federation. Within this period, English language instructors designed and elaborated the three-semester course of GR. Methodologically the course design was based on elaboration model with the special focus on conceptual elaboration sequence and theoretical elaboration sequence. The course designers faced difficulties in concept selection and theoretical elaboration sequence. To improve the results and eliminate the problems with concept selection, a new, corpus-based approach was worked out. The computer-based tool WordSmith 6.0 was used with the aim to build a model of key concept selection. The corpus of GR English texts consisted of 1 million words (the study corpus). The approach was based on measuring effect size, i.e. the percent difference of the frequency of a word in the study corpus when compared to that in the reference corpus. The results obtained proved significant improvement in the process of concept selection. The corpus-based model also facilitated theoretical elaboration of teaching materials.

Keywords: corpus-based study, English as the default language, key concepts, measuring effect size, model of key concept selection

Procedia PDF Downloads 267
4223 OPEN-EmoRec-II-A Multimodal Corpus of Human-Computer Interaction

Authors: Stefanie Rukavina, Sascha Gruss, Steffen Walter, Holger Hoffmann, Harald C. Traue

Abstract:

OPEN-EmoRecII is an open multimodal corpus with experimentally induced emotions. In the first half of the experiment, emotions were induced with standardized picture material and in the second half during a human-computer interaction (HCI), realized with a wizard-of-oz design. The induced emotions are based on the dimensional theory of emotions (valence, arousal and dominance). These emotional sequences - recorded with multimodal data (mimic reactions, speech, audio and physiological reactions) during a naturalistic-like HCI-environment one can improve classification methods on a multimodal level. This database is the result of an HCI-experiment, for which 30 subjects in total agreed to a publication of their data including the video material for research purposes. The now available open corpus contains sensory signal of: video, audio, physiology (SCL, respiration, BVP, EMG Corrugator supercilii, EMG Zygomaticus Major) and mimic annotations.

Keywords: open multimodal emotion corpus, annotated labels, intelligent interaction

Procedia PDF Downloads 376
4222 Compilation and Statistical Analysis of an Arabic-English Legal Corpus in Sketch Engine

Authors: C. Brierley, H. El-Farahaty, A. Farhan

Abstract:

The Leeds Parallel Corpus of Arabic-English Constitutions is a parallel corpus for the Arabic legal domain. Analysis of legal language via Corpus Linguistics techniques is an important development. In legal proceedings, a corpus-based approach to disambiguating meaning is set to replace the dictionary as an interpretative tool, and legal scholarship in the States is now attuned to the potential for Text Analytics over vast quantities of text-based legal material, following the business and medical industries. This trend is reflected in Europe: the interdisciplinary research group in Computer Assisted Legal Linguistics mines big data collections of legal and non-legal texts to analyse: legal interpretations; legal discourse; the comprehensibility of legal texts; conflict resolution; and linguistic human rights. This paper focuses on ‘dignity’ as an important aspect of the overarching concept of human rights in current constitutions across the Arab world. We have compiled a parallel, Arabic-English raw text corpus (169,861 Arabic words and 205,893 English words) from reputable websites such as the World Intellectual Property Organisation and CONSTITUTE, and uploaded and queried our corpus in Sketch Engine. Our most challenging task was sentence-level alignment of Arabic-English data. This entailed manual intervention to ensure correspondence on a one-to-many basis since Arabic sentences differ from English in length and punctuation. We have searched for morphological variants of ‘dignity’ (رامة ك, karāma) in the Arabic data and inspected their English translation equivalents. The term occurs most frequently in the Sudanese constitution (10 instances), and not at all in the constitution of Palestine. Its most frequent collocate, determined via the logDice statistic in Sketch Engine, is ‘human’ as in ‘human dignity’.

Keywords: Arabic constitution, corpus-based legal linguistics, human rights, parallel Arabic-English legal corpora

Procedia PDF Downloads 144
4221 Corpus-Assisted Study of Gender Related Tiger Metaphors in the Chinese Context

Authors: Na Xiao

Abstract:

Animal metaphors have many different connotations, ranging from loving emotions to derogatory epithets, but gender expressions using animal metaphors are often imbalanced. Generally, animal metaphors related to females tend to be negative. Little known about the reasons for the negative expressions of animal female metaphors in Chinese contexts still have not been quantified. The Modern Chinese Corpus at the Center for Chinese Linguistics at Peking University (CCL Corpus) provided the data for this research, which aims to identify the influencing variables of gender differences in the description of animal metaphors mapping humans in Chinese by observing the percentage of "tiger" metaphor, which is based on the conceptual metaphor theory. A quantitative research method was used in this study to statistically examine the gender attitude percentage of the "tiger" metaphor using corpus data. This study has proved that the tiger metaphors associated with humans in the Chinese context tend to be negative. Importantly, this study has also shown that the high proportion of tiger metaphorical idioms is what causes the high proportion of negative tiger metaphors that are related to women. This finding can be used as crucial information for future studies on other gender-related animal metaphorical idioms and can offer additional insights for understanding trends in other animal metaphors.

Keywords: Chinese, CCL corpus, gender differences, metaphorical idioms, tigers

Procedia PDF Downloads 68
4220 Translation Quality Assessment in Fansubbed English-Chinese Swearwords: A Corpus-Based Study of the Big Bang Theory

Authors: Qihang Jiang

Abstract:

Fansubbing, the combination of fan and subtitling, is one of the main branches of Audiovisual Translation (AVT) having kindled more and more interest of researchers into the AVT field in recent decades. In particular, the quality of so-called non-professional translation seems questionable due to the non-transparent qualification of subtitlers in a huge community network. This paper attempts to figure out how YYeTs aka 'ZiMuZu', the largest fansubbing group in China, translates swearwords from English to Chinese for its fans of the prevalent American sitcom The Big Bang Theory, taking cultural, social and political elements into account in the context of China. By building a bilingual corpus containing both the source and target texts, this paper found that most of the original swearwords were translated in a toned-down manner, probably due to Chinese audiences’ cultural and social network features as well as the strict censorship under the Chinese government. Additionally, House (2015)’s newly revised model of Translation Quality Assessment (TQA) was applied and examined. Results revealed that most of the subtitled swearwords achieved their pragmatic functions and exerted a communicative effect for audiences. In conclusion, this paper enriches the empirical research concerning House’s new TQA model, gives a full picture of the subtitling of swearwords in AVT field and provides a practical guide for the practitioners in their career of subtitling.

Keywords: corpus-based approach, fansubbing, pragmatic functions, swearwords, translation quality assessment

Procedia PDF Downloads 115
4219 Redundancy in Malay Morphology: School Grammar versus Corpus Grammar

Authors: Zaharani Ahmad, Nor Hashimah Jalaluddin

Abstract:

The aim of this paper is to examine and identify the issue of linguistic redundancy in two competing grammars of Malay, namely the school grammar and the corpus grammar. The former is a normative grammar which is formally and prescriptively taught in the classroom, whereas the latter is a descriptive grammar that is informally acquired and mastered by the students as native speakers of the language outside the classroom. Corpus grammar is depicted based on its actual used in natural occurring texts, as attested in the corpus. It is observed that the grammar taught in schools is incompatible with the grammar used in the corpus. For instance, a noun phrase containing nominal reduplicated form which denotes plurality (i.e. murid-murid ‘students’ which is derived from murid ‘student’) and a modifier categorized as quantifiers (i.e. semua ‘all’, seluruh ‘entire’, and kebanyakan ‘most’) is not acceptable in the school grammar because the formation (i.e. semua murid-murid ‘all the students’ kebanyakan pelajar-pelajar ‘most of the students’) is claimed to be redundant, and redundancy is prohibited in the grammar. Redundancy is generally construed as the property of speech and language by which more information is provided than is precisely required for the message to be understood, so that, if some information is omitted, the remaining information will still be sufficient for the message to be comprehended. Thus, the correct construction to be used is strictly the reduplicated form (i.e. murid-murid ‘students’) or the quantifier plus the root (i.e. semua murid ‘all the students’) with the intention that the grammatical meaning of plural is not repeated. Nevertheless, the so-called redundant form (i.e. kebanyakan pelajar-pelajar ‘most of the students’) is frequently used in the corpus grammar. This study shows that there are a number of redundant forms occur in the morphology of the language, particularly in affixation, reduplication and combination of both. Apparently, the so-called redundancy has grammatical and socio-cultural functions in communication that is to give emphasis and to stress the importance of the information delivered by the speakers or writers.

Keywords: corpus grammar, morphology, redundancy, school grammar

Procedia PDF Downloads 306
4218 The Automatisation of Dictionary-Based Annotation in a Parallel Corpus of Old English

Authors: Ana Elvira Ojanguren Lopez, Javier Martin Arista

Abstract:

The aims of this paper are to present the automatisation procedure adopted in the implementation of a parallel corpus of Old English, as well as, to assess the progress of automatisation with respect to tagging, annotation, and lemmatisation. The corpus consists of an aligned parallel text with word-for-word comparison Old English-English that provides the Old English segment with inflectional form tagging (gloss, lemma, category, and inflection) and lemma annotation (spelling, meaning, inflectional class, paradigm, word-formation and secondary sources). This parallel corpus is intended to fill a gap in the field of Old English, in which no parallel and/or lemmatised corpora are available, while the average amount of corpus annotation is low. With this background, this presentation has two main parts. The first part, which focuses on tagging and annotation, selects the layouts and fields of lexical databases that are relevant for these tasks. Most information used for the annotation of the corpus can be retrieved from the lexical and morphological database Nerthus and the database of secondary sources Freya. These are the sources of linguistic and metalinguistic information that will be used for the annotation of the lemmas of the corpus, including morphological and semantic aspects as well as the references to the secondary sources that deal with the lemmas in question. Although substantially adapted and re-interpreted, the lemmatised part of these databases draws on the standard dictionaries of Old English, including The Student's Dictionary of Anglo-Saxon, An Anglo-Saxon Dictionary, and A Concise Anglo-Saxon Dictionary. The second part of this paper deals with lemmatisation. It presents the lemmatiser Norna, which has been implemented on Filemaker software. It is based on a concordance and an index to the Dictionary of Old English Corpus, which comprises around three thousand texts and three million words. In its present state, the lemmatiser Norna can assign lemma to around 80% of textual forms on an automatic basis, by searching the index and the concordance for prefixes, stems and inflectional endings. The conclusions of this presentation insist on the limits of the automatisation of dictionary-based annotation in a parallel corpus. While the tagging and annotation are largely automatic even at the present stage, the automatisation of alignment is pending for future research. Lemmatisation and morphological tagging are expected to be fully automatic in the near future, once the database of secondary sources Freya and the lemmatiser Norna have been completed.

Keywords: corpus linguistics, historical linguistics, old English, parallel corpus

Procedia PDF Downloads 166
4217 Tagging a corpus of Media Interviews with Diplomats: Challenges and Solutions

Authors: Roberta Facchinetti, Sara Corrizzato, Silvia Cavalieri

Abstract:

Increasing interconnection between data digitalization and linguistic investigation has given rise to unprecedented potentialities and challenges for corpus linguists, who need to master IT tools for data analysis and text processing, as well as to develop techniques for efficient and reliable annotation in specific mark-up languages that encode documents in a format that is both human and machine-readable. In the present paper, the challenges emerging from the compilation of a linguistic corpus will be taken into consideration, focusing on the English language in particular. To do so, the case study of the InterDiplo corpus will be illustrated. The corpus, currently under development at the University of Verona (Italy), represents a novelty in terms both of the data included and of the tag set used for its annotation. The corpus covers media interviews and debates with diplomats and international operators conversing in English with journalists who do not share the same lingua-cultural background as their interviewees. To date, this appears to be the first tagged corpus of international institutional spoken discourse and will be an important database not only for linguists interested in corpus analysis but also for experts operating in international relations. In the present paper, special attention will be dedicated to the structural mark-up, parts of speech annotation, and tagging of discursive traits, that are the innovational parts of the project being the result of a thorough study to find the best solution to suit the analytical needs of the data. Several aspects will be addressed, with special attention to the tagging of the speakers’ identity, the communicative events, and anthropophagic. Prominence will be given to the annotation of question/answer exchanges to investigate the interlocutors’ choices and how such choices impact communication. Indeed, the automated identification of questions, in relation to the expected answers, is functional to understand how interviewers elicit information as well as how interviewees provide their answers to fulfill their respective communicative aims. A detailed description of the aforementioned elements will be given using the InterDiplo-Covid19 pilot corpus. The data yielded by our preliminary analysis of the data will highlight the viable solutions found in the construction of the corpus in terms of XML conversion, metadata definition, tagging system, and discursive-pragmatic annotation to be included via Oxygen.

Keywords: spoken corpus, diplomats’ interviews, tagging system, discursive-pragmatic annotation, english linguistics

Procedia PDF Downloads 151
4216 Statistical Comparison of Machine and Manual Translation: A Corpus-Based Study of Gone with the Wind

Authors: Yanmeng Liu

Abstract:

This article analyzes and compares the linguistic differences between machine translation and manual translation, through a case study of the book Gone with the Wind. As an important carrier of human feeling and thinking, the literature translation poses a huge difficulty for machine translation, and it is supposed to expose distinct translation features apart from manual translation. In order to display linguistic features objectively, tentative uses of computerized and statistical evidence to the systematic investigation of large scale translation corpora by using quantitative methods have been deployed. This study compiles bilingual corpus with four versions of Chinese translations of the book Gone with the Wind, namely, Piao by Chunhai Fan, Piao by Huairen Huang, translations by Google Translation and Baidu Translation. After processing the corpus with the software of Stanford Segmenter, Stanford Postagger, and AntConc, etc., the study analyzes linguistic data and answers the following questions: 1. How does the machine translation differ from manual translation linguistically? 2. Why do these deviances happen? This paper combines translation study with the knowledge of corpus linguistics, and concretes divergent linguistic dimensions in translated text analysis, in order to present linguistic deviances in manual and machine translation. Consequently, this study provides a more accurate and more fine-grained understanding of machine translation products, and it also proposes several suggestions for machine translation development in the future.

Keywords: corpus-based analysis, linguistic deviances, machine translation, statistical evidence

Procedia PDF Downloads 109
4215 Theater Metaphor in Event Quantification: A Corpus Study

Authors: Zhuo Jing-Schmidt, Jun Lang

Abstract:

Numeral classifiers are common in Asian languages. Research on numeral classifiers primarily focuses on noun classifiers that quantify and individuate nominal referents. There is a scarcity of research on event quantification using verb classifiers. This study aims to understand the semantic and conceptual basis of event quantification in Chinese. From a usage-based Construction Grammar perspective, this study presents a corpus analysis of event quantification in Chinese. Drawing on a large balanced corpus of contemporary Chinese, we analyze 667 NOUN col-lexemes totaling 31136 tokens of a productive numeral classifier construction in Chinese. Using collostructional analysis of the collexemes, the results show that the construction quantifies and classifies dramatic events using a theater-based conceptual metaphor. We argue that the usage patterns reflect the cultural entrenchment of theater as in Chinese conceptualization and the construal of theatricality in linguistic expression. The study has implications for cognitive semantics and construction grammar.

Keywords: event quantification, classifier, corpus, metaphor

Procedia PDF Downloads 37
4214 An Online Corpus-Based Bilingual Collocations Dictionary for Second/Foreign Language Learners

Authors: Adriane Orenha-Ottaiano

Abstract:

Collocations are conventionalized, recurrent and arbitrary lexical combinations. Due to the fact that they are highly specific for a particular language and may be contextually restricted, collocations pose a problem to EFL/ESL learners with regard to production or encoding. Taking that into account, the compilation of monolingual and bilingual collocations dictionaries for the referred audience is highly crucial and significant. Thus, the aim of this paper is to discuss the importance of the compilation of an Online Corpus-based Bilingual Collocations Dictionary, in the English-Portuguese and Portuguese-English directions. On a first phase, with the use of WordSmith Tools, the collocations were extracted from a Translation Learner Corpus (TLC), a parallel corpus made up of university students’ translations in the Portuguese-English direction, with approximately 100,000 words. In a second stage, based on the keywords analyzed from the TLC, more collocational patterns were extracted using the Sketch Engine. In order to include more collocations as well as to ensure dictionary users will have access to more frequent and recurrent collocations, we also use the frequency list from The Corpus of Contemporary American English, with the purpose of extracting more patterns. The dictionary focuses on all types of collocations (verbal, noun, adjectival and adverbial collocations), in order to help the referred audience use them more accurately and productively – so far the dictionary has more than 330 entries, and more than 3,500 collocations extracted. The idea of having the proposed dictionary in online format may allow to incorporate more qualitatively and quantitatively collocational information. Besides, more examples may be included, different from conventional printed collocations dictionaries. Being the first bilingual collocations dictionary in the aforementioned directions, it is hoped to achieve the challenge of meeting learners’ collocational needs as the collocations have been selected according to learners’ difficulties regarding the use of collocations.

Keywords: Corpus-Based Collocations Dictionary, Collocations , Bilingual Collocations Dictionary, Collocational Patterns

Procedia PDF Downloads 277
4213 Corpus Linguistics as a Tool for Translation Studies Analysis: A Bilingual Parallel Corpus of Students’ Translations

Authors: Juan-Pedro Rica-Peromingo

Abstract:

Nowadays, corpus linguistics has become a key research methodology for Translation Studies, which broadens the scope of cross-linguistic studies. In the case of the study presented here, the approach used focuses on learners with little or no experience to study, at an early stage, general mistakes and errors, the correct or incorrect use of translation strategies, and to improve the translational competence of the students. Led by Sylviane Granger and Marie-Aude Lefer of the Centre for English Corpus Linguistics of the University of Louvain, the MUST corpus (MUltilingual Student Translation Corpus) is an international project which brings together partners from Europe and worldwide universities and connects Learner Corpus Research (LCR) and Translation Studies (TS). It aims to build a corpus of translations carried out by students including both direct (L2 > L1) an indirect (L1 > L2) translations, from a great variety of text types, genres, and registers in a wide variety of languages: audiovisual translations (including dubbing, subtitling for hearing population and for deaf population), scientific, humanistic, literary, economic and legal translation texts. This paper focuses on the work carried out by the Spanish team from the Complutense University (UCMA), which is part of the MUST project, and it describes the specific features of the corpus built by its members. All the texts used by UCMA are either direct or indirect translations between English and Spanish. Students’ profiles comprise translation trainees, foreign language students with a major in English, engineers studying EFL and MA students, all of them with different English levels (from B1 to C1); for some of the students, this would be their first experience with translation. The MUST corpus is searchable via Hypal4MUST, a web-based interface developed by Adam Obrusnik from Masaryk University (Czech Republic), which includes a translation-oriented annotation system (TAS). A distinctive feature of the interface is that it allows source texts and target texts to be aligned, so we can be able to observe and compare in detail both language structures and study translation strategies used by students. The initial data obtained point out the kind of difficulties encountered by the students and reveal the most frequent strategies implemented by the learners according to their level of English, their translation experience and the text genres. We have also found common errors in the graduate and postgraduate university students’ translations: transfer errors, lexical errors, grammatical errors, text-specific translation errors, and cultural-related errors have been identified. Analyzing all these parameters will provide more material to bring better solutions to improve the quality of teaching and the translations produced by the students.

Keywords: corpus studies, students’ corpus, the MUST corpus, translation studies

Procedia PDF Downloads 113