Search results for: Corpus interlanguage analysis
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 26961


26841 Men Act, Women Are Acted Upon: Morphosyntactic Framing of the Sexual Intercourse in Online Pornography Titles

Authors: Aleksandra Tomic

Abstract:

According to reliable sources, 4% of all websites are devoted to pornographic material, although these estimates are often reported to be much higher. The largest internet pornography streaming website reported 21.2 billion visits in 2015 alone. Considering the ubiquity of online pornography and the frequency of its use, it is necessary to examine its potential influence on the construal of the sexual act and the roles of participants. Apart from the verbal and physical interactions in the pornographic movies themselves, the language in the titles of movies has the power to frame the sexual act. In this study, Critical Discourse Analysis and corpus linguistics approaches will be used to examine the way sexual intercourse and the roles of the participants are ideologically construed and perpetuated in Internet pornography discourse. To this end, the study will explore the association between specific morphosyntactic aspects of the references to performers of both genders, namely person and thematic role, and the gender of the referred performer in a corpus of online pornographic movie titles. Distinctive collexeme analysis will be conducted to uncover possible associations between the gender of the performer denoted by the linguistic expression and the person and thematic role assigned to it in the titles of online pornography movies. Initial results of a chi-square procedure performed on a sample of 295 online pornography movie titles from the largest pornography streaming website, 'Pornhub', were significant. The use of the three person categories was not equally distributed between genders, χ²(2, N = 106) = 32.52, p < 0.001, with female performers being referred to in the third person in 71.7% of instances and speaking in the first person 20.8% of the time, whereas male performers spoke in the first person 68% of the time and were referred to in the third person in 17% of instances. Moreover, there was a gender disparity in the assignment of thematic roles, with linguistic expressions for women being assigned the Patient role and men the Agent role in 58.8% of cases, whereas the roles were reversed in 41.2% of instances, χ²(1, N = 262) = 8.08, p < 0.005. The results are discussed in terms of the ideologies surrounding female and male sexuality in pornography discourse. Potential patterns of power imbalance, objectification, and discrimination are highlighted. Finally, evidence from psycholinguistic studies on the influence of language structure on event construal is related to the results of the study.
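
A minimal sketch of the kind of chi-square test reported above, using scipy.stats; the contingency counts below are hypothetical placeholders rather than the study's data:

```python
# Hypothetical contingency table: rows = female/male performers,
# columns = first/second/third person references in titles.
from scipy.stats import chi2_contingency

observed = [[11, 4, 38],   # female performers (made-up counts)
            [36, 8, 9]]    # male performers (made-up counts)

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```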

Keywords: corpus linguistics, gender studies, pornography, thematic roles

Procedia PDF Downloads 149
26840 Assessment of the Validity of Sentiment Analysis as a Tool to Analyze the Emotional Content of Text

Authors: Trisha Malhotra

Abstract:

Sentiment analysis is a recent field of study that computationally assesses the emotional nature of a body of text. To assess its test validity, sentiment analysis was carried out on a corpus of emotional text drawn from a personal 15-day mood diary. Self-reported mood scores corresponded fairly closely with the daily mood evaluation scores given by the software. On further assessment, it was found that while sentiment analysis was good at assessing 'global' mood, it was not able to 'locally' identify and differentially score synonyms of various emotional words. It is further critiqued for treating the intensity of an emotion as universal across cultures. Finally, the software is shown not to account for emotional complexity in sentences, as it treats emotions as strictly positive or negative. Hence, it is posited that a better output would be two affect scores (positive and negative) for the same body of text.
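
A minimal sketch of the kind of lexicon-based sentiment scoring discussed above, using NLTK's VADER as a stand-in for the unnamed software (an assumption); note that VADER already reports separate positive and negative components alongside its compound score:

```python
# VADER stands in here for the diary-scoring software; the diary entry is invented.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
entry = "Work was exhausting today, but dinner with friends cheered me up."
scores = sia.polarity_scores(entry)
print(scores)   # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```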

Keywords: analysis, data, diary, emotions, mood, sentiment

Procedia PDF Downloads 239
26839 The Diary of Dracula, by Marin Mincu: Inquiries into a Romanian 'Book of Wisdom' as a Fictional Counterpart for Corpus Hermeticum

Authors: Lucian Vasile Bagiu, Paraschiva Bagiu

Abstract:

The novel, written in Italian and published in Italy in 1992 by the Romanian scholar Marin Mincu, is meant for the foreign reader, aiming apparently at a better knowledge of the historical character of Vlad the Impaler (Vlad Dracul) within the European cultural, political and historical context of 1463. Throughout the very well written tome, one comes to realize that one of the underlying levels of the fiction is the exposure of various fundamental features of Romanian culture and civilization. The author of the diary, Dracula, mentions Corpus Hermeticum no less than fifteen times, suggesting his own diary is some sort of philosophical counterpart. The essay focuses on several 'truths' and pieces of 'wisdom' revealed in the fictional teachings of Dracula. The boycott of History by the Romanians is identified as an echo of the philosophical approach of the famous Romanian scholar and writer Lucian Blaga. The orality of Romanian culture is a landmark opposed to the written culture of Western Europe. The religion of the ancient Dacian god Zalmoxis is seen as the basis for the Romanian existential and/or metaphysical ethnic philosophy (a feature tackled by the famous Romanian historian of religion Mircea Eliade), with the suggestion that Hermes Trismegistus may have written his Corpus Hermeticum under the influence of Zalmoxis. The historical figure of the last Dacian king, Decebalus (d. 106 AD), is a good pretext for a tantalizing Indo-European suggestion that the prehistoric Thraco-Dacian people may have been the ancestors of the first Romans settled in Latium. The lost diary of the Emperor Trajan, De Bello Dacico, may have proved that the unknown language of the Dacians was very much like Latin (a secret well hidden by the Vatican). The attitude of the Dacians towards death, as described by Herodotus, may have later inspired Pythagoras, Socrates, the Eleusinian and Orphic Mysteries, etc. All of this unfolds within the Humanist and Renaissance European context of the epoch, Dracula having a close relationship with scholars such as Nicolaus Cusanus, Cosimo de Medici, Marsilio Ficino, Pope Pius II, etc. Thus The Diary of Dracula turns out to be as exciting and stupefying as Corpus Hermeticum, a book impossible to assimilate entirely, yet a reference it would be unwise to ignore.

Keywords: Corpus Hermeticum, Dacians, Dracula, Zalmoxis

Procedia PDF Downloads 133
26838 Use of Ing-Formed and Derived Verbal Nominalization in American English: A Survey Applied to Native American English Speakers

Authors: Yujia Sun

Abstract:

Research on nominalizations in English can be traced back to at least the 1960s and remains a central topic in the field today. At the very beginning, the discussion was about the relationship between verbs and nouns, but it then moved to the distinct senses embodied in different forms of nominals, namely, various types of nominalizations. This paper addresses the issue of how speakers perceive different forms of verbal nouns and what might influence their perceptions. The data are collected through a self-designed questionnaire targeted at native speakers of American English and through the Corpus of Contemporary American English (COCA). The results show that semantic differences between different forms of nominals do play a role in people's preference for one form over another. However, it still awaits further exploration to see how frequency of usage interrelates with this issue.

Keywords: corpus of contemporary American English, derived nominalization, frequency of usage, ing-formed nominalization

Procedia PDF Downloads 152
26837 Direct Translation vs. Pivot Language Translation for Persian-Spanish Low-Resourced Statistical Machine Translation System

Authors: Benyamin Ahmadnia, Javier Serrano

Abstract:

In this paper, we compare two different approaches for translating from Persian to Spanish, a language pair with scarce parallel corpora. The first approach involves direct transfer using a statistical machine translation system, which is available for this language pair. The second approach involves translation through English as a pivot language, which has more translation resources and more advanced translation systems available. The results show that it is possible to achieve better translation quality using English as a pivot language, and that this approach outperforms direct translation from Persian to Spanish. Our best result is the pivot system, which scores 1.12 BLEU points higher than direct translation.
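
A minimal sketch of how the BLEU comparison between the direct and pivot systems could be computed with NLTK; the sentences are invented placeholders, not the paper's data:

```python
# Compare two hypothetical system outputs against the same reference with corpus BLEU.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["el", "gato", "está", "sobre", "la", "mesa"]]]   # one reference per segment
direct_hyp = [["el", "gato", "es", "en", "la", "mesa"]]          # invented direct-system output
pivot_hyp  = [["el", "gato", "está", "en", "la", "mesa"]]        # invented pivot-system output

smooth = SmoothingFunction().method1
bleu_direct = corpus_bleu(references, direct_hyp, smoothing_function=smooth)
bleu_pivot = corpus_bleu(references, pivot_hyp, smoothing_function=smooth)
print(f"direct: {bleu_direct:.4f}  pivot: {bleu_pivot:.4f}  "
      f"difference: {(bleu_pivot - bleu_direct) * 100:.2f} BLEU points")
```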

Keywords: statistical machine translation, direct translation approach, pivot language translation approach, parallel corpus

Procedia PDF Downloads 459
26836 A Framework for Chinese Domain-Specific Distant Supervised Named Entity Recognition

Authors: Qin Long, Li Xiaoge

Abstract:

Knowledge graphs have now become a new form of knowledge representation. However, there is no consensus regarding a plausible definition of entities and relationships in domain-specific knowledge graphs. Further, owing to several limitations and deficiencies, existing approaches to recognizing domain-specific entities and relationships are far from perfect. In particular, named entity recognition in Chinese domains is a critical task for natural language processing applications, and a bottleneck problem for Chinese named entity recognition in new domains is the lack of annotated data. To address this challenge, a domain-specific distantly supervised named entity recognition framework is proposed. The framework is divided into two stages: first, a distantly supervised corpus is generated based on an entity linking model built on a graph attention neural network; secondly, the generated corpus is used as input to train the distantly supervised named entity recognition model, which then extracts named entities. The linking model is verified on the CCKS2019 entity linking corpus, where its F1 value is 2% higher than that of the benchmark method. A re-pre-trained BERT language model is added to the benchmark method, and the results show that it is more suitable for distantly supervised named entity recognition tasks. Finally, the framework is applied in the computer domain, and the results show that it can extract domain named entities.
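
This is not the authors' framework, only a sketch of applying an off-the-shelf Chinese BERT token-classification model with Hugging Face transformers; the model identifier is an assumed publicly available checkpoint:

```python
# Sketch only: a generic pre-trained Chinese NER checkpoint, not the distantly
# supervised model described in the abstract.
from transformers import pipeline

ner = pipeline("token-classification",
               model="ckiplab/bert-base-chinese-ner",   # assumed Hub checkpoint
               aggregation_strategy="simple")

for ent in ner("华为公司在深圳发布了新的操作系统。"):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```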

Keywords: distant named entity recognition, entity linking, knowledge graph, graph attention neural network

Procedia PDF Downloads 69
26835 Variables, Annotation, and Metadata Schemas for Early Modern Greek

Authors: Eleni Karantzola, Athanasios Karasimos, Vasiliki Makri, Ioanna Skouvara

Abstract:

Historical linguistics unveils the historical depth of languages and traces variation and change by analyzing linguistic variables over time. This field of linguistics usually deals with a closed data set that can only be expanded by the (re)discovery of previously unknown manuscripts or editions. In some cases, it is possible to use (almost) the entire closed corpus of a language for research, as is the case with the Thesaurus Linguae Graecae digital library for Ancient Greek, which contains most of the extant ancient Greek literature. However, concerning ‘dynamic’ periods when the production and circulation of texts in printed as well as manuscript form have not been fully mapped, representative samples and corpora of texts are needed. Such material and tools are utterly lacking for Early Modern Greek (16th-18th c.). In this study, the principles of the creation of EMoGReC, a pilot representative corpus of Early Modern Greek (16th-18th c.) are presented. Its design follows the fundamental principles of historical corpora. The selection of texts aims to create a representative and balanced corpus that gives insight into diachronic, diatopic and diaphasic variation. The pilot sample includes data derived from fully machine-readable vernacular texts, which belong to 4-5 different textual genres and come from different geographical areas. We develop a hierarchical linguistic annotation scheme, further customized to fit the characteristics of our text corpus. Regarding variables and their variants, we use as a point of departure the bundle of twenty-four features (or categories of features) for prose demotic texts of the 16th c. Tags are introduced bearing the variants [+old/archaic] or [+novel/vernacular]. On the other hand, further phenomena that are underway (cf. The Cambridge Grammar of Medieval and Early Modern Greek) are selected for tagging. The annotated texts are enriched with metalinguistic and sociolinguistic metadata to provide a testbed for the development of the first comprehensive set of tools for the Greek language of that period. Based on a relational management system with interconnection of data, annotations, and their metadata, the EMoGReC database aspires to join a state-of-the-art technological ecosystem for the research of observed language variation and change using advanced computational approaches.
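
An illustrative sketch, not the actual EMoGReC schema, of what a token-level annotation record carrying a [+old/archaic] or [+novel/vernacular] variant tag and basic sociolinguistic metadata might look like:

```python
# Field names and values are illustrative assumptions, not the project's schema.
from dataclasses import dataclass, asdict

@dataclass
class VariantAnnotation:
    token: str
    variable: str        # the linguistic variable being tracked
    variant_tag: str     # "+old/archaic" or "+novel/vernacular"
    text_id: str
    genre: str
    region: str
    century: int

ann = VariantAnnotation(token="ἐγίνετο", variable="augment",
                        variant_tag="+old/archaic",
                        text_id="EMoGReC_0042", genre="chronicle",
                        region="Crete", century=17)
print(asdict(ann))
```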

Keywords: early modern Greek, variation and change, representative corpus, diachronic variables

Procedia PDF Downloads 32
26834 An Automatic Speech Recognition Tool for the Filipino Language Using the HTK System

Authors: John Lorenzo Bautista, Yoon-Joong Kim

Abstract:

This paper presents the development of a Filipino speech recognition tool using the HTK System. The system was trained on a subset of the Filipino Speech Corpus developed by the DSP Laboratory of the University of the Philippines-Diliman. The speech corpus was used both in training and in testing the system by estimating the parameters of phonetic HMM-based (Hidden Markov Model) acoustic models. Experiments on different mixture weights were incorporated in the study. The phoneme-level, word-based recognition of a 5-state HMM resulted in an average accuracy rate of 80.13% for a single-Gaussian mixture model, 81.13% after implementing a phoneme alignment, and 87.19% for the increased Gaussian-mixture weight model. The highest accuracy rate of 88.70% was obtained from a 5-state model with 6 Gaussian mixtures.
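
A minimal sketch of the standard word accuracy metric behind figures like those above (as reported by HTK-style scoring), Acc = (N - D - S - I) / N; the counts are illustrative:

```python
# N = reference words, D = deletions, S = substitutions, I = insertions (made-up counts).
def word_accuracy(n_ref: int, deletions: int, substitutions: int, insertions: int) -> float:
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

print(f"{word_accuracy(1000, 40, 60, 13):.2f}%")  # -> 88.70% with these illustrative counts
```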

Keywords: Filipino language, Hidden Markov Model, HTK system, speech recognition

Procedia PDF Downloads 442
26833 A Lexicographic Approach to Obstacles Identified in the Ontological Representation of the Tree of Life

Authors: Sandra Young

Abstract:

The biodiversity literature is vast and heterogeneous. In today's data age, a number of data integration and standardisation initiatives aim to facilitate simultaneous access to the literature across biodiversity domains for research and forecasting purposes. Ontologies are being used increasingly to organise this information, but the rationalisation intrinsic to ontologies can hit obstacles when faced with the fluidity and inconsistency found in the domains comprising biodiversity. Essentially, the problem is a conceptual one: biological taxonomies are formed on the basis of specific, physical specimens, yet nomenclatural rules are used to provide labels to describe these physical objects. These labels are ambiguous representations of the physical specimen. An example is the name Melpomene, which serves as the scientific nomenclatural representation of a genus of ferns but also of a genus of spiders. The physical specimens for each of these are vastly different, but they have been assigned the same nomenclatural reference. While there is much research into the conceptual stability of the taxonomic concept versus the nomenclature used, to the best of our knowledge no research has yet looked empirically at the literature to see the conceptual plurality or singularity of the use of these species' names, the linguistic representation of a physical entity. Language itself uses words as symbols to represent real-world concepts, whether physical entities or otherwise, and as such lexicography has a well-founded history in the conceptual mapping of words in context for dictionary making. This makes it an ideal candidate to explore this problem. The lexicographic approach uses corpus-based analysis to look at word use in context, with a specific focus on collocated word frequencies (the frequencies of words used in specific grammatical and collocational contexts). It allows for inconsistencies and contradictions in the source data and in fact includes these in the word characterisation so that 100% of the available evidence is counted. Corpus analysis is indeed suggested as one of the ways to identify concepts for ontology building, because of its ability to look empirically at data and show patterns in language usage, which can indicate conceptual ideas that go beyond words themselves. In this sense, it could potentially be used to identify whether the hierarchical structures present within the empirical body of literature match those which have been identified in ontologies created to represent them. The first stages of this research have revealed a hierarchical structure that becomes apparent in the biodiversity literature when annotating scientific species' names, common names and more general names as classes, which will be the focus of this paper. The next step in the research focuses on a larger corpus in which specific words can be analysed and then compared with existing ontological structures covering the same material, to evaluate the methods by means of an alternative perspective. This research aims to provide evidence as to the validity of current methods in knowledge representation for biological entities, and also to shed light on the way that scientific nomenclature is used within the literature.

Keywords: ontology, biodiversity, lexicography, knowledge representation, corpus linguistics

Procedia PDF Downloads 108
26832 The Universal Cultural Associations in the Conceptual Metaphors Used in the Headlines of Arab News and Saudi Gazette Newspapers: A Critical Cognitive Study

Authors: Hind Hassan Arruwaite

Abstract:

Conceptual metaphor is a cognitive semantic tool that provides access to people's conceptual systems. Correlations in the human conceptual system surpass limited time spans and specific cultures, and these universal associations provide universal schemas that organize people's conceptualization of the world. The study aims to explore how the cultural associations used in conceptual metaphors create commonalities and harmony between people of the world. In the research methodology, the researcher implemented the Critical Metaphor Analysis, Metaphor Candidate Identification and Metaphor Identification Procedure models to deliver qualitative and descriptive findings. Semantic tension was the key criterion in identifying metaphorically used words in the headlines. The research materials are the oil trade conceptual metaphors used in the headlines of the Arab News and Saudi Gazette newspapers. The data are uploaded to a self-constructed corpus so that electronic lists can be examined for identifying conceptual metaphors. The study investigates the types of conceptual metaphors used in the headlines of the newspapers, the cultural associations identified in the conceptual metaphors, and whether the identified cultural associations in conceptual metaphors create universal conceptual schemas. The study aligns with previous seminal works on conceptual metaphor theory in emphasizing the distinctive power of conceptual metaphors in exposing the cultural associations that unify people's perceptions. The correlation of people's conceptualizations provides universal schemas that involve elements of human sensorimotor experience. The study contributes to exposing the shared cultural associations that underpin the commonality of humankind's thinking mechanisms.

Keywords: critical discourse analysis, critical metaphor analysis, conceptual metaphor theory, primary and specific metaphors, corpus-driven approach, universal associations, image schema, sensorimotor experience, oil trade

Procedia PDF Downloads 175
26831 Differences in Assessing Hand-Written and Typed Student Exams: A Corpus-Linguistic Study

Authors: Jutta Ransmayr

Abstract:

The digital age has long since arrived at Austrian schools, and both society and educationalists demand that digital means be integrated accordingly into day-to-day school routines. Therefore, the Austrian school-leaving exam (A-levels) can now be written either by hand or by using a computer. However, the choice of writing medium (pen and paper or computer) for written examination papers, which are considered 'high-stakes' exams, raises a number of questions that have not yet been adequately investigated and answered, such as: What effects do the different conditions of text production in the written German A-levels have on normative linguistic accuracy? How do the spelling skills shown in German A-level papers written with a pen differ from those in papers the students wrote on the computer? And how is the teachers' assessment related to this? Which practical desiderata for German didactics can be derived from this? These questions were investigated in a trilateral pilot project of the Austrian Center for Digital Humanities (ACDH) of the Austrian Academy of Sciences and the University of Vienna, in cooperation with the Austrian Ministry of Education and the Council for German Orthography. A representative Austrian learner corpus, consisting of around 530 German A-level papers from all over Austria (pen- and computer-written), was set up in order to subject it to a quantitative (corpus-linguistic and statistical) and qualitative investigation with regard to the spelling and punctuation performance of the high school graduates, the differences between pen- and computer-written papers, and their assessments. Relevant studies are currently available mainly from the Anglophone world. These have shown that writing on the computer increases the motivation to write and has positive effects on the length of the text and, in some cases, also on the quality of the text. Depending on the writing situation and other technical aids, better results in terms of spelling and punctuation could also be found in the computer-written texts as compared to the handwritten ones. Studies also point towards a tendency among teachers to rate handwritten texts better than computer-written texts. In this paper, the first comparable results from the German-speaking area are presented. The results show that, on the one hand, there are significant differences between handwritten and computer-written work with regard to performance in orthography and punctuation. On the other hand, the corpus-linguistic investigation and the subsequent statistical analysis made it clear that not only do the teachers' assessments of the students' spelling performance vary enormously, but so do the overall assessments of the exam papers; the production medium (pen and paper or computer) also seems to play a decisive role.

Keywords: exam paper assessment, pen and paper or computer, learner corpora, linguistics

Procedia PDF Downloads 138
26830 A Corpus Study of English Verbs in Chinese EFL Learners’ Academic Writing Abstracts

Authors: Shuaili Ji

Abstract:

The correct use of verbs is an important element of high-quality research articles; thus, for Chinese EFL learners, it is important to master the characteristics of verbs and to use them precisely. However, some research has shown that there are differences between learners' and native speakers' use of verbs and that learners have difficulty using English verbs. This corpus-based quantitative research can enhance learners' knowledge of English verbs and improve the quality of research article abstracts and of academic writing as a whole. The aim of this study is to find the differences between learners' and native speakers' use of verbs and to study the factors that contribute to those differences. To this end, the research question is as follows: What are the differences between the verbs most frequently used by learners and those most frequently used by native speakers? The research question is answered through a study that uses a corpus-based, data-driven approach to analyze the verbs used by learners in their abstract writing in terms of collocation, colligation and semantic prosody. The results show that: (1) EFL learners clearly overused 'be, can, find, make' and underused 'investigate, examine, may'. As to modal verbs, learners clearly overused 'can' while underusing 'may'. (2) Learners clearly overused 'we find + object clause' while underusing 'nouns (results, findings, data) + suggest/indicate/reveal + object clause' when expressing research results. (3) Learners tended to transfer the collocation, colligation and semantic prosody of shǐ and zuò to make. (4) Learners clearly overused 'BE + V-ed' and used BE as the main verb. They also clearly overused the base forms of BE such as be, is, are, while underusing its inflected forms (was, were). These results reflect learners' lack of accuracy and idiomaticity in verb usage. Due to conceptual transfer from Chinese, the verbs in learners' abstracts showed obvious mother-tongue transfer. In addition, learners have not fully mastered the use of verbs, avoiding complex colligations in order to prevent errors. Based on these findings, the present study has implications for English teaching and for English academic abstract writing in China. Further research could be undertaken to study the use of verbs in whole dissertations to find out whether the characteristics of the verbs in abstracts also apply to the dissertation as a whole.
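
A minimal sketch of the kind of overuse/underuse comparison behind such findings, using the log-likelihood (G²) keyness statistic common in learner-corpus studies; the frequencies are invented:

```python
# G2 keyness for one word observed freq_a times in corpus A (size total_a tokens)
# and freq_b times in corpus B (size total_b tokens). Counts below are invented.
import math

def log_likelihood(freq_a: int, freq_b: int, total_a: int, total_b: int) -> float:
    e1 = total_a * (freq_a + freq_b) / (total_a + total_b)   # expected freq in A
    e2 = total_b * (freq_a + freq_b) / (total_a + total_b)   # expected freq in B
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / e1)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / e2)
    return 2 * ll

# e.g. 'find' in a learner corpus vs. a native expert corpus (invented counts)
print(round(log_likelihood(320, 140, 100_000, 120_000), 2))
```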

Keywords: academic writing abstracts, Chinese EFL learners, corpus-based, data-driven, verbs

Procedia PDF Downloads 304
26829 MicroRNA in Bovine Corpus Luteum during Early Pregnancy

Authors: Rreze Gecaj, Corina Schanzenbach, Benedikt Kirchner, Michael Pfaffl, Bajram Berisha

Abstract:

The maintenance of the corpus luteum (CL) during early pregnancy in cattle is a critical and multifarious process. A luteotrophic mechanism originating from the embryo is widely accepted as the triggering signal for CL maintenance. In cattle, it is the secretion of interferon-tau (IFNT) from the conceptus that prevents CL regression and ensures progesterone production for the establishment of pregnancy. In addition to endocrine and paracrine signals, microRNAs (miRNA) can also support CL sustainability during early pregnancy. MiRNAs are small non-coding nucleic acids that regulate gene expression post-transcriptionally and have been shown to be involved in the modulation of CL function. However, the role of miRNAs in corpus luteum function during early pregnancy still remains largely unexplored. This study aims at profiling the expression of miRNA in the CL during early pregnancy in cattle by comparing it with the CL from the late cycle and with the regressed CL. Corpora lutea were assigned to two different groups during the cycle (group C13, late CL: days 13-18, and group C18, regressed CL: day >18) and during early pregnancy (group P: months 1-2). The estrous cycle was determined by macroscopic examination, and the fetus was aged by crown-rump length measurement. A total of 9 corpora lutea from individual animals were included in the study, three corpora lutea for each group. The miRNA population was profiled using small RNA next-generation sequencing, and biologically significant miRNAs were evaluated for their differential expression using the DESeq2 methodology. We show that 6 differentially expressed miRNAs (bta-mir-2890, -2332, -2441-3p, -148b, -1248 and -29c) are common to both comparisons, P vs C13 and P vs C18, while for each stage individually we identified unique miRNAs differentially expressed only in the given comparison: bta-miR-23a and -769 were the unique miRNAs differentially expressed in P vs C13, whereas forty-four unique miRNAs were identified as differentially expressed in P vs C18. These data confirm that miRNAs are highly abundant in luteal tissue during early pregnancy and potentially regulate CL maintenance at this stage of fetal development.

Keywords: bovine, corpus luteum, microRNA, pregnancy, RNA-Seq

Procedia PDF Downloads 230
26828 Wavelets Contribution on Textual Data Analysis

Authors: Habiba Ben Abdessalem

Abstract:

The emergence of giant sets of textual data is what has encouraged researchers to invest in this field. The purpose of textual data analysis methods is to facilitate access to such data by providing various graphic visualizations. Applying these methods requires a corpus pretreatment step, whose standards are set according to the objective of the problem studied. This step determines the list of forms contained in the contingency table by keeping only those that carry information. This step may, however, lead to noisy contingency tables, hence the use of a wavelet denoising function. The validity of the proposed approach is tested on a text database covering economic and political events in Tunisia over a well-defined period.
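
A minimal sketch, not the paper's implementation, of wavelet denoising applied to one profile of a noisy contingency table using PyWavelets; the wavelet, decomposition level, and threshold are illustrative choices:

```python
# Toy frequency profile standing in for one column of a contingency table.
import numpy as np
import pywt

profile = np.array([12., 0., 3., 15., 1., 0., 9., 14., 2., 1., 8., 13.])

coeffs = pywt.wavedec(profile, "db2", level=2)                      # decompose
coeffs = [coeffs[0]] + [pywt.threshold(c, 2.0, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db2")[: len(profile)]              # reconstruct, trim padding

print(np.round(denoised, 2))
```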

Keywords: textual data, wavelet, denoising, contingency table

Procedia PDF Downloads 256
26827 Corpus-Based Neural Machine Translation: Empirical Study Multilingual Corpus for Machine Translation of Opaque Idioms - Cloud AutoML Platform

Authors: Khadija Refouh

Abstract:

Culture-bound expressions have been a bottleneck for natural language processing (NLP) and comprehension, especially in the case of machine translation (MT). In the last decade, the field of machine translation has greatly advanced. Neural machine translation (NMT) has recently achieved considerable improvements in translation quality and has outperformed previous traditional translation systems in many language pairs. NMT applies artificial intelligence (AI) and deep neural networks to language processing. Despite this development, there remain some serious challenges facing NMT when translating culture-bound expressions, especially for low-resource language pairs such as Arabic-English and Arabic-French, which is not the case with well-established language pairs such as English-French. Machine translation of opaque idioms from English into French is likely to be more accurate than translating them from English into Arabic. For example, the Google Translate application translated the sentence "What a bad weather! It rains cats and dogs." into Arabic as "يا له من طقس سيء! تمطر القطط والكلاب", an inaccurate literal translation. The translation of the same sentence into French was "Quel mauvais temps! Il pleut des cordes.", where Google Translate used the accurate corresponding French idiom. This paper aims to perform NMT experiments towards better translation of opaque idioms using a high-quality, clean multilingual corpus. This corpus will be collected analytically from human-generated idiom translations. AutoML Translation, a Google neural machine translation platform, is used as a custom translation model to improve the translation of opaque idioms. The automatic evaluation of the custom model will be compared to that of Google NMT using the Bilingual Evaluation Understudy score (BLEU). BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Human evaluation is integrated to test the reliability of the BLEU score. The researcher will examine syntactic, lexical, and semantic features using Halliday's functional theory.

Keywords: multilingual corpora, natural language processing (NLP), neural machine translation (NMT), opaque idioms

Procedia PDF Downloads 108
26826 A Bayesian Approach for Analyzing Academic Article Structure

Authors: Jia-Lien Hsu, Chiung-Wen Chang

Abstract:

Research articles may follow a simple and succinct structure of organizational patterns, called moves. For example, considering extended abstracts, we observe that an extended abstract usually consists of five moves: Background, Aim, Method, Results, and Conclusion. As another example, when publishing articles in PubMed, authors are encouraged to provide a structured abstract, which is an abstract with distinct and labeled sections (e.g., Introduction, Methods, Results, Discussion) for rapid comprehension. This paper introduces a method for computational analysis of move structures (i.e., Background-Purpose-Method-Result-Conclusion) in abstracts and introductions of research documents, instead of a manual, time-consuming and labor-intensive analysis process. In our approach, sentences in a given abstract and introduction are automatically analyzed and labeled with a specific move (i.e., B-P-M-R-C in this paper) to reveal their rhetorical status. As a result, it is expected that an automatic analytical tool for move structures will help non-native speakers or novice writers become aware of appropriate move structures and internalize relevant knowledge to improve their writing. In this paper, we propose a Bayesian approach to determine move tags for research articles. The approach consists of two phases, a training phase and a testing phase. In the training phase, we build a Bayesian model based on a couple of given initial patterns and the corpus, a subset of CiteSeerX. In the beginning, the prior probability of the Bayesian model relies solely on the initial patterns. Subsequently, with respect to the corpus, we process each document one by one: extract features, determine tags, and update the Bayesian model iteratively. In the testing phase, we compare our results with tags manually assigned by the experts. In our experiments, the proposed approach reaches a promising accuracy of 56%.
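
This is not the paper's implementation, only a compact sketch of a Bayesian (multinomial Naive Bayes) move classifier over bag-of-words features with scikit-learn; the training sentences are invented:

```python
# One invented training sentence per move (B-P-M-R-C); a real model needs far more data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "Little is known about feedback in academic writing.",            # Background
    "This study aims to examine the effect of feedback on revision.", # Purpose
    "Data were collected from 120 participants via questionnaires.",  # Method
    "The results show a significant difference between the groups.",  # Result
    "These findings suggest implications for writing instruction.",   # Conclusion
]
moves = ["B", "P", "M", "R", "C"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(sentences, moves)
print(clf.predict(["We investigate whether peer feedback improves coherence."]))
```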

Keywords: academic English writing, assisted writing, move tag analysis, Bayesian approach

Procedia PDF Downloads 304
26825 Understanding the Top Questions Asked about Hong Kong by Travellers Worldwide through a Corpus-Based Discourse Analytic Approach

Authors: Phoenix W. Y. Lam

Abstract:

As one of the most important service-oriented industries in contemporary society, tourism has increasingly seen the influence of the Internet on all aspects of travelling. Travellers nowadays habitually research online before making travel-related decisions. One platform on which such research is conducted is destination forums. The emergence of such online destination forums in the last decade has allowed tourists to share their travel experiences quickly and easily with a large number of online users around the world. As such, these destination forums also provide invaluable data for tourism bodies to better understand travellers’ views on their destinations. Collecting posts from the Hong Kong travel forum on the world’s largest travel website TripAdvisor®, the present study identifies the top questions asked by TripAdvisor users about Hong Kong through a corpus-based discourse analytic approach. Based on questions posted on the forum and their associated meta-data gathered in a one-year period, the study examines the top questions asked by travellers around the world to identify the key geographical locations in which users have shown the greatest interest in the city. Questions raised by travellers from different geographical locations are also compared to see if traveller communities by location vary in terms of their areas of interest. This analysis involves the study of key words and concordance of frequently-occurring items and a close reading of representative examples in context. Findings from the present study show that travellers who asked the most questions about Hong Kong are from North America and Asia, and that travellers from different locations have different concerns and interests, which are clearly reflected in the language of the questions asked on the travel forum. These findings can therefore provide tourism organisations with useful information about the key markets that should be targeted for promotional purposes, and can also allow such organisations to design advertising campaigns which better address the specific needs of such markets. The present study thus demonstrates the value of applying linguistic knowledge and methodologies to the domain of tourism to address practical issues.
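
A minimal sketch of the concordance step described above, using NLTK on a few invented forum-style posts (the actual data come from the TripAdvisor Hong Kong forum):

```python
# Invented forum-style questions; tokenisation is kept to a simple whitespace split.
from nltk import Text

posts = (
    "Where should we stay in hong kong for three nights with kids ? "
    "Is the octopus card worth buying for the mtr ? "
    "Best dim sum near tsim sha tsui , and how do we get to the peak ?"
)
tokens = posts.lower().split()
Text(tokens).concordance("hong", width=60)
```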

Keywords: corpus, Hong Kong, online travel forum, tourism, TripAdvisor

Procedia PDF Downloads 155
26824 Number Variation of the Personal Pronoun We in American Spoken English

Authors: Qiong Hu, Ming Yue

Abstract:

Language variation signals the newest usage of a language community, which might become the developmental trend of that language. The personal pronoun we is prescribed as a plural pronoun in grammar, but its number value is more flexible in actual use. Based on a self-compiled Friends corpus, the present research explores the number value of the first person pronoun we in present-day spoken American English. In consideration of the subjectivity of we, this paper used 'we + PCU (perception-cognition-utterance) verbs' collocations and 'we + plural categories' as the parameters. Results from corpus data and manual annotation show that: 1) the overall frequency of we has been increasing; 2) we has been increasingly used with other plural categories, indicating a weakening of its plural reference; and 3) we has been increasingly used with PCU (perception-cognition-utterance) verbs of strong subjectivity, indicating a strengthening of its singular reference. All these seem to support our hypothesis that we is undergoing further grammaticalization towards a singular reference, though future evidence is needed to attest this bold prediction.

Keywords: number, PCU verbs, personal pronoun we

Procedia PDF Downloads 204
26823 Critical Discourse Analysis of President Mamnoon Hussain's Speech in the Joint Session of Parliament

Authors: Saeed Qaisrani

Abstract:

This article briefly reviews the rise of Critical Discourse Analysis (CDA) in relation to the speech by Pakistani President Mamnoon Hussain delivered in the joint session of Parliament, and teases out a detailed analysis of the various critiques that have been levelled at CDA and its practitioners over the last twenty years, both by scholars working within the 'critical' paradigm and by other critics. A range of criticisms are discussed which target the underlying premises, the analytical methodology and the disputed areas of reader response and the integration of contextual factors. Controversial issues such as the predominantly negative focus of much CDA scholarship, and the status of CDA as an emergent 'intellectual orthodoxy', are also reviewed. The conclusions offer a summary of the principal criticisms that emerge from this overview and suggest some ways in which these problems could be attenuated. The article also focuses on the different views about the president's speech and how it is presented in the Pakistani print and electronic media.

Keywords: critical discourse analysis, analytical methodology, corpus linguistics, reader response theory, critical paradigm, contextualization

Procedia PDF Downloads 453
26822 Neologisms and Word-Formation Processes in Board Game Rulebook Corpus: Preliminary Results

Authors: Athanasios Karasimos, Vasiliki Makri

Abstract:

This research focuses on the design and development of the first text corpus based on board game rulebooks (BGRC), with direct application to the morphological analysis of neologisms and tendencies in word-formation processes. Corpus linguistics is a dynamic field that examines language through the lens of vast collections of texts. These corpora consist of diverse written and spoken materials, ranging from literature and newspapers to transcripts of everyday conversations. By morphologically analyzing these extensive datasets, morphologists can gain valuable insights into how language functions and evolves, as such datasets reflect the byproducts of inflection, derivation, blending, clipping, compounding, and neology. This entails scrutinizing how words are created, modified, and combined to convey meaning in a corpus of challenging, creative, and straightforward texts that include rules, examples, tutorials, and tips. Board games teach players how to strategize, consider alternatives, and think flexibly, which are critical elements in language learning. Their rulebooks reflect not only their weight (complexity) but also the language properties of each genre and subgenre of these games. Board games are a captivating realm where strategy, competition, and creativity converge. Beyond the excitement of gameplay, board games also spark the art of word creation. Word games like Scrabble, Codenames, Bananagrams, Wordcraft, Alice in the Wordland, and Once Upon a Time challenge players to construct words from a pool of letters, thus encouraging linguistic ingenuity and vocabulary expansion. These games foster a love for language, motivating players to unearth obscure words and devise clever combinations. The designers and creators, in turn, produce rulebooks in which they convey the joy of discovering the hidden potential of language, igniting the imagination and playing with the beauty of words, making these games a delightful fusion of linguistic exploration and leisurely amusement. In this research, more than 150 rulebooks in English from all types of modern board games, either language-independent or language-dependent, are used to create the BGRC. A representative sample of each genre (family, party, worker placement, deckbuilding, dice and chance games, strategy, eurogames, thematic, role-playing, among others) was selected based on the score from BoardGameGeek, the size of the texts, and the level of complexity (weight) of the game. A morphological model with morphological networks, multi-word expressions, and word-creation mechanics based on the complexity of the textual structure, difficulty, and board game category will be presented. By enabling the identification of patterns, trends, and variations in word formation and other morphological processes, this research aspires to take advantage of this creative yet strict text genre so as to (a) give invaluable insight into the morphological creativity and innovation that (re)shape the lexicon of the English language and (b) test morphological theories. Overall, it is shown that corpus linguistics empowers us to explore the intricate tapestry of language, and morphology in particular, revealing its richness, flexibility, and adaptability in the ever-evolving landscape of human expression.

Keywords: board game rulebooks, corpus design, morphological innovations, neologisms, word-formation processes

Procedia PDF Downloads 58
26821 Sentence Structure for Free Word Order Languages in Context with Anaphora Resolution: A Case Study of Hindi

Authors: Pardeep Singh, Kamlesh Dutta

Abstract:

Many languages have a fixed sentence structure, while others have free word order. The accuracy of syntax-based anaphora resolution algorithms depends on the structure of the sentence, so it is important to analyze the structure of a language before implementing these algorithms. In this study, we analyzed sentence structure exploiting the case markers in Hindi as well as some special tags for subject and object. We also investigated the word order of Hindi. Word order typology refers to the study of the order of the syntactic constituents of a language. We analyzed 165 news items from Ranchi Express in the EMILLE plain-text corpus, consisting of 1,745 sentences. Eight dialogue-based files from the same corpus, comprising 1,521 sentences, were also analyzed. The percentages of subject-object-verb (SOV) and object-subject-verb (OSV) structures are 66.90 and 33.10, respectively.

Keywords: anaphora resolution, free word order languages, SOV, OSV

Procedia PDF Downloads 443
26820 Displaying Compostela: Literature, Tourism and Cultural Representation, a Cartographic Approach

Authors: Fernando Cabo Aseguinolaza, Víctor Bouzas Blanco, Alberto Martí Ezpeleta

Abstract:

Santiago de Compostela became a stable object of literary representation during the period between 1840 and 1915, approximately. This study offers a partial cartographical look at this process, suggesting that a cultural space like Compostela’s becoming an object of literary representation paralleled the first stages of its becoming a tourist destination. We use maps as a method of analysis to show the interaction between a corpus of novels and the emerging tradition of tourist guides on Compostela during the selected period. Often, the novels constitute ways to present a city to the outside, marking it for the gaze of others, as guidebooks do. That leads us to examine the ways of constructing and rendering communicable the local in other contexts. For that matter, we should also acknowledge the fact that a good number of the narratives in the corpus evoke the representation of the city through the figure of one who comes from elsewhere: a traveler, a student or a professor. The guidebooks coincide in this with the emerging fiction, of which the mimesis of a city is a key characteristic. The local cannot define itself except through a process of symbolic negotiation, in which recognition and self-recognition play important roles. Cartography shows some of the forms that these processes of symbolic representation take through the treatment of space. The research uses GIS to find significant models of representation. We used the program ArcGIS for the mapping, defining the databases starting from an adapted version of the methodology applied by Barbara Piatti and Lorenz Hurni’s team at the University of Zurich. First, we designed maps that emphasize the peripheral position of Compostela from a historical and institutional perspective using elements found in the texts of our corpus (novels and tourist guides). Second, other maps delve into the parallels between recurring techniques in the fictional texts and characteristic devices of the guidebooks (sketching itineraries and the selection of zones and indexicalization), like a foreigner’s visit guided by someone who knows the city or the description of one’s first entrance into the city’s premises. Last, we offer a cartography that demonstrates the connection between the best known of the novels in our corpus (Alejandro Pérez Lugín’s 1915 novel La casa de la Troya) and the first attempt to create package tourist tours with Galicia as a destination, in a joint venture of Galician and British business owners, in the years immediately preceding the Great War. Literary cartography becomes a crucial instrument for digging deeply into the methods of cultural production of places. Through maps, the interaction between discursive forms seemingly so far removed from each other as novels and tourist guides becomes obvious and suggests the need to go deeper into a complex process through which a city like Compostela becomes visible on the contemporary cultural horizon.

Keywords: Compostela, literary geography, literary cartography, tourism

Procedia PDF Downloads 368
26819 Studying Language of Immediacy and Language of Distance from a Corpus Linguistic Perspective: A Pilot Study of Evaluation Markers in French Television Weather Reports

Authors: Vince Liégeois

Abstract:

Language of immediacy and distance: Within their discourse theory, Koch & Oesterreicher establish a distinction between a language of immediacy and a language of distance. The former refers to those discourses which are oriented more towards a spoken norm, whereas the latter entails discourses oriented towards a written norm, regardless of whether they are realised phonically or graphically. This means that an utterance can be realised phonically but oriented more towards the written language norm (e.g., a scientific presentation or eulogy) or realised graphically but oriented towards a spoken norm (e.g., a scribble or chat messages). Research desiderata: The methodological approach from Koch & Oesterreicher has often been criticised for not providing a corpus-linguistic methodology, which makes it difficult to work with quantitative data or address large text collections within this research paradigm. Consequently, the Koch & Oesterreicher approach has difficulties gaining ground in those research areas which rely more on corpus-linguistic research models, like text linguistics and LSP research. A combinatory approach: Accordingly, we want to establish a combinatory approach with corpus-based linguistic methodology. To this end, we propose to (i) include data about the context of an utterance (e.g., monologicity/dialogicity, familiarity with the speaker), which were called "conditions of communication" in the original work of Koch & Oesterreicher, and (ii) correlate the linguistic phenomenon at the centre of the inquiry (e.g., evaluation markers) to a group of linguistic phenomena deemed typical for either distance- or immediacy-language. Based on these two parameters, linguistic phenomena and texts could then be mapped on an immediacy-distance continuum. Pilot study: To illustrate the benefits of this approach, we will conduct a pilot study on evaluation phenomena in French television weather reports, a form of domain-sensitive discourse which has often been cited as an example of a "text genre". Within this text genre, we will look at so-called "evaluation markers," e.g., fixed strings like bad weather, stifling hot, and "no luck today!". These evaluation markers help to communicate the coming weather situation to the lay audience but have not yet been studied within the Koch & Oesterreicher research paradigm. Accordingly, we want to figure out whether said evaluation markers are more typical of those weather reports which tend more towards immediacy or of those which tend more towards distance. To this aim, we collected a corpus with different kinds of television weather reports, e.g., as part of the news broadcast, including dialogue. The evaluation markers themselves will be studied according to the explained methodology, by correlating them to (i) metadata about the context and (ii) linguistic phenomena characterising immediacy-language: repetition, deixis (personal, spatial, and temporal), a freer choice of tense, and right-/left-dislocation. Results: Our results indicate that evaluation markers are more dominantly present in those weather reports inclining towards immediacy-language. Based on the methodology established above, we have gained more insight into the working of evaluation markers in the domain-sensitive text genre of (television) weather reports. For future research, it will be interesting to determine whether said evaluation markers are also typical of immediacy-oriented language in other domain-sensitive discourses.

Keywords: corpus-based linguistics, evaluation markers, language of immediacy and distance, weather reports

Procedia PDF Downloads 182
26818 Ideological Manipulations and Cultural-Norm Constraints

Authors: Masoud Hassanzade Novin, Bahloul Salmani

Abstract:

Translation cannot be considered a simple linguistic act. With the rise of the descriptive approach in the late 1970s and 1980s, the study of translation came to take in social aspects as well as linguistic ones. Viewing translation as cross-cultural communication through which various cultures communicate under ideological and cultural constraints, a contrastive analysis was conducted in this paper to reveal the distortions imposed on the translated texts. The corpus of the study comprised the novel 1984 by George Orwell and its Persian translations, which were analyzed through qualitative research based on critical discourse analysis (CDA) and Toury's norms, as well as Lefevere's concepts of ideology. Results of the study revealed that ideology and cultural constraints are important stimuli that can control the process of translation.

Keywords: critical discourse analysis, ideology, norms, translated texts

Procedia PDF Downloads 311
26817 Comparison of Verb Complementation Patterns in Selected Pakistani and British English Newspaper Social Columns: A Corpus-Based Study

Authors: Zafar Iqbal Bhatti

Abstract:

The present research aims to examine and evaluate the frequencies and uses of verb complementation patterns in English newspaper social columns published in Pakistan and Britain. The research will demonstrate that Pakistani English is a non-native variety of English with its own regular and systematic characteristics at the syntactic level, shaped by the native languages and culture, and it will make users of the variety aware that differences from British or American English which are systematic and regular are not erroneous forms but characteristic features of the variety. The objectives are, first, to examine the verb complementation patterns that British and Pakistani social columnists use in relation to their syntactic categories and, second, to compare the verb complementation patterns used in Pakistani and British English newspaper social columns. This study will identify the various verb complementation patterns in Pakistani and British English newspaper social columns and their occurrence and distribution. Word classes express different functions of words, such as action, event, or state of being. This research aims to evaluate whether there are any appreciable differences in the verb complementation patterns used in Pakistani and British English newspaper social columns. The results will show the range of verb complementation patterns in the selected English newspaper social columns. The study will fill a gap left by previous studies in this field, which have explored only a little of the differences between Pakistani and British English newspapers. It will also examine the variety of language used in Pakistani and British English journals, as well as regional and cultural values and variations. The researcher will use the AntConc software to extract the data for analysis, using its concordance tool to identify verb complementation patterns in the selected data; the patterns will then be categorized manually, because the same form can sometimes be used for various purposes. A four-month written corpus of the social columns of Pakistani English (PE) and British English (BE) newspapers, covering 1st June 2022 to 30th September 2022, will be collected and analyzed. For the analysis of the research questions, 50 social columns will be selected from Pakistani newspapers and 50 from British newspapers, providing a representative sample of data from Pakistani and British English newspaper social columns. The researcher will manually analyze the complementation patterns of each verb in each sentence and then determine how frequently each pattern occurs, drawing on the syntactic characteristics of verb complementation elements as described by Downing and Locke (2006). Finally, every possible verb complementation pattern in the Pakistani and British English data will be identified, and the frequency and distribution of each pattern will be evaluated using the software.

Keywords: verb complementation, syntactic categories, newspaper social columns, corpus

Procedia PDF Downloads 17
26816 A Corpus-Based Approach to Understanding Market Access in Fisheries and Aquaculture: A Systematic Literature Review

Authors: Cheryl Marie Cordeiro

Abstract:

Although fisheries and aquaculture studies might seem marginal to international business (IB) studies in general, fisheries and aquaculture IB (FAIB) management is currently facing increasing pressure to meet global demand and consumption for fish in the coming decades. In part to address this challenge, the purpose of this systematic literature review (SLR) study is to investigate the use of the term 'market access' in its context of use in the generic literature and business sector discourse, in comparison to the more specific literature and discourse in fisheries, aquaculture and seafood. This SLR aims to uncover the knowledge and interest gaps between the academic subject discourses and business sector practices. Corpus-driven in methodology and using a triangulation method of three different text analysis tools, namely AntConc, VOSviewer and Web of Science (WoS) analytics, the SLR results indicate a gap in conceptual knowledge and business practices between how 'market access' is conceived and used in the context of the pharmaceutical healthcare industry and in FAIB research and practice. While it is acknowledged that the product orientation of different business sectors might differ, this SLR study works with the assumption that both business sectors are global in orientation. These business sectors are complex in their operations from product to market. This SLR suggests a conceptual model for understanding the challenges, the potential barriers, as well as avenues for solutions in developing market access for FAIB.

Keywords: market access, fisheries and aquaculture, international business, systematic literature review

Procedia PDF Downloads 123
26815 L2 Reading in Distance Education: Analysis of Students' Reading Attitude and Interests

Authors: Ma. Junithesmer, D. Rosales

Abstract:

The study is a baseline description of students’ attitudes and interests regarding L2 reading in a state university in the Philippines that uses distance education as a delivery mode. Most research conducted in this area has dealt with the analysis of reading in a traditional school set-up. For this reason, this research was undertaken to discover what students’ preferences, interests and attitudes reveal about L2 reading in a non-traditional set-up. The corpus of the study comprised the literature and studies on reading together with data on the students’ preferred technological devices, the titles of books and authors they read, their preferred reading medium (traditional/print versus electronic books), their interests and feelings when reading at home and in school, and their views about their strengths and weaknesses as readers.

Keywords: distance education, L2 reading, reading, reading attitude

Procedia PDF Downloads 319
26814 Sentiment Analysis of Social Media Comments with Bidirectional Long Short-Term Memory, Gated Recurrent Unit and GloVe Model in Portuguese

Authors: Leonardo Alfredo Mendoza, Cristian Munoz, Marco Aurelio Pacheco, Manoela Kohler, Evelyn Batista, Rodrigo Moura

Abstract:

Natural Language Processing (NLP) techniques are increasingly powerful in interpreting a person's feelings and reactions to a product or service. Sentiment analysis has become a fundamental tool for this interpretation but has few applications in languages other than English. This paper presents a sentiment analysis classification for Portuguese, built on a base of comments from social networks in Portuguese. A word embedding representation was used with a 50-dimension pre-trained GloVe model, generated from a corpus entirely in Portuguese. To generate the classification, bidirectional long short-term memory (BiLSTM) and bidirectional Gated Recurrent Unit (GRU) models are used, reaching results of 99.1%.
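As a rough sketch of the kind of architecture the abstract describes, the following Python code initializes an embedding layer from 50-dimensional GloVe vectors and stacks a bidirectional LSTM and a bidirectional GRU. The file name, vocabulary size, sequence length, layer sizes and binary output are assumptions for illustration; this is not the authors' implementation.

import numpy as np
from tensorflow.keras import layers, models, initializers

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 100, 50

def load_glove(path="glove_s50_pt.txt"):
    # Read pre-trained GloVe vectors (hypothetical Portuguese file name).
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors):
    # word_index maps words to integer ids, e.g. from a fitted tokenizer.
    matrix = np.zeros((VOCAB_SIZE, EMB_DIM))
    for word, idx in word_index.items():
        if idx < VOCAB_SIZE and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

def build_model(embedding_matrix):
    # Frozen GloVe embeddings feeding a BiLSTM and then a BiGRU layer.
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, EMB_DIM, trainable=False,
                         embeddings_initializer=initializers.Constant(embedding_matrix)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.GRU(32)),
        layers.Dense(1, activation="sigmoid"),  # positive vs. negative comment
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model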

Keywords: natural processing language, sentiment analysis, bidirectional long short-term memory, BI-LSTM, gated recurrent unit, GRU

Procedia PDF Downloads 130
26813 Spanish Language Violence Corpus: An Analysis of Offensive Language in Twitter

Authors: Beatriz Botella-Gil, Patricio Martínez-Barco, Lea Canales

Abstract:

The Internet and ICTs are integral and omnipresent elements of our daily lives. Technologies have changed the way we see the world and relate to it. The number of companies in the ICT sector increases every year, and more and more work takes place online, from sending e-mails to the ways companies promote themselves. In social life, ICTs have gained momentum: social networks are useful for keeping in contact with family or friends who live far away. This change in how we manage our relationships through electronic devices and social media is experienced differently depending on the age of the person. According to currently available data, people are increasingly connected to social media and other forms of online communication. It is therefore no surprise that violent content has also made its way into digital media. One important reason for this is the anonymity provided by social media, which fosters a sense of impunity. Moreover, it is not uncommon to find derogatory comments attacking a person’s physical appearance, hobbies, or beliefs. This is why it is necessary to develop artificial intelligence tools that allow us to keep track of violent comments relating to violent events, so that this type of violent online behavior can be deterred. The objective of our research is to create a guide for detecting and recording violent messages. Our annotation guide begins with a study of the problem of violent messages: first, the characteristics that a message should contain for it to be categorized as violent, and second, the possibility of establishing different levels of aggressiveness. To compile the corpus, we chose the social network Twitter because of the ease of obtaining freely available messages. We selected two recent, highly visible violent cases that occurred in Spain, both of which received a high degree of social media coverage and user comments. Our corpus contains a total of 633 messages, manually tagged according to the characteristics we considered important, such as the verbs used, the presence of exclamations or insults, and the presence of negations. We consider it necessary to create wordlists of items that appear in violent messages and serve as indicators of violence, such as lists of negative verbs, insults and negative phrases. As a final step, we will use machine learning systems to check the data obtained and the effectiveness of our guide.
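As a toy illustration of how such indicator wordlists could flag candidate messages for manual annotation, the following Python sketch checks a tweet against small Spanish wordlists and surface cues (exclamations, negations). The wordlists, example tweet, and function name are invented for illustration and are not the authors' guide or their lists; real lists would be far larger and would need lemmatization to match inflected verb forms.

import re

# Invented placeholder wordlists standing in for the guide's indicator lists.
INSULTS = {"idiota", "imbécil", "inútil"}
NEGATIVE_VERBS = {"odio", "destruir", "golpear"}
NEGATIONS = {"no", "nunca", "jamás"}

def violence_indicators(tweet):
    # Tokenize on word characters (handles accented Spanish letters).
    tokens = re.findall(r"\w+", tweet.lower(), flags=re.UNICODE)
    return {
        "insult": any(t in INSULTS for t in tokens),
        "negative_verb": any(t in NEGATIVE_VERBS for t in tokens),
        "negation": any(t in NEGATIONS for t in tokens),
        "exclamation": "!" in tweet or "¡" in tweet,
    }

tweet = "¡No te soporto, idiota!"
flags = violence_indicators(tweet)
if any(flags.values()):
    print("candidate for manual annotation:", flags)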

Keywords: human language technologies, language modelling, offensive language detection, violent online content

Procedia PDF Downloads 99
26812 Named Entity Recognition System for Tigrinya Language

Authors: Sham Kidane, Fitsum Gaim, Ibrahim Abdella, Sirak Asmerom, Yoel Ghebrihiwot, Simon Mulugeta, Natnael Ambassager

Abstract:

The lack of annotated datasets is a bottleneck to the progress of NLP in low-resourced languages. The work presented here consists of large-scale annotated datasets and models for the named entity recognition (NER) system for the Tigrinya language. Our manually constructed corpus comprises over 340K words tagged for NER, with over 118K of the tokens also having parts-of-speech (POS) tags, annotated with 12 distinct classes of entities, represented using several types of tagging schemes. We conducted extensive experiments covering convolutional neural networks and transformer models; the highest performance achieved is 88.8% weighted F1-score. These results are especially noteworthy given the unique challenges posed by Tigrinya’s distinct grammatical structure and complex word morphologies. The system can be an essential building block for the advancement of NLP systems in Tigrinya and other related low-resourced languages and serve as a bridge for cross-referencing against higher-resourced languages.
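As a small illustration of the weighted F1-score metric cited above, the following Python sketch scores a toy sequence of token-level NER tags with scikit-learn. The tag set and predictions are invented placeholders, not the Tigrinya corpus or the authors' evaluation script.

from sklearn.metrics import classification_report, f1_score

# Invented gold and predicted token-level tags for a short sentence.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O", "B-ORG", "B-PER"]

print(classification_report(gold, pred, zero_division=0))
# Weighted F1 averages per-class F1 scores, weighted by each class's support.
print("weighted F1:", round(f1_score(gold, pred, average="weighted", zero_division=0), 3))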

Keywords: Tigrinya NER corpus, TiBERT, TiRoBERTa, BiLSTM-CRF

Procedia PDF Downloads 67