Search results for: multilingual corpora
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 239

119 Mondoc: Informal Lightweight Ontology for Faceted Semantic Classification of Hypernymy

Authors: M. Regina Carreira-Lopez

Abstract:

Lightweight ontologies seek to specify union relationships between a parent node and a secondary node, also called a "child node". This logic relation (L) can be formally defined as a triple ontological relation (LO) equivalent to LO in ⟨LN, LE, LC⟩, where LN represents a finite set of nodes (N); LE is a set of entities (E), each of which represents a relationship between nodes to form a rooted tree of ⟨LN, LE⟩; and LC is a finite set of concepts (C), encoded in a formal language (FL). Mondoc enables more refined searches on semantic and classified facets for retrieving specialized knowledge about Atlantic migrations, from the Declaration of Independence of the United States of America (1776) to the end of the Spanish Civil War (1939). The model aims to increase documentary relevance by applying an inverse frequency of co-occurring hypernymy phenomena to a concrete dataset of textual corpora, using the RMySQL package. Mondoc profiles archival utilities implementing SQL programming code and allows data export to XML schemas, in order to achieve semantic and faceted analysis of speech by analyzing keywords in context (KWIC). The methodology applies random and unrestricted sampling techniques with RMySQL to verify the resonance phenomena of inverse documentary relevance between the number of co-occurrences of the same term (t) in more than two documents of a set of texts (D). Secondly, the research shows that co-associations between (t) and its corresponding synonyms and antonyms (synsets) are also inverse. The results from grouping facets or polysemic words with synsets in more than two textual corpora within their syntagmatic context (nouns, verbs, adjectives, etc.) indicate how to proceed with semantic indexing of hypernymy phenomena for subject-heading lists and for authority lists for documentary and archival purposes.
Mondoc contributes to the development of web directories and seems to achieve a proper and more selective search of e-documents (classification ontology). It can also foster the production of on-line catalogs for semantic authorities, or concepts, through XML schemas, because its applications could be used for implementing data models after a prior adaptation of the base ontology to structured meta-languages such as OWL and RDF (descriptive ontology). Mondoc serves to classify concepts and applies a semantic indexing approach to facets. It enables information retrieval, as well as quantitative and qualitative data interpretation. The model reproduces a tuple ⟨LN, LE, LT, LCF L, BKF⟩ where LN is a set of entities that connect with other nodes to form a rooted tree in ⟨LN, LE⟩. LT specifies a set of terms, and LCF acts as a finite set of concepts, encoded in a formal language, L. Mondoc resolves only partial problems of linguistic ambiguity (in the case of synonymy and antonymy); neither the pragmatic dimension of natural language nor the cognitive perspective is addressed. To address these, forthcoming programming developments should target oriented meta-languages with structured documents in XML.
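
The two text-analysis ideas named in the abstract, an inverse frequency over documents containing a term (t) and keywords in context (KWIC), can be illustrated with a minimal Python sketch. This is not the Mondoc implementation, which the abstract describes as built on SQL and the RMySQL package. The function names, the log(N / df) weighting, the context window and the toy corpus D below are all assumptions made for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def inverse_document_frequency(term, documents):
    """Return log(N / df), where df is the number of documents containing term."""
    df = sum(1 for doc in documents if term in tokenize(doc))
    if df == 0:
        return 0.0  # term absent from the corpus: no weight assigned
    return math.log(len(documents) / df)


def kwic(term, text, window=2):
    """Return (left context, keyword, right context) for each occurrence of term."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hypothetical toy corpus D; not taken from the Mondoc dataset.
D = [
    "the ship carried migrants across the atlantic",
    "migrants left the port on a ship",
    "the archive holds letters from the port",
]

print(inverse_document_frequency("ship", D))  # term in 2 of 3 documents
print(kwic("ship", D[1]))
```

A term that appears in many documents of D receives a lower weight than one concentrated in a few, which is the sense in which relevance is "inverse" to the number of co-occurrences across documents.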

Keywords: hypernymy, information retrieval, lightweight ontology, resonance

Procedia PDF Downloads 101
118 Multimodal Database of Emotional Speech, Video and Gestures

Authors: Tomasz Sapiński, Dorota Kamińska, Adam Pelikant, Egils Avots, Cagri Ozcinar, Gholamreza Anbarjafari

Abstract:

People express emotions through different modalities. Integration of verbal and non-verbal communication channels creates a system in which the message is easier to understand. Expanding the focus to several expression forms can facilitate research on emotion recognition as well as human-machine interaction. In this article, the authors present a Polish emotional database composed of three modalities: facial expressions, body movement and gestures, and speech. The corpus contains recordings registered in studio conditions, acted out by 16 professional actors (8 male and 8 female). The data is labeled with six basic emotion categories, following Ekman’s emotion categories. To check the quality of performance, all recordings are evaluated by experts and volunteers. The database is available to the academic community and might be useful in the study of audio-visual emotion recognition.

Keywords: body movement, emotion recognition, emotional corpus, facial expressions, gestures, multimodal database, speech

Procedia PDF Downloads 322
117 The Landscape of Multilingualism in the Urban Community of Limassol

Authors: Antigoni Parmaxi, Anna Nicolaou, Salomi Papadima-Sophocleous, Dimitrios Boglou

Abstract:

This study provides an overview of the sociolinguistic situation of an under-researched city, Limassol, Cyprus, with regard to multilingualism and plurilingualism. More specifically, it explores issues pertaining to multilingualism and plurilingualism in education, the public sphere, economic life, the private sphere, and urban spaces. Through an examination of Limassol’s history of language diversity, as well as through an analysis of the city from a contemporary point of view, the study attempts to portray the multilingual Limassol of yesterday and of today. Findings demonstrate several aspects of multilingualism, such as how communication is achieved among the citizens, how the city encourages multilingualism, and what policies and practices are implemented in the various spheres in order to promote intercultural dialogue and mutual understanding. Based on the findings, suggestions for best practices, for the introduction or improvement of policies, and for visions of the city are put forward.

Keywords: language diversity, social inclusion, multilingualism, language visibility, language policy

Procedia PDF Downloads 445
116 ‘Daily Speaking’: Designing an App for Construction of Language Learning Model Supporting ‘Seamless Flipped’ Environment

Authors: Zhou Hong, Gu Xiao-Qing, Liu Hong-Jiao, Leng Jing

Abstract:

Seamless learning has become a research hotspot in recent years, and the emergence of micro-lectures and the flipped classroom has strengthened its development. Based on the characteristics of seamless learning across time and space, the course structure of the flipped classroom, and theories of language learning, we put forward a language learning model that can support a ‘seamless flipped’ environment (abbreviated as ‘S-F’). The characteristics of the ‘S-F’ learning environment, the corresponding framework construction, and the activity design of diversified corpora are also introduced. Moreover, a language learning app named ‘Daily Speaking’ was developed to facilitate the practice of the language learning model in the ‘S-F’ environment. Using the learning of the Shanghai language as a case, the rationality and feasibility of this framework were examined, with the aim of providing a reference for the design of ‘S-F’ learning in different situations.

Keywords: seamless learning, flipped classroom, seamless-flipped environment, language learning model

Procedia PDF Downloads 151
115 The Grammatical Dictionary Compiler: A System for Kartvelian Languages

Authors: Liana Lortkipanidze, Nino Amirezashvili, Nino Javashvili

Abstract:

The purpose of the grammatical dictionary is to provide information on the morphological and syntactic characteristics of the basic word in the dictionary entry. Electronic grammatical dictionaries are used as a tool for automated morphological analysis in text processing. The Georgian Grammatical Dictionary should contain grammatical information for each word: part of speech, type of declension/conjugation, grammatical forms of the word (paradigm), and alternative variants of the basic word/lemma. In this paper, we present a system for compiling the Georgian Grammatical Dictionary automatically. We propose dictionary-based methods for extending grammatical lexicons. The input lexicon contains only a small number of words with identical grammatical features. The extension is based on similarity measures between features of words; more precisely, we add to the extended lexicons words that are similar to those already in the grammatical dictionary. Our dictionaries are corpus-based, and for compiling them we introduce a method for the lemmatization of unknown words, i.e., words for which neither the full form nor the lemma is in the grammatical dictionary.

Keywords: acquisition of lexicon, Georgian grammatical dictionary, lemmatization rules, morphological processor

Procedia PDF Downloads 116
114 Bridging the Data Gap for Sexism Detection in Twitter: A Semi-Supervised Approach

Authors: Adeep Hande, Shubham Agarwal

Abstract:

This paper presents a study on identifying sexism in online texts using various state-of-the-art deep learning models based on BERT. We experimented with different feature sets and model architectures and evaluated their performance using precision, recall, F1 score, and accuracy metrics. We also explored the use of a pseudolabeling technique to improve model performance. Our experiments show that the best-performing models were based on BERT, and the multilingual model achieved an F1 score of 0.83. Furthermore, pseudolabeling significantly improved the performance of the BERT-based models, which achieved their best results with this technique. Our findings suggest that BERT-based models with pseudolabeling hold great promise for identifying sexism in online texts with high accuracy.
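
For readers unfamiliar with the evaluation metrics named above, the following minimal Python sketch shows how precision, recall, F1 score and accuracy are derived from binary predictions. The function name and the toy labels are hypothetical and are not taken from the paper's data or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical gold labels and predictions (1 = sexist, 0 = not sexist).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(classification_metrics(gold, pred))
```

F1 is the harmonic mean of precision and recall, so it falls sharply when either one is low. This makes it a common headline metric for imbalanced detection tasks.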

Keywords: large language models, semi-supervised learning, sexism detection, data sparsity

Procedia PDF Downloads 36
113 Armenian in the Jordanian Linguistic Landscape: Marginalisation and Revitalisation

Authors: Omar Alomoush

Abstract:

This paper examines the Armenian language in the linguistic landscape (LL) of Jordanian cities. The results indicate that Armenian is chiefly marginalised in the LL. Using quantitative and qualitative methods, the current study attempts to identify the main reasons behind this marginalisation. Given that Armenian is completely absent from the commercial streets of major Jordanian cities, all monolingual and multilingual signs in the Armenian Neighbourhood of Amman were photographed and classified according to function and language. To provide plausible explanations for the marginalisation of the Armenian language in the LL, the current study builds upon issues of language maintenance and underlying language policy. According to the UNESCO Endangerment Framework, it can be assumed that Armenian is a vulnerable language, even though the Armenian Church has exerted great efforts to revitalise Armenian in all social settings, including the LL. It was found that language policies enacted by the state of Jordan, language shift, language hostility, voluntary migration and economic pressures are among the reasons behind this marginalisation.

Keywords: linguistic landscape, multilingualism, Armenian, marginalisation and revitalisation

Procedia PDF Downloads 235
112 Integrating ICT-Based Applications for Sustainable Tourism Development in Algeria

Authors: Boutkhil Guemide, Chellali Benachaiba

Abstract:

Information and Communication Technology (ICT) has an inevitable impact on different industries and their performance. The tourism industry, as the largest and fastest growing industry in the world, cannot be excluded from this technology and its huge impact. ICT provides information about tourist attractions in different destinations before travelling and may improve tourists’ satisfaction. Although Algeria has great tourism potential, it has yet to perform well in promoting its attractions to international tourists via ICT tools. This research explores the impact of ICT on foreign tourists’ satisfaction with the tourism industry, uses Algerian tourist agencies as a case study, and proposes a model for the impact of ICT on sustainable tourism. Finally, it is concluded that e-ticketing, e-reservation, online payment, and multilingual, up-to-date information websites are essential needs for planning strategies in the field of e-tourism. It is also recommended that the tourism authorities develop e-tourism infrastructures in order to remain competitive in this field and enable the country to share in the global benefits of the tourism industry.

Keywords: Information and Communications Technology (ICT), tourism, tourists’ satisfaction, sustainable tourism

Procedia PDF Downloads 506
111 The Assessment of Bilingual Students: How Bilingual Can It Really Be?

Authors: Serge Lacroix

Abstract:

The proposed study looks at the psychoeducational assessment of bilingual students, in English and French in this case. It will be an opportunity to look at the language of assessment and specifically at how certain tests can be administered in one language and others in another language. It also questions the validity of the test scores that are obtained as well as the quality and generalizability of the conclusions that can be drawn. Bilingualism and multiculturalism, although in constant expansion, are not considered in norms development and remain poorly understood factors when they are at play in the context of a psychoeducational assessment. Student placement, diagnoses, and accurate measures of intelligence and achievement are all impacted by the quality of the assessment procedure. The same is true for questionnaires administered to parents and self-reports completed by bilingual students who, more often than not, are assessed in a language that is not their primary one or are compared to monolinguals not dealing with the same challenges or the same skills. Results show that a significant proportion of students, when offered the chance to work in a bilingual fashion, choose to do so. Recommendations will be offered to support educators aiming to expand their skills when confronted with multilingual students in an assessment context.

Keywords: psychoeducational assessment, bilingualism, multiculturalism, intelligence, achievement

Procedia PDF Downloads 427
110 Development of Fake News Model Using Machine Learning through Natural Language Processing

Authors: Sajjad Ahmed, Knut Hinkelmann, Flavio Corradini

Abstract:

Fake news detection research is still at an early stage, as the phenomenon has only recently attracted society’s interest. Machine learning helps to solve complex problems and to build AI systems, especially in cases involving tacit knowledge or knowledge that is not known. We used machine learning algorithms to identify fake news, applying three classifiers: Passive Aggressive, Naïve Bayes, and Support Vector Machine. Simple classification is not completely correct in fake news detection because classification methods are not specialized for fake news. With the integration of machine learning and text-based processing, we can detect fake news and build classifiers that can classify the news data. Text classification mainly focuses on extracting various features of text and then incorporating those features into classification. The big challenge in this area is the lack of an efficient way to differentiate between fake and non-fake news due to the unavailability of corpora. We applied three different machine learning classifiers to two publicly available datasets. Experimental analysis based on the existing datasets indicates a very encouraging and improved performance.

Keywords: fake news detection, natural language processing, machine learning, classification techniques

Procedia PDF Downloads 130
109 Silence the Silence No More: A Translanguaging Analysis of Two Silent Scenes in Wong Kar-Wai’s Multi-Genre Film ‘2046’

Authors: Liu M. Hanmin

Abstract:

Wong Kar-Wai’s multi-genre film 2046, which had its world premiere in 2004, comes with a vibrant mediascape made up of multiple named languages, code-switching, intertitles, news footage from the real world, and extra-linguistic means of communication. In film and multilingual studies, it is still a challenge to incorporate non-languages into an analytical framework with conventional languages. This paper uses translanguaging theory to read silent practices in Wong Kar-Wai’s 2046. The two scenes that feature the experience of silence most prominently are taken as case studies. In these two scenes, we can identify two tropes of intersemiotic relationships that are co-articulated by silence: patriarchy and unfinished romance, respectively. The conclusion argues that silence in Wong Kar-Wai’s 2046 exerts multimodal agency by ‘speaking’ directly to the audience and in mutual directions to characters. Thereby, it moves beyond the passive role of merely accentuating or assisting the aural register of a film.

Keywords: translanguaging, Wong Kar-Wai, multimodality, semiotics, inter semiotics, Hong Kong media, film culture

Procedia PDF Downloads 55
108 The Language of Fliptop among Filipino Youth: A Discourse Analysis

Authors: Bong Borero Lumabao

Abstract:

This qualitative research is a study of the lines of Fliptop talks performed by Fliptop rappers, employing Finnegan’s (2008) discourse analysis. The paper aimed to analyze the phonological, morphological, and semantic features of Fliptop talk, to explore the structures in the lines of Fliptop among Filipino youth, and to uncover the various insights that can be gained from it. The corpus of the study comprised 20 Fliptop videos downloaded from the YouTube channel of Fliptop. Results revealed that Fliptop contains phonological features such as assonance, consonance, deletion, lengthening, and rhyming. Morphological features include acronyms, affixation, blending, borrowing, code-mixing and switching, compounding, conversion or functional shifts, and dysphemism. The semantic analysis covered the lexical categories, meanings, and words used in the Fliptop talks. The structure of Fliptop revolves around personal attacks (physical attributes), attacks on the bars (rapping skills), extension to family members and friends, antithesis, profane words, figurative language, sexual undertones, anime characters, homosexuality, and the involvement of famous celebrities.

Keywords: discourse analysis, fliptop talks, filipino youth, fliptop videos, Philippines

Procedia PDF Downloads 201
107 Corporate Cautionary Statement: A Genre of Professional Communication

Authors: Chie Urawa

Abstract:

Cautionary statements or disclaimers in corporate annual reports need to be carefully designed because clear cautionary statements may protect a company in the case of legal disputes and may undermine positive impressions. This study compares the language of cautionary statements using two corpora, Sony’s cautionary statement corpus (S-corpus) and Panasonic’s cautionary statement corpus (P-corpus), illustrating the differences and similarities in relation to the use of meaningful cautionary statements and critically analyzing why practitioners write them the way they do. The findings describe distinct differences between the two companies in the presentation of the risk factors and in the way they make the statements. The word ability is used more for legal protection in the S-corpus, whereas the word possibility is used more to convey a better impression in the P-corpus. The main similarities are identified in the use of lexical words and pronouns, and in almost identical wording over eight years. The findings show how each company makes its statements unique in the presentation of risk factors, and reveal the characteristics of a specific genre of professional communication. An important implication of this study is that a more comprehensive approach can be applied in other contexts and used by companies to reflect upon their cautionary statements.

Keywords: cautionary statements, corporate annual reports, corpus, risk factors

Procedia PDF Downloads 138
106 A Syntactic Errors Analysis in the Malaysian ESL Learners' Written Composition

Authors: Annie Gedion, Johan Severinus Tati, Jacinta Caroline Peter

Abstract:

Syntax error analysis studies have a significant role in English language teaching, especially in second language contexts. This study investigates the syntax errors in written compositions by 50 multilingual ESL learners at Politeknik Kota Kinabalu Sabah, Malaysia. The subjects speak their own dialect, Malay as their second language, and English as their third or foreign language. Data were collected from written discourse in the form of descriptive essays. The subjects were asked to write in the classroom within 45 minutes. Fifteen categories of errors were classified into a set of syntactic categories and were analysed based on the five steps of the syntactic analysis procedure. The findings of the study showed that mother tongue interference, as well as a lack of vocabulary and grammar knowledge, were the major sources of syntax errors in the learners’ written compositions. Learners should be exposed to the differences between Malay and English grammar to avoid interference and to support effective learning of second language writing.

Keywords: errors analysis, syntactic analysis, English as a second language, ESL writing

Procedia PDF Downloads 259
105 A Comparison of the First Language Vocabulary Used by Indonesian Year 4 Students and the Vocabulary Taught to Them in English Language Textbooks

Authors: Fitria Ningsih

Abstract:

This study concerns the process of building a corpus from Indonesian year 4 students’ free writing and comparing it to the vocabulary taught in English language textbooks. Writing samples from 369 students at 19 public elementary schools in Malang, East Java, Indonesia, and 5 selected English textbooks were analyzed using corpus linguistics methods with the AdTAT (Adelaide Text Analysis Tool) program. The findings produced wordlists of the top 100 words most frequently used by students and the top 100 words given in English textbooks. There was a 45% match between the two lists. Furthermore, classification of the top 100 most frequent words from the two corpora by part of speech found that the Indonesian and English corpora made similar use of nouns, verbs, adjectives, and prepositions. Moreover, to assess how well the vocabulary of the learning materials is contextualized to the students’ needs, an in-depth analysis of the content and cultural views of the vocabulary taught in the textbooks was carried out using criteria developed from the checklist. Lastly, suggestions are addressed to language teachers to understand the students’ background, such as recognizing the basic words students have acquired before teaching them new vocabulary, in order to achieve successful learning of the target language.
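
The wordlist comparison described above can be sketched in a few lines of Python. This is an illustrative approximation only: the study used the AdTAT program, and the abstract does not say how Indonesian and English words were aligned before matching. The sketch assumes both lists are already expressed in a comparable form. The function names and the sample data are hypothetical.

```python
import re
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent word tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


def overlap_percentage(list_a, list_b):
    """Percentage of words in list_a that also occur in list_b."""
    if not list_a:
        return 0.0
    shared = set(list_a) & set(list_b)
    return 100.0 * len(shared) / len(list_a)


# Hypothetical wordlists standing in for the two top-100 lists.
student_list = ["school", "mother", "friend", "play"]
textbook_list = ["school", "book", "play", "teacher"]
print(overlap_percentage(student_list, textbook_list))
```

With two lists of equal length, as in the study's two top-100 lists, the percentage is the same whichever list is taken as the reference.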

Keywords: corpus, frequency, English, Indonesian, linguistics, textbooks, vocabulary, wordlists, writing

Procedia PDF Downloads 158
104 Study of Multimodal Resources in Interactions Involving Children with Autistic Spectrum Disorders

Authors: Fernanda Miranda da Cruz

Abstract:

This paper aims to systematize, descriptively and analytically, the relations between language, body and material world explored in a specific empirical context: everyday co-presence interactions between children diagnosed with Autistic Spectrum Disorder (ASD) and various interlocutors. We work with a 20-hour audiovisual corpus in Brazilian Portuguese. The analysis focuses on 1) analyzing daily interactions in which subjects with a diagnosis of ASD are present or participate, from an embodied interaction perspective; 2) studying the status and role of gestures, body and material world in the construction and constitution of human interaction and its relation with linguistic-cognitive processes and Autistic Spectrum Disorders; and 3) highlighting questions related to the field of video analysis, such as procedures for recording interactions in complex environments (involving many participants, use of objects and body movement); the construction of audiovisual corpora for linguistic-interaction research; and the invitation to a visual analytical mentality of human social interactions that involves not only their verbal aspects but also the physical space, the body and the material world.

Keywords: autism spectrum disease, multimodality, social interaction, non-verbal interactions

Procedia PDF Downloads 86
103 Socioeconomic Status and Gender Influence on Linguistic Change: A Case Study on Language Competence and Confidence of Multilingual Minority Language Speakers

Authors: Stefanie Siebenhütter

Abstract:

Male and female speakers use language differently and with varying confidence levels. This paper contrasts gendered differences in language use with socioeconomic status and age factors. It specifically examines how Kui minority language use and competence are conditioned by the variable of gender and discusses potential reasons for this variation by examining gendered language awareness and sociolinguistic attitudes. Moreover, it discusses whether women in Kui society function as 'leaders of linguistic change', as represented in Labov’s sociolinguistic model, and whether societal role expectations in collectivistic cultures influence the model of linguistic change. The findings reveal current Kui speaking preferences and give predictions on prospective language use, which is a stable situation of multilingualism because the current Kui speakers will socialize and teach the prospective Kui speakers in the near future. It further confirms that Lao is losing importance in Kui speakers’ (females’) daily lives.

Keywords: gender, identity construction, language change, minority language, multilingualism, sociolinguistics, social networks

Procedia PDF Downloads 142
102 Algerian Case Study of Age Effect and Cross Linguistic Influence in Third Language Phonology Acquisition

Authors: Zouleykha Belabbes

Abstract:

Learning foreign languages is a sine qua non in the era of globalization, mobility, and communications, as it grants access and connectedness to the world. This urgent need is highlighted in monolingual settings; in multilingual contexts, however, the case is, to some extent, more complicated. In effect, research on bilingualism and multilingualism leads to the issue of Cross Linguistic Influence (CLI), which seeks to explain how and under which conditions prior linguistic knowledge of a first language (L1) and/or second language (L2) influences the production, comprehension and development of a third language (L3) or additional language (Ln). Moreover, the issue of age is also one of the persistent topics in the field of language acquisition. This paper aims to scrutinize the effect of age and of two previously known languages, Arabic (L1) and French (L2), on acquiring English (L3) phonology in the Algerian context. The study involved 20 participants of different age ranges who were presented with recorded samples of English (L3). The findings confirm the results of some previous studies on the Critical Period Hypothesis (CPH) and demonstrate a tendency for L2 phonological transfer in L3 production at the initial stages of acquisition among both young and later learners, which in some circumstances diminished as L3 proficiency developed.

Keywords: acquisition, age effect, cross linguistic influence, L3 phonology

Procedia PDF Downloads 205
101 Cross-Dialect Sentence Transformation: A Comparative Analysis of Language Models for Adapting Sentences to British English

Authors: Shashwat Mookherjee, Shruti Dutta

Abstract:

This study explores linguistic distinctions among American, Indian, and Irish English dialects and assesses the ability of various large language models (LLMs) to generate British English translations from these dialects. Using cosine similarity analysis, the study measures the linguistic proximity between original British English translations and those produced by LLMs for each dialect. The findings reveal that Indian and Irish English translations maintain notably high similarity scores, suggesting strong linguistic alignment with British English. In contrast, American English exhibits slightly lower similarity, reflecting its distinct linguistic traits. Additionally, the choice of LLM significantly impacts translation quality, with Llama-2-70b consistently demonstrating superior performance. The study underscores the importance of selecting the right model for dialect translation, emphasizing the role of linguistic expertise and contextual understanding in achieving accurate translations.
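
Cosine similarity between a reference sentence and a generated one can be sketched as follows. The abstract does not state which text representation the authors used. This sketch assumes simple bag-of-words count vectors as a stand-in, and the function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Map a sentence to a word-count vector (an assumed representation)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the two sentences' count vectors."""
    va, vb = bag_of_words(sentence_a), bag_of_words(sentence_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical reference and model output; not taken from the study's data.
reference = "I have already eaten lunch"
candidate = "I already ate lunch"
print(cosine_similarity(reference, candidate))
```

Scores range from 0 (no shared words) to 1 (identical word distributions). With dense sentence embeddings instead of word counts the formula is the same, only the vectors change.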

Keywords: cross-dialect translation, language models, linguistic similarity, multilingual NLP

Procedia PDF Downloads 24
100 Investigating Differential Psychological Impact of Translated Movies: An Experimental Design

Authors: Sonakshi Saxena, Moosath Harishankar Vasudevan

Abstract:

The current study seeks to investigate the differences in the psychological impact of movies in their original and translated versions. International cinema is an exemplar of the success of globalization. The multitude of languages in the global village does not seem to impede the common cinematic goal of filmmakers across linguistic boundaries. To understand, hence, whether the psychological impact of movies, intentional or otherwise, is preserved when the original is translated into a different language, an experimental design was adopted. Multilingual participants in the age group 18-25 years were recruited for the study. Participants were randomly assigned to a control group and an experimental group, and the psychological impacts of movies were studied under two conditions: a) watching the movie in its original language, and b) watching the movie in its original language as well as in its translated version. For the second condition, the experimental group was further divided randomly into two groups to balance order effects. The major aspects of psychological impact assessed were emotional impact and attitude towards the movie. The scores were compared for the two groups. It is further discussed whether the experience is salient across languages or whether languages inherently possess the ability to alter the experiences of the audience.

Keywords: experimental design, movies, psychological impact, translation

Procedia PDF Downloads 364
99 Semantic Preference across Research Articles: A Corpus-Based Study of Adjectives in English

Authors: Valdênia Carvalho e Almeida

Abstract:

The goal of the present study is to investigate the semantic preference of the most frequent adjectives in research articles through a corpus-based analysis of texts published in journals in Applied Linguistics (AL). The corpus used in this study contains texts published in the period from 2014 to 2018 in three journals: Language Learning and Technology, English for Academic Purposes, and TESOL Quarterly, totaling more than one million words. A corpus-based analysis was carried out to identify the most frequent adjectives that co-occurred in the three journals. By observing the concordance lines of the adjectives and analyzing the words they associated with, the semantic preferences of each adjective were determined. Later, the AL corpus analysis was compared to an investigation of the same adjectives in a corpus of Chemistry. This second part of the study aimed to identify possible differences and similarities between the two corpora in relation to the use of the adjectives in research articles from both areas. The results show that there are some preferences which seem to be closely related not only to the academic genre of the texts but also to the specific domain of the discipline and, to a lesser extent, to the context of research in each journal. This research illustrates a possible contribution of Corpus Linguistics to exploring the concept of semantic preference in more detail, considering the complex nature of the phenomenon.

Keywords: applied linguistics, corpus linguistics, chemistry, research article, semantic preference

Procedia PDF Downloads 153
98 Corpus-Based Analysis on the Translatability of Conceptual Vagueness in Traditional Chinese Medicine Classics Huang Di Nei Jing

Authors: Yan Yue

Abstract:

Huang Di Nei Jing (HDNJ) is one of the significant traditional Chinese medicine (TCM) classics, laying the foundation of TCM theory and practice. It is an important work for the world to study the ancient civilizations and medical history of China. The language of HDNJ is highly concise and vague, and notably challenging to translate. This paper investigates the translatability of one particular kind of vagueness in HDNJ: conceptual vagueness, which carries Chinese philosophical and cultural connotations. The corpus tool Sketch Engine is used to provide potential online contexts and word behaviors. Two selected English translations of HDNJ, one by a TCM practitioner and one by a non-practitioner, are used to examine the frequency and distribution of linguistic features of the translations. It was found that the hypothesis about the universals of translated language (explicitation, normalisation) holds in one translation, but at the sacrifice of some original contextual connotations. Transliteration is purposefully used in the second translation to retain the original flavor, which is argued to violate the principle of relevance in communication because it yields few contextual effects and demands more processing effort from the reader. The translatability of conceptual vagueness in HDNJ is constrained by source language context and the reader’s cognitive environment.

Keywords: corpus-based translation, translatability, TCM classics, vague language

Procedia PDF Downloads 343
97 A Study on Sentiment Analysis Using Various ML/NLP Models on Historical Data of Indian Leaders

Authors: Sarthak Deshpande, Akshay Patil, Pradip Pandhare, Nikhil Wankhede, Rushali Deshmukh

Abstract:

Sentiment analysis is among the most significant tasks for any language and a key area of NLP that has recently made impressive strides. Several models and datasets are available for this task in popular and commonly used languages such as English, Russian, and Spanish. Although sentiment analysis research is extensive, it lags behind for low-resource regional languages such as Hindi and Marathi. Marathi is one of the languages included in the Indian Constitution’s 8th schedule and is the third most widely spoken language in the country, primarily spoken in the Deccan region, which encompasses Maharashtra and Goa. There is insufficient study of sentiment analysis methods for Marathi text due to the lack of available resources and information. Therefore, this project proposes the use of different ML/NLP models for the analysis of Marathi data from comments below YouTube content, tweets, or Instagram posts. We aim to achieve a short and precise analysis and summary of the related data using our dataset (dates, names, root words) and lexicons to locate exact information.

Keywords: multilingual sentiment analysis, Marathi, natural language processing, text summarization, lexicon-based approaches

Procedia PDF Downloads 37
96 Evaluation and Compression of Different Language Transformer Models for Semantic Textual Similarity Binary Task Using Minority Language Resources

Authors: Ma. Gracia Corazon Cayanan, Kai Yuen Cheong, Li Sha

Abstract:

Training a language model for a minority language has been a challenging task. The lack of available corpora to train and fine-tune state-of-the-art language models is still a challenge in the area of Natural Language Processing (NLP). Moreover, the need for high computational resources and bulk data limits the attainment of this task. In this paper, we present the following contributions: (1) we introduce and use a translation pair set of Tagalog and English (TL-EN) in pre-training a language model for a minority language resource; (2) we fine-tune and evaluate top-ranking pre-trained semantic textual similarity binary task (STSB) models on both TL-EN and STS dataset pairs; (3) we then reduce the size of the model to offset the need for high computational resources. Based on our results, the models that were pre-trained on translation pairs and STS pairs can perform well on the STSB task. Also, reducing the model to a smaller dimension has no negative effect on performance and instead notably increases the similarity scores. Moreover, pre-training on a similar dataset has a tremendous effect on the model’s performance scores.

Keywords: semantic matching, semantic textual similarity binary task, low resource minority language, fine-tuning, dimension reduction, transformer models

Procedia PDF Downloads 175
95 Statistical Comparison of Machine and Manual Translation: A Corpus-Based Study of Gone with the Wind

Authors: Yanmeng Liu

Abstract:

This article analyzes and compares the linguistic differences between machine translation and manual translation through a case study of the book Gone with the Wind. As an important carrier of human feeling and thinking, literary translation poses a huge difficulty for machine translation, which is expected to show translation features distinct from those of manual translation. In order to display linguistic features objectively, computerized and statistical evidence is tentatively applied to the systematic, quantitative investigation of large-scale translation corpora. This study compiles a bilingual corpus with four Chinese translations of Gone with the Wind, namely, Piao by Chunhai Fan, Piao by Huairen Huang, and the translations by Google Translation and Baidu Translation. After processing the corpus with Stanford Segmenter, Stanford Postagger, AntConc, and other software, the study analyzes the linguistic data and answers the following questions: 1. How does machine translation differ from manual translation linguistically? 2. Why do these deviances happen? This paper combines translation studies with corpus linguistics and makes concrete the divergent linguistic dimensions in translated text analysis, in order to present linguistic deviances in manual and machine translation. Consequently, this study provides a more accurate and more fine-grained understanding of machine translation products, and it also proposes several suggestions for future machine translation development.

Keywords: corpus-based analysis, linguistic deviances, machine translation, statistical evidence

Procedia PDF Downloads 112
94 Effects of Bilingual Education in the Teaching and Learning Practices in the Continuous Improvement and Development of k12 Program

Authors: Miriam Sebastian

Abstract:

This research focused on the effects of bilingual education as a medium of instruction on the academic performance of selected intermediate students of Miriam’s Academy of Valenzuela Inc. An experimental design was used, with language of instruction as the independent variable and the different literacy skills as dependent variables. The sample comprised 30 students who were exposed to bilingual education (Filipino and English). They were given pretests and were divided into three groups: Monolingual Filipino, Monolingual English, and Bilingual. They were taught different literacy skills for eight weeks and were then administered the posttests. Data were analyzed and evaluated in the light of the central processing and script-dependent hypotheses. Based on the data, it can be inferred that monolingual instruction in either Filipino or English had a stronger effect on the students’ literacy skills than bilingual instruction. Moreover, mother tongue-based instruction, as compared to second-language instruction, had a stronger effect on the preschoolers’ literacy skills. Such results have implications not only for mother tongue-based (MTB) but also for English as a second language (ESL) instruction in the country.

Keywords: bilingualism, effects, monolingual, function, multilingual, mother tongue

Procedia PDF Downloads 102
93 Engagement Resources Use by Expert and Novice EFL Academic Writers

Authors: Moharram Sharifi

Abstract:

The purpose of this study was to show how expert and novice writers take positions and stances in the Introductions of Research Articles (RAs) and Master of Arts theses (MATs). Engagement resources were investigated in 30 Research Articles and 30 Master of Arts theses written by Iranian non-native speakers. A paired samples t-test showed that the mean occurrences of heteroglossic items in both RA and MAT Introductions were larger than those of monoglossic items, indicating that both groups of writers are aware of the need to ‘engage’ alternative positions in Introduction sections. The results also revealed that expansive choices were preferred over contractive options in both corpora, implying that both groups of writers cautiously respect alternative voices by welcoming rather than closing down the possibility of different perspectives and stances. Furthermore, unlike novice academic writers, who used more Attribute features than Entertainment ones in their MAT Introductions, expert academic writers employed a balanced number of Entertainment and Attribute features in their RA Introductions. This balanced deployment of Entertainment and Attribute features by expert writers might be characteristic of the writers’ demonstration of politeness, which is commonly accepted as an essential feature of academic writing discourse. Finally, qualitative analysis demonstrated that MAT writers, as novice academic writers, lacked appropriate evaluative stances and authorial voices toward propositions.
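
For readers unfamiliar with the paired samples t-test mentioned above, the statistic is the mean of the per-text differences divided by its standard error. The following is a minimal sketch using only the Python standard library; the counts are hypothetical and are not the study's data.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-Introduction counts, invented for illustration only.
heteroglossic = [12, 15, 9, 14, 11]
monoglossic = [7, 9, 8, 10, 6]

t_stat, df = paired_t(heteroglossic, monoglossic)
print(f"t = {t_stat:.3f}, df = {df}")
```

A p-value would then be read from the t distribution with `df` degrees of freedom, which the standard library does not provide directly.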

Keywords: novice, expert, engagement, RA Introductions, MA Thesis

Procedia PDF Downloads 14
92 The Development of Chinese-English Homophonic Word Pairs Databases for English Teaching and Learning

Authors: Yuh-Jen Wu, Chun-Min Lin

Abstract:

Homophonic words are common in Mandarin Chinese, which is a tonal language. Using homophonic cues to study foreign languages is a mnemonic learning technique that can aid the retention and retrieval of information in human memory. When learning difficult foreign words, some learners transpose them with words in a language they are familiar with to build an association and strengthen working memory. These phonological clues are beneficial for novice language learners. In the classroom, if mnemonic skills are used at the appropriate time in the instructional sequence, they may achieve their maximum effectiveness. For Chinese-speaking students, proper use of Chinese-English homophonic word pairs may help them learn difficult vocabulary. In this study, a database program is developed in Visual Basic. The database contains two corpora, one with Chinese lexical items and the other with English ones. The Chinese corpus contains 59,053 Chinese words that were collected by a web crawler. The pronunciations of these words are compared with words in an English corpus based on WordNet, a lexical database for the English language. Words in both databases with similar pronunciation chunks and batches are detected. A total of approximately 1,000 Chinese lexical items are located in the preliminary comparison. These homophonic word pairs can serve as a valuable tool to assist Chinese-speaking students in learning and memorizing new English vocabulary.
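
The pronunciation comparison described above can be sketched roughly as follows. This is not the authors' Visual Basic program: the tiny lexicons, the simplified romanizations (tones dropped), the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicons mapping each word to a rough romanized pronunciation.
CHINESE = {"咖啡": "kafei", "沙發": "shafa", "巴士": "bashi"}
ENGLISH = {"coffee": "kofi", "sofa": "sofa", "bus": "bas", "table": "teibl"}


def similarity(a, b):
    """String similarity in [0, 1] between two pronunciation strings."""
    return SequenceMatcher(None, a, b).ratio()


def find_pairs(chinese, english, threshold=0.6):
    """Return (Chinese word, English word, score) for pairs at or above threshold."""
    pairs = []
    for zh, zh_pron in chinese.items():
        for en, en_pron in english.items():
            score = similarity(zh_pron, en_pron)
            if score >= threshold:
                pairs.append((zh, en, round(score, 2)))
    # Highest-scoring candidate pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


pairs = find_pairs(CHINESE, ENGLISH)
for zh, en, score in pairs:
    print(zh, en, score)
```

A realistic system would compare phoneme sequences rather than spelling-like strings, so the scores here only indicate the general shape of the matching step.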

Keywords: Chinese, corpus, English, homophonic words, vocabulary

Procedia PDF Downloads 147
91 A Study of Transferable Strategies in Multilanguage Learning

Authors: Zixi You

Abstract:

With the demand for multilingual speakers increasing in the job market, multi-language learning programs have become more and more popular among undergraduate students. A study of multi-language learning strategies is therefore in high demand on both practical and theoretical levels. Based on previous classifications of learning strategies in SLA, and an investigation of BA Modern Language program students (with post-A level L2 and ab initio L3 learning experience from year one), this study explores and compares different types of learning strategies used by multi-language speakers and learners, transferable learning strategies between L2 and L3, and factors affecting the transfer. The results indicate that all 23 types of L2 learning strategies are employed when learning L3 from ab initio level, yet with different tendencies. Learning strategy transfer from L2 to L3 (i.e., the learners attribute the application of these L3 learning strategies directly to their L2 learning experience) is observed in all 23 types of learning strategies. Comparatively, six types of “cognitive strategies” have a higher transfer tendency than others. With regard to the failure to transfer some particular L2 strategies and the development of independent L3 strategies by individual learners, factors such as language proficiency, language typology, and learning environment have played important roles, among others. The presentation of this study will provide audiences with detailed data, insightful analysis, and discussion of both theoretical and practical aspects of multi-language learning that will benefit both students and educators.

Keywords: learning strategy, multi-language acquisition, second language acquisition, strategy transfer

Procedia PDF Downloads 544
90 One-Shot Text Classification with Multilingual-BERT

Authors: Hsin-Yang Wang, K. M. A. Salam, Ying-Jia Lin, Daniel Tan, Tzu-Hsuan Chou, Hung-Yu Kao

Abstract:

Detecting user intent from natural language expressions has a wide variety of use cases in different natural language processing applications. Recently, few-shot training has seen a spike in usage in commercial domains. Due to the lack of significant sample features, downstream task performance has been limited or has led to unstable results across different domains. As a state-of-the-art method, the pre-trained BERT model, which gathers sentence-level information from a large text corpus, shows improvement on several NLP benchmarks. In this research, we propose a method that changes multi-class classification tasks into binary classification tasks and then uses the confidence score to rank the results. As a language model, BERT performs well on sequence data. In our experiment, we change the objective from predicting labels to finding the relations between words in sequence data. Our proposed method achieved 71.0% accuracy on the internal intent detection dataset and 63.9% accuracy on the HuffPost dataset. Acknowledgment: This work was supported by NCKU-B109-K003, which is the collaboration between National Cheng Kung University, Taiwan, and SoftBank Corp., Tokyo.
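
The reformulation described above, from one multi-class decision to several binary decisions ranked by confidence, can be sketched as follows. The abstract does not specify the model interface, so the scorer here is a toy token-overlap function standing in for a BERT-based binary classifier; the label names, label descriptions, and function names are hypothetical.

```python
def toy_confidence(utterance, label_description):
    """Stand-in binary scorer: Jaccard overlap of tokens (NOT a BERT model)."""
    u = set(utterance.lower().split())
    d = set(label_description.lower().split())
    union = u | d
    return len(u & d) / len(union) if union else 0.0


def rank_labels(utterance, label_descriptions, scorer=toy_confidence):
    """Score one (utterance, label) pair per candidate label, then rank by confidence."""
    scored = [
        (label, scorer(utterance, description))
        for label, description in label_descriptions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical intent labels with short natural-language descriptions.
LABELS = {
    "check_balance": "check my account balance",
    "book_flight": "book a flight ticket",
    "play_music": "play a song or music",
}

ranked = rank_labels("please book a flight to tokyo", LABELS)
for label, confidence in ranked:
    print(f"{label}: {confidence:.2f}")
```

Because each label is scored independently, a label with only one example can be added by writing a new pair rather than retraining a fixed-size output layer, which is one plausible reason the pairwise framing suits one-shot settings.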

Keywords: OSML, BERT, text classification, one shot

Procedia PDF Downloads 78