Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 15586

Search results for: large language models (LLMS)

15556 On the Existence of Homotopic Mapping Between Knowledge Graphs and Graph Embeddings

Abstract:

Knowledge Graphs KG) and their relation to Graph Embeddings (GE) represent a unique data structure in the landscape of machine learning (relative to image, text and acoustic data). Unlike the latter, GEs are the only data structure sufficient for representing hierarchically dense, semantic information needed for use-cases like supply chain data and protein folding where the search space exceeds the limits traditional search methods (e.g. page-rank, Dijkstra, etc.). While GEs are effective for compressing low rank tensor data, at scale, they begin to introduce a new problem of ’data retreival’ which we observe in Large Language Models. Notable attempts by transE, TransR and other prominent industry standards have shown a peak performance just north of 57% on WN18 and FB15K benchmarks, insufficient practical industry applications. They’re also limited, in scope, to next node/link predictions. Traditional linear methods like Tucker, CP, PARAFAC and CANDECOMP quickly hit memory limits on tensors exceeding 6.4 million nodes. This paper outlines a topological framework for linear mapping between concepts in KG space and GE space that preserve cardinality. Most importantly we introduce a traceable framework for composing dense linguistic strcutures. We demonstrate performance on WN18 benchmark this model hits. This model does not rely on Large Langauge Models (LLM) though the applications are certainy relevant here as well.

Keywords: representation theory, large language models, graph embeddings, applied algebraic topology, applied knot theory, combinatorics

Procedia PDF Downloads 45

15555 Bridging the Data Gap for Sexism Detection in Twitter: A Semi-Supervised Approach

Authors: Adeep Hande, Shubham Agarwal

Abstract:

This paper presents a study on identifying sexism in online texts using various state-of-the-art deep learning models based on BERT. We experimented with different feature sets and model architectures and evaluated their performance using precision, recall, F1 score, and accuracy metrics. We also explored the use of pseudolabeling technique to improve model performance. Our experiments show that the best-performing models were based on BERT, and their multilingual model achieved an F1 score of 0.83. Furthermore, the use of pseudolabeling significantly improved the performance of the BERT-based models, with the best results achieved using the pseudolabeling technique. Our findings suggest that BERT-based models with pseudolabeling hold great promise for identifying sexism in online texts with high accuracy.

Keywords: large language models, semi-supervised learning, sexism detection, data sparsity

Procedia PDF Downloads 40

15554 A BERT-Based Model for Financial Social Media Sentiment Analysis

Authors: Josiel Delgadillo, Johnson Kinyua, Charles Mutigwe

Abstract:

The purpose of sentiment analysis is to determine the sentiment strength (e.g., positive, negative, neutral) from a textual source for good decision-making. Natural language processing in domains such as financial markets requires knowledge of domain ontology, and pre-trained language models, such as BERT, have made significant breakthroughs in various NLP tasks by training on large-scale un-labeled generic corpora such as Wikipedia. However, sentiment analysis is a strong domain-dependent task. The rapid growth of social media has given users a platform to share their experiences and views about products, services, and processes, including financial markets. StockTwits and Twitter are social networks that allow the public to express their sentiments in real time. Hence, leveraging the success of unsupervised pre-training and a large amount of financial text available on social media platforms could potentially benefit a wide range of financial applications. This work is focused on sentiment analysis using social media text on platforms such as StockTwits and Twitter. To meet this need, SkyBERT, a domain-specific language model pre-trained and fine-tuned on financial corpora, has been developed. The results show that SkyBERT outperforms current state-of-the-art models in financial sentiment analysis. Extensive experimental results demonstrate the effectiveness and robustness of SkyBERT.

Keywords: BERT, financial markets, Twitter, sentiment analysis

Procedia PDF Downloads 124

15553 Language Switching Errors of Bilinguals: Role of Top down and Bottom up Process

Authors: Numra Qayyum, Samina Sarwat, Noor ul Ain

Abstract:

Bilingual speakers generally can speak both languages with the same competency without mixing them intentionally and making mistakes, but sometimes errors occur in language selection. This quantitative study particularly deals with the language errors made by Urdu-English bilinguals. In this research, researchers have given special attention to the part played by bottom-up priming and top-down cognitive control in these errors. Unstable Urdu-English bilingual participants termed pictures and were prompted to shift from one language to another under the pressure of time. Different situations were given to manipulate the participants. The long and short runs trials of the same language were also given before switching to another language. The study is concluded with the findings that bilinguals made more errors when switching to the first language from their second language, and these errors are large in number, especially when a speaker is switching from L2 (second language) to L1 (first language) after a long run. When the switching is reversed, i.e., from L2 to LI, it had no effect at all. These results gave the clear responsibility of all these errors to top-down cognitive control.

Keywords: bottom up priming, language error, language switching, top down cognitive control

Procedia PDF Downloads 109

15552 Emerging Virtual Linguistic Landscape Created by Members of Language Community in TikTok

Authors: Kai Zhu, Shanhua He, Yujiao Chang

Abstract:

This paper explores the virtual linguistic landscape of an emerging virtual language community in TikTok, a language community realizing immediate and non-immediate communication without a precise Spatio-temporal domain or a specific socio-cultural boundary or interpersonal network. This kind of language community generates a large number and various forms of virtual linguistic landscape, with which we conducted a virtual ethnographic survey together with telephone interviews to collect data from coping. We have been following two language communities in TikTok for several months so that we can illustrate the composition of the two language communities and some typical virtual language landscapes in both language communities first. Then we try to explore the reasons why and how they are formed through the organization, transcription, and analysis of the interviews. Our analysis reveals the richness and diversity of the virtual linguistic landscape, and finally, we summarize some of the characteristics of this language community.

Keywords: virtual linguistic landscape, virtual language community, virtual ethnographic survey, TikTok

Procedia PDF Downloads 76

15551 Native Language Identification with Cross-Corpus Evaluation Using Social Media Data: ’Reddit’

Authors: Yasmeen Bassas, Sandra Kuebler, Allen Riddell

Abstract:

Native language identification is one of the growing subfields in natural language processing (NLP). The task of native language identification (NLI) is mainly concerned with predicting the native language of an author’s writing in a second language. In this paper, we investigate the performance of two types of features; content-based features vs. content independent features, when they are evaluated on a different corpus (using social media data “Reddit”). In this NLI task, the predefined models are trained on one corpus (TOEFL), and then the trained models are evaluated on different data using an external corpus (Reddit). Three classifiers are used in this task; the baseline, linear SVM, and logistic regression. Results show that content-based features are more accurate and robust than content independent ones when tested within the corpus and across corpus.

Keywords: NLI, NLP, content-based features, content independent features, social media corpus, ML

Procedia PDF Downloads 105

15550 Recurrent Neural Networks with Deep Hierarchical Mixed Structures for Chinese Document Classification

Authors: Zhaoxin Luo, Michael Zhu

Abstract:

In natural languages, there are always complex semantic hierarchies. Obtaining the feature representation based on these complex semantic hierarchies becomes the key to the success of the model. Several RNN models have recently been proposed to use latent indicators to obtain the hierarchical structure of documents. However, the model that only uses a single-layer latent indicator cannot achieve the true hierarchical structure of the language, especially a complex language like Chinese. In this paper, we propose a deep layered model that stacks arbitrarily many RNN layers equipped with latent indicators. After using EM and training it hierarchically, our model solves the computational problem of stacking RNN layers and makes it possible to stack arbitrarily many RNN layers. Our deep hierarchical model not only achieves comparable results to large pre-trained models on the Chinese short text classification problem but also achieves state of art results on the Chinese long text classification problem.

Keywords: nature language processing, recurrent neural network, hierarchical structure, document classification, Chinese

Procedia PDF Downloads 37

15549 Challenges of Teaching English Language in Polytechnics

Authors: Jyoti Sanjay Pathrikar

Abstract:

The 21st century is marked by increased industrialization and a great spurt of technical institutes in almost all parts of the country. In this changing scenario, teaching English language to the students of polytechnic institutes, situated in the small towns of the country is a great challenge as well as responsibility. The learners have very strong vernacular roots and their adaptation to the English language is really slow, as a result teaching English language to them is a herculean task. The students of polytechnics get admission despite of low grades, the base of English has to be prepared at the plus two level, the influence of the local language looms large and the reluctance to learn the English language is obvious. However, the needs of the industries have to be kept in mind and the prospective engineers have to be taught the language. There is an urgent need to devise new ways of teaching the language keeping in mind the requirements of the industry, the capability of the students and maintaining the sanctity of the language. A way has to be carved out.

Keywords: industrialization, herculean, prospective, sanctity, vernacular

Procedia PDF Downloads 411

15548 Computational Linguistic Implications of Gender Bias: Machines Reflect Misogyny in Society

Authors: Irene Yi

Abstract:

Machine learning, natural language processing, and neural network models of language are becoming more and more prevalent in the fields of technology and linguistics today. Training data for machines are at best, large corpora of human literature and at worst, a reflection of the ugliness in society. Computational linguistics is a growing field dealing with such issues of data collection for technological development. Machines have been trained on millions of human books, only to find that in the course of human history, derogatory and sexist adjectives are used significantly more frequently when describing females in history and literature than when describing males. This is extremely problematic, both as training data, and as the outcome of natural language processing. As machines start to handle more responsibilities, it is crucial to ensure that they do not take with them historical sexist and misogynistic notions. This paper gathers data and algorithms from neural network models of language having to deal with syntax, semantics, sociolinguistics, and text classification. Computational analysis on such linguistic data is used to find patterns of misogyny. Results are significant in showing the existing intentional and unintentional misogynistic notions used to train machines, as well as in developing better technologies that take into account the semantics and syntax of text to be more mindful and reflect gender equality. Further, this paper deals with the idea of non-binary gender pronouns and how machines can process these pronouns correctly, given its semantic and syntactic context. This paper also delves into the implications of gendered grammar and its effect, cross-linguistically, on natural language processing. Languages such as French or Spanish not only have rigid gendered grammar rules, but also historically patriarchal societies. The progression of society comes hand in hand with not only its language, but how machines process those natural languages. These ideas are all extremely vital to the development of natural language models in technology, and they must be taken into account immediately.

Keywords: computational analysis, gendered grammar, misogynistic language, neural networks

Procedia PDF Downloads 90

15547 Transportation Language Register as One of Language Community

Authors: Diyah Atiek Mustikawati

Abstract:

Language register refers to a variety of a language used for particular purpose or in a particular social setting. Language register also means as a concept of adapting one’s use of language to conform to standards or tradition in a given professional or social situation. This descriptive study tends to discuss about the form of language register in transportation aspect, factors, also the function of use it. Mostly, language register in transportation aspect uses short sentences in form of informal register. The factor caused language register used are speaker, word choice, background of language. The functions of language register in transportations aspect are to make communication between crew easily, also to keep safety when they were in bad condition. Transportation language register developed naturally as one of variety of language used.

Keywords: language register, language variety, communication, transportation

Procedia PDF Downloads 444

15546 Probing Syntax Information in Word Representations with Deep Metric Learning

Authors: Bowen Ding, Yihao Kuang

Abstract:

In recent years, with the development of large-scale pre-trained lan-guage models, building vector representations of text through deep neural network models has become a standard practice for natural language processing tasks. From the performance on downstream tasks, we can know that the text representation constructed by these models contains linguistic information, but its encoding mode and extent are unclear. In this work, a structural probe is proposed to detect whether the vector representation produced by a deep neural network is embedded with a syntax tree. The probe is trained with the deep metric learning method, so that the distance between word vectors in the metric space it defines encodes the distance of words on the syntax tree, and the norm of word vectors encodes the depth of words on the syntax tree. The experiment results on ELMo and BERT show that the syntax tree is encoded in their parameters and the word representations they produce.

Keywords: deep metric learning, syntax tree probing, natural language processing, word representations

Procedia PDF Downloads 34

15545 Document-level Sentiment Analysis: An Exploratory Case Study of Low-resource Language Urdu

Authors: Ammarah Irum, Muhammad Ali Tahir

Abstract:

Document-level sentiment analysis in Urdu is a challenging Natural Language Processing (NLP) task due to the difficulty of working with lengthy texts in a language with constrained resources. Deep learning models, which are complex neural network architectures, are well-suited to text-based applications in addition to data formats like audio, image, and video. To investigate the potential of deep learning for Urdu sentiment analysis, we implemented five different deep learning models, including Bidirectional Long Short Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short Term Memory (CNN-BiLSTM), and Bidirectional Encoder Representation from Transformer (BERT). In this study, we developed a hybrid deep learning model called BiLSTM-Single Layer Multi Filter Convolutional Neural Network (BiLSTM-SLMFCNN) by fusing BiLSTM and CNN architecture. The proposed and baseline techniques are applied on Urdu Customer Support data set and IMDB Urdu movie review data set by using pre-trained Urdu word embedding that are suitable for sentiment analysis at the document level. Results of these techniques are evaluated and our proposed model outperforms all other deep learning techniques for Urdu sentiment analysis. BiLSTM-SLMFCNN outperformed the baseline deep learning models and achieved 83%, 79%, 83% and 94% accuracy on small, medium and large sized IMDB Urdu movie review data set and Urdu Customer Support data set respectively.

Keywords: urdu sentiment analysis, deep learning, natural language processing, opinion mining, low-resource language

Procedia PDF Downloads 37

15544 A Comparative Study of Approaches in User-Centred Health Information Retrieval

Authors: Harsh Thakkar, Ganesh Iyer

Abstract:

In this paper, we survey various user-centered or context-based biomedical health information retrieval systems. We present and discuss the performance of systems submitted in CLEF eHealth 2014 Task 3 for this purpose. We classify and focus on comparing the two most prevalent retrieval models in biomedical information retrieval namely: Language Model (LM) and Vector Space Model (VSM). We also report on the effectiveness of using external medical resources and ontologies like MeSH, Metamap, UMLS, etc. We observed that the LM based retrieval systems outperform VSM based systems on various fronts. From the results we conclude that the state-of-art system scores for MAP was 0.4146, P@10 was 0.7560 and NDCG@10 was 0.7445, respectively. All of these score were reported by systems built on language modeling approaches.

Keywords: clinical document retrieval, concept-based information retrieval, query expansion, language models, vector space models

Procedia PDF Downloads 291

15543 A Mathematical Agent-Based Model to Examine Two Patterns of Language Change

Authors: Gareth Baxter

Abstract:

We use a mathematical model of language change to examine two recently observed patterns of language change: one in which most speakers change gradually, following the mean of the community change, and one in which most individuals use predominantly one variant or another, and change rapidly if they change at all. The model is based on Croft’s Utterance Selection account of language change, which views language change as an evolutionary process, in which different variants (different ‘ways of saying the same thing’) compete for usage in a population of speakers. Language change occurs when a new variant replaces an older one as the convention within a given population. The present model extends a previous simpler model to include effects related to speaker aging and interspeaker variation in behaviour. The two patterns of individual change (one more centralized and the other more polarized) were recently observed in historical language changes, and it was further observed that slower changes were more associated with the centralized pattern, while quicker changes were more polarized. Our model suggests that the two patterns of change can be explained by different balances between the preference of speakers to use one variant over another and the degree of accommodation to (propensity to adapt towards) other speakers. The correlation with the rate of change appears naturally in our model, and results from the fact that both differential weighting of variants and the degree of accommodation affect the time for change to occur, while also determining the patterns of change. This work represents part of an ongoing effort to examine phenomena in language change through the use of mathematical models. This offers another way to evaluate qualitative explanations that cannot be practically tested (or cannot be tested at all) in a real-world, large-scale speech community.

Keywords: agent based modeling, cultural evolution, language change, social behavior modeling, social influence

Procedia PDF Downloads 209

15542 Dialect as a Means of Identification among Hausa Speakers

Authors: Hassan Sabo

Abstract:

Language is a system of conventionally spoken, manual and written symbols by human beings that members of a certain social group and participants in its culture express themselves. Communication, expression of identity and imaginative expression are among the functions of language. Dialect is a form of language, or a regional variety of language that is spoken in a particular geographical setting by a particular group of people. Hausa is one of the major languages in Africa, in terms of large number of people for whom it is the first language. Hausa is one of the western Chadic groups of languages. It constitutes one of the five or six branches of Afro-Asiatic family. The predominant Hausa speakers are in Nigeria and they live in different geographical locations which resulted to variety of dialects within the Hausa language apart of the standard Hausa language, the Hausa language has a variety of dialect that distinguish from one another by such features as phonology, grammar and vocabulary. This study intends to examine such features that serve as means of identification among Hausa speakers who are set off from others, geographically or socially.

Keywords: dialect, features, geographical location, Hausa language

Procedia PDF Downloads 166

15541 Recurrent Patterns of Netspeak among Selected Nigerians on WhatsApp Platform: A Quest for Standardisation

Authors: Lily Chimuanya, Esther Ajiboye, Emmanuel Uba

Abstract:

One of the consequences of online communication is the birth of new orthography genres characterised by novel conventions of abbreviation and acronyms usually referred to as Netspeak. Netspeak, also known as internet slang, is a style of writing mainly used in online communication to limit the length of text characters and to save time. The aim of this study is to evaluate how second language users of the English language have internalised this new convention of writing; identify the recurrent patterns of Netspeak; and assess the consistency of the use of the identified patterns in relation to their meanings. The study is corpus-based, and data drawn from WhatsApp chart pages of selected groups of Nigerian English speakers show a large occurrence of inconsistencies in the patterns of Netspeak and their meanings. The study argues that rather than emphasise the negative impact of Netspeak on the communicative competence of second language users, studies should focus on suggesting models as yardsticks for standardising the usage of Netspeak and indeed all other emerging language conventions resulting from online communication. This stance stems from the inevitable global language transformation that is eminent with the coming of age of information technology.

Keywords: abbreviation, acronyms, Netspeak, online communication, standardisation

Procedia PDF Downloads 354

15540 Evaluation of Modern Natural Language Processing Techniques via Measuring a Company's Public Perception

Authors: Burak Oksuzoglu, Savas Yildirim, Ferhat Kutlu

Abstract:

Opinion mining (OM) is one of the natural language processing (NLP) problems to determine the polarity of opinions, mostly represented on a positive-neutral-negative axis. The data for OM is usually collected from various social media platforms. In an era where social media has considerable control over companies’ futures, it’s worth understanding social media and taking actions accordingly. OM comes to the fore here as the scale of the discussion about companies increases, and it becomes unfeasible to gauge opinion on individual levels. Thus, the companies opt to automize this process by applying machine learning (ML) approaches to their data. For the last two decades, OM or sentiment analysis (SA) has been mainly performed by applying ML classification algorithms such as support vector machines (SVM) and Naïve Bayes to a bag of n-gram representations of textual data. With the advent of deep learning and its apparent success in NLP, traditional methods have become obsolete. Transfer learning paradigm that has been commonly used in computer vision (CV) problems started to shape NLP approaches and language models (LM) lately. This gave a sudden rise to the usage of the pretrained language model (PTM), which contains language representations that are obtained by training it on the large datasets using self-supervised learning objectives. The PTMs are further fine-tuned by a specialized downstream task dataset to produce efficient models for various NLP tasks such as OM, NER (Named-Entity Recognition), Question Answering (QA), and so forth. In this study, the traditional and modern NLP approaches have been evaluated for OM by using a sizable corpus belonging to a large private company containing about 76,000 comments in Turkish: SVM with a bag of n-grams, and two chosen pre-trained models, multilingual universal sentence encoder (MUSE) and bidirectional encoder representations from transformers (BERT). The MUSE model is a multilingual model that supports 16 languages, including Turkish, and it is based on convolutional neural networks. The BERT is a monolingual model in our case and transformers-based neural networks. It uses a masked language model and next sentence prediction tasks that allow the bidirectional training of the transformers. During the training phase of the architecture, pre-processing operations such as morphological parsing, stemming, and spelling correction was not used since the experiments showed that their contribution to the model performance was found insignificant even though Turkish is a highly agglutinative and inflective language. The results show that usage of deep learning methods with pre-trained models and fine-tuning achieve about 11% improvement over SVM for OM. The BERT model achieved around 94% prediction accuracy while the MUSE model achieved around 88% and SVM did around 83%. The MUSE multilingual model shows better results than SVM, but it still performs worse than the monolingual BERT model.

Keywords: BERT, MUSE, opinion mining, pretrained language model, SVM, Turkish

Procedia PDF Downloads 114

15539 The Pen Is Mightier than the Sword: Kurdish Language Policy in Turkey

Authors: Irene Yi

Abstract:

This paper analyzes the development of Kurdish language endangerment in Turkey and Kurdish language education over time. It examines the historical context of the Turkish state, as well as reasons for the Turkish language hegemony. From a linguistic standpoint, the Kurdish language is in danger of extinction despite a large number of speakers, lest Kurdish language education is more widely promoted. The paper argues that Kurdish is no longer in a stable diglossic state; if the current trends continue, the language will lose its vitality. This paper recognizes the importance of education in preserving the language while discussing the changing political and institutional regard for Kurdish education. Lastly, the paper outlines solutions to the issue by looking at a variety of proposals, from creating a Kurdistan to merely changing the linguistic landscape in Turkey. After analysis of possible solutions in terms of realistic ability and effectiveness, the paper concludes that changing linguistic landscape and increasing Kurdish language education are the most ideal first steps in a long fight for Kurdish linguistic equality.

Keywords: endangered, Kurdish, oppression, policy

Procedia PDF Downloads 123

15538 Exploring Teachers’ Beliefs about Diagnostic Language Assessment Practices in a Large-Scale Assessment Program

Authors: Oluwaseun Ijiwade, Chris Davison, Kelvin Gregory

Abstract:

In Australia, like other parts of the world, the debate on how to enhance teachers using assessment data to inform teaching and learning of English as an Additional Language (EAL, Australia) or English as a Foreign Language (EFL, United States) have occupied the centre of academic scholarship. Traditionally, this approach was conceptualised as ‘Formative Assessment’ and, in recent times, ‘Assessment for Learning (AfL)’. The central problem is that teacher-made tests are limited in providing data that can inform teaching and learning due to variability of classroom assessments, which are hindered by teachers’ characteristics and assessment literacy. To address this concern, scholars in language education and testing have proposed a uniformed large-scale computer-based assessment program to meet the needs of teachers and promote AfL in language education. In Australia, for instance, the Victoria state government commissioned a large-scale project called 'Tools to Enhance Assessment Literacy (TEAL) for Teachers of English as an additional language'. As part of the TEAL project, a tool called ‘Reading and Vocabulary assessment for English as an Additional Language (RVEAL)’, as a diagnostic language assessment (DLA), was developed by language experts at the University of New South Wales for teachers in Victorian schools to guide EAL pedagogy in the classroom. Therefore, this study aims to provide qualitative evidence for understanding beliefs about the diagnostic language assessment (DLA) among EAL teachers in primary and secondary schools in Victoria, Australia. To realize this goal, this study raises the following questions: (a) How do teachers use large-scale assessment data for diagnostic purposes? (b) What skills do language teachers think are necessary for using assessment data for instruction in the classroom? and (c) What factors, if any, contribute to teachers’ beliefs about diagnostic assessment in a large-scale assessment? Semi-structured interview method was used to collect data from at least 15 professional teachers who were selected through a purposeful sampling. The findings from the resulting data analysis (thematic analysis) provide an understanding of teachers’ beliefs about DLA in a classroom context and identify how these beliefs are crystallised in language teachers. The discussion shows how the findings can be used to inform professional development processes for language teachers as well as informing important factor of teacher cognition in the pedagogic processes of language assessment. This, hopefully, will help test developers and testing organisations to align the outcome of this study with their test development processes to design assessment that can enhance AfL in language education.

Keywords: beliefs, diagnostic language assessment, English as an additional language, teacher cognition

Procedia PDF Downloads 171

15537 Towards Efficient Reasoning about Families of Class Diagrams Using Union Models

Authors: Tejush Badal, Sanaa Alwidian

Abstract:

Class diagrams are useful tools within the Unified Modelling Language (UML) to model and visualize the relationships between, and properties of objects within a system. As a system evolves over time and space (e.g., products), a series of models with several commonalities and variabilities create what is known as a model family. In circumstances where there are several versions of a model, examining each model individually, becomes expensive in terms of computation resources. To avoid performing redundant operations, this paper proposes an approach for representing a family of class diagrams into Union Models to represent model families using a single generic model. The paper aims to analyze and reason about a family of class diagrams using union models as opposed to individual analysis of each member model in the family. The union algorithm provides a holistic view of the model family, where the latter cannot be otherwise obtained from an individual analysis approach, this in turn, enhances the analysis performed in terms of speeding up the time needed to analyze a family of models together as opposed to analyzing individual models, one model at a time.

Keywords: analysis, class diagram, model family, unified modeling language, union model

Procedia PDF Downloads 46

15536 Neural Machine Translation for Low-Resource African Languages: Benchmarking State-of-the-Art Transformer for Wolof

Authors: Cheikh Bamba Dione, Alla Lo, Elhadji Mamadou Nguer, Siley O. Ba

Abstract:

In this paper, we propose two neural machine translation (NMT) systems (French-to-Wolof and Wolof-to-French) based on sequence-to-sequence with attention and transformer architectures. We trained our models on a parallel French-Wolof corpus of about 83k sentence pairs. Because of the low-resource setting, we experimented with advanced methods for handling data sparsity, including subword segmentation, back translation, and the copied corpus method. We evaluate the models using the BLEU score and find that transformer outperforms the classic seq2seq model in all settings, in addition to being less sensitive to noise. In general, the best scores are achieved when training the models on word-level-based units. For subword-level models, using back translation proves to be slightly beneficial in low-resource (WO) to high-resource (FR) language translation for the transformer (but not for the seq2seq) models. A slight improvement can also be observed when injecting copied monolingual text in the target language. Moreover, combining the copied method data with back translation leads to a substantial improvement of the translation quality.

Keywords: backtranslation, low-resource language, neural machine translation, sequence-to-sequence, transformer, Wolof

Procedia PDF Downloads 115

15535 Named Entity Recognition System for Tigrinya Language

Authors: Sham Kidane, Fitsum Gaim, Ibrahim Abdella, Sirak Asmerom, Yoel Ghebrihiwot, Simon Mulugeta, Natnael Ambassager

Abstract:

The lack of annotated datasets is a bottleneck to the progress of NLP in low-resourced languages. The work presented here consists of large-scale annotated datasets and models for the named entity recognition (NER) system for the Tigrinya language. Our manually constructed corpus comprises over 340K words tagged for NER, with over 118K of the tokens also having parts-of-speech (POS) tags, annotated with 12 distinct classes of entities, represented using several types of tagging schemes. We conducted extensive experiments covering convolutional neural networks and transformer models; the highest performance achieved is 88.8% weighted F1-score. These results are especially noteworthy given the unique challenges posed by Tigrinya’s distinct grammatical structure and complex word morphologies. The system can be an essential building block for the advancement of NLP systems in Tigrinya and other related low-resourced languages and serve as a bridge for cross-referencing against higher-resourced languages.

Keywords: Tigrinya NER corpus, TiBERT, TiRoBERTa, BiLSTM-CRF

Procedia PDF Downloads 67

15534 Efficient Layout-Aware Pretraining for Multimodal Form Understanding

Authors: Armineh Nourbakhsh, Sameena Shah, Carolyn Rose

Abstract:

Layout-aware language models have been used to create multimodal representations for documents that are in image form, achieving relatively high accuracy in document understanding tasks. However, the large number of parameters in the resulting models makes building and using them prohibitive without access to high-performing processing units with large memory capacity. We propose an alternative approach that can create efficient representations without the need for a neural visual backbone. This leads to an 80% reduction in the number of parameters compared to the smallest SOTA model, widely expanding applicability. In addition, our layout embeddings are pre-trained on spatial and visual cues alone and only fused with text embeddings in downstream tasks, which can facilitate applicability to low-resource of multi-lingual domains. Despite using 2.5% of training data, we show competitive performance on two form understanding tasks: semantic labeling and link prediction.

Keywords: layout understanding, form understanding, multimodal document understanding, bias-augmented attention

Procedia PDF Downloads 121

15533 Diagonal Vector Autoregressive Models and Their Properties

Authors: Usoro Anthony E., Udoh Emediong

Abstract:

Diagonal Vector Autoregressive Models are special classes of the general vector autoregressive models identified under certain conditions, where parameters are restricted to the diagonal elements in the coefficient matrices. Variance, autocovariance, and autocorrelation properties of the upper and lower diagonal VAR models are derived. The new set of VAR models is verified with empirical data and is found to perform favourably with the general VAR models. The advantage of the diagonal models over the existing models is that the new models are parsimonious, given the reduction in the interactive coefficients of the general VAR models.

Keywords: VAR models, diagonal VAR models, variance, autocovariance, autocorrelations

Procedia PDF Downloads 84

15532 Models of Bilingual Education in Majority Language Contexts: An Exploratory Study of Bilingual Programmes in Qatari Primary Schools

Authors: Fatma Al-Maadheed

Abstract:

Following an ethnographic approach this study explored bilingual programmes offered by two types of primary schools in Qatar: international and Independent schools. Qatar with its unique linguistic and socio-economic situation launched a new initiative for educatiobnal development in 2001 but with hardly any research linked to theses changes. The study reveals that the Qatari bilingual schools context was one of heteroglossia, with three codes in operation: Modern Standard Arabic, Colloquial Arabic dialects and English. The two schools adopted different models of bilingualism. The international school adopted a strict separation policy between the two languages following a monoglossic belief. The independent school was found to apply a flexible language policy. The study also highlighted the daily challnges produced from the diglossia situation in Qatar, the difference between students and teacher dialect as well as acquiring literacy in the formal language. In addition to an abscence of a clear language policy in Schools, the study brought attention to the instructional methods utilised in language teaching which are mostly associated with successful bilingual education.

Keywords: diglossia, instructional methods, language policy, qatari primary schools

Procedia PDF Downloads 451

15531 Improving Academic Literacy in the Secondary History Classroom

Authors: Wilhelmina van den Berg

Abstract:

Through intentionally developing the Register Continuum and the Functional Model of Language in the secondary history classroom, teachers can effectively build a teaching and learning cycle geared towards literacy improvement and EAL differentiation. Developing an understanding of and engaging students in the field, tenor, and tone of written and spoken language, allows students to build the foundation for greater academic achievement due to integrated literacy skills in the history classroom. Building a variety of scaffolds during lessons within these models means students can improve their academic language and communication skills.

Keywords: academic language, EAL, functional model of language, international baccalaureate, literacy skills

Procedia PDF Downloads 39

15530 CFD Simulation of a Large Scale Unconfined Hydrogen Deflagration

Authors: I. C. Tolias, A. G. Venetsanos, N. Markatos

Abstract:

In the present work, CFD simulations of a large scale open deflagration experiment are performed. Stoichiometric hydrogen-air mixture occupies a 20 m hemisphere. Two combustion models are compared and are evaluated against the experiment. The Eddy Dissipation Model and a Multi-physics combustion model which is based on Yakhot’s equation for the turbulent flame speed. The values of models’ critical parameters are investigated. The effect of the turbulence model is also examined. k-ε model and LES approach were tested.

Keywords: CFD, deflagration, hydrogen, combustion model

Procedia PDF Downloads 471

15529 Generating Insights from Data Using a Hybrid Approach

Authors: Allmin Susaiyah, Aki Härmä, Milan Petković

Abstract:

Automatic generation of insights from data using insight mining systems (IMS) is useful in many applications, such as personal health tracking, patient monitoring, and business process management. Existing IMS face challenges in controlling insight extraction, scaling to large databases, and generalising to unseen domains. In this work, we propose a hybrid approach consisting of rule-based and neural components for generating insights from data while overcoming the aforementioned challenges. Firstly, a rule-based data 2CNL component is used to extract statistically significant insights from data and represent them in a controlled natural language (CNL). Secondly, a BERTSum-based CNL2NL component is used to convert these CNLs into natural language texts. We improve the model using task-specific and domain-specific fine-tuning. Our approach has been evaluated using statistical techniques and standard evaluation metrics. We overcame the aforementioned challenges and observed significant improvement with domain-specific fine-tuning.

Keywords: data mining, insight mining, natural language generation, pre-trained language models

Procedia PDF Downloads 77

15528 A Syntactic Approach to Applied and Socio-Linguistics in Arabic Language in Modern Communications

Authors: Adeyemo Abduljeeel Taiwo

Abstract:

This research is an attempt that creates a conducive atmosphere of a phonological and morphological compendium of Arabic language in Modern Standard Arabic (MSA) for modern day communications. The research is carried out with the chief aim of grammatical analysis of the two broad fields of Arabic linguistics namely: Applied and Socio-Linguistics. It draws a pictorial record of Applied and Socio-Linguistics in Arabic phonology and morphology. Thematically, it postulates and contemplates to a large degree, the theory of concord in contemporary modern Arabic language acquisition. It utilizes an analytical method while it portrays Arabic as a Semitic language that promotes linguistics and syntax among the scholars of the fields.

Keywords: Arabic language, applied linguistics, socio-linguistics, modern communications

Procedia PDF Downloads 297

15527 Hand Motion Trajectory Analysis for Dynamic Hand Gestures Used in Indian Sign Language

Authors: Daleesha M. Viswanathan, Sumam Mary Idicula

Abstract:

Dynamic hand gestures are an intrinsic component in sign language communication. Extracting spatial temporal features of the hand gesture trajectory plays an important role in a dynamic gesture recognition system. Finding a discrete feature descriptor for the motion trajectory based on the orientation feature is the main concern of this paper. Kalman filter algorithm and Hidden Markov Models (HMM) models are incorporated with this recognition system for hand trajectory tracking and for spatial temporal classification, respectively.

Keywords: orientation features, discrete feature vector, HMM., Indian sign language

Procedia PDF Downloads 342