Search results for: semantics
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 132

Search results for: semantics

42 Intensifier as Changed from the Impolite Word in Thai

Authors: Methawee Yuttapongtada

Abstract:

Intensifier is the linguistic term and device that is generally found in different languages in order to enhance and give additional quantity, quality or emotion to the words of each language. In fact, each language in the world has both of the similar and dissimilar intensifying device. More specially, the wide variety of intensifying device is used for Thai language and one of those is usage of the impolite word or the word that used to mean something negative as intensifier. The data collection in this study was done throughout the spoken language style by collecting from intensifiers regarded as impolite words because these words as employed in the other contexts will be held as the rude, swear words or the words with negative meaning. Then, backward study to the past was done in order to consider the historical change. Explanation of the original meaning and the contexts of words use from the past till the present time were done by use of both textual documents and dictionaries available in different periods. It was found that regarding the semantics and pragmatic aspects, subjectification also is the significant motivation that changed the impolite words to intensifiers. At last, it can explain pathway of the semantic change of these very words undoubtedly. Moreover, it is found that use tendency in the impolite word or the word that used to mean something negative will more be increased and this phenomenon is commonly found in many languages in the world and results of this research may support to the belief that human language in the world is universal and the same still reflected that human has the fundamental thought as the same to each other basically.

Keywords: impolite word, intensifier, Thai, semantic change

Procedia PDF Downloads 169
41 Unraveling the Phonosignological Foundations of Human Language and Semantic Analysis of Linguistic Elements in Cross-Cultural Contexts

Authors: Mahmudjon Kuchkarov, Marufjon Kuchkarov, Mukhayyo Sobirjanova

Abstract:

The origins of human language remain a profound scientific mystery, characterized by speculative theories often lacking empirical support. This study presents findings that may illuminate the genesis of human language, emphasizing its roots in natural, systematic, and repetitive sound patterns. Also, this paper presents the phonosignological and semantic analysis of linguistic elements across various languages and cultures. By utilizing the principles of the "Human Language" theory, we analyze the symbolic, phonetic, and semantic characteristics of elements such as "A", "L", "I", "F", and "四" (pronounced /si/ in Chinese and /shi/ in Japanese). Our findings reveal that natural sounds and their symbolic representations form the foundation of language, with significant implications for understanding religious and secular myths. This paper explores the intricate relationships between these elements and their cultural connotations, particularly focusing on the concept of "descent" in the context of the phonetic sequence "A, L, I, F," and the symbolic associations of the number four with death.

Keywords: empirical research, human language, phonosignology, semantics, sound patterns, symbolism, body shape, body language, coding, Latin alphabet, merging method, natural sound, origin of language, pairing, phonetics, sound and shape production, word origin, word semantic

Procedia PDF Downloads 11
40 A Domain Specific Modeling Language Semantic Model for Artefact Orientation

Authors: Bunakiye R. Japheth, Ogude U. Cyril

Abstract:

Since the process of transforming user requirements to modeling constructs are not very well supported by domain-specific frameworks, it became necessary to integrate domain requirements with the specific architectures to achieve an integrated customizable solutions space via artifact orientation. Domain-specific modeling language specifications of model-driven engineering technologies focus more on requirements within a particular domain, which can be tailored to aid the domain expert in expressing domain concepts effectively. Modeling processes through domain-specific language formalisms are highly volatile due to dependencies on domain concepts or used process models. A capable solution is given by artifact orientation that stresses on the results rather than expressing a strict dependence on complicated platforms for model creation and development. Based on this premise, domain-specific methods for producing artifacts without having to take into account the complexity and variability of platforms for model definitions can be integrated to support customizable development. In this paper, we discuss methods for the integration capabilities and necessities within a common structure and semantics that contribute a metamodel for artifact-orientation, which leads to a reusable software layer with concrete syntax capable of determining design intents from domain expert. These concepts forming the language formalism are established from models explained within the oil and gas pipelines industry.

Keywords: control process, metrics of engineering, structured abstraction, semantic model

Procedia PDF Downloads 130
39 The Diminished Online Persona: A Semantic Change of Chinese Classifier Mei on Weibo

Authors: Hui Shi

Abstract:

This study investigates a newly emerged usage of Chinese numeral classifier mei (枚) in the cyberspace. In modern Chinese grammar, mei as a classifier should occupy the pre-nominal position, and its valid accompanying nouns are restricted to small, flat, fragile inanimate objects rather than humans. To examine the semantic change of mei, two types of data from Weibo.com were collected. First, 500 mei-included Weibo posts constructed a corpus for analyzing this classifier's word order distribution (post-nominal or pre-nominal) as well as its accompanying nouns' semantics (inanimate or human). Second, considering that mei accompanies a remarkable number of human nouns in the first corpus, the second corpus is composed of mei-involved Weibo IDs from users located in first and third-tier cities (n=8 respectively). The findings show that in the cyber community, mei frequently classifies human-related neologisms at the archaic post-normal position. Besides, the 23 to 29-year-old females as well as Weibo users from third-tier cities are the major populations who adopt mei in their user IDs for self-description and identity expression. This paper argues that the creative usage of mei gains popularity in the Chinese internet due to a humor effect. The marked word order switch and semantic misapplication combined to trigger incongruity and jocularity. This study has significance for research on Chinese cyber neologism. It may also lay a foundation for further studies on Chinese classifier change and Chinese internet communication.

Keywords: Chinese classifier, humor, neologism, semantic change

Procedia PDF Downloads 244
38 Effect of the Keyword Strategy on Lexical Semantic Acquisition: Recognition, Retention and Comprehension in an English as Second Language Context

Authors: Fatima Muhammad Shitu

Abstract:

This study seeks to investigate the effect of the keyword strategy on lexico–semantic acquisition, recognition, retention and comprehension in an ESL context. The aim of the study is to determine whether the keyword strategy can be used to enhance acquisition. As a quasi- experimental research, the objectives of the study include: To determine the extent to which the scores obtained by the subjects, who were trained on the use of the keyword strategy for acquisition, differ at the pre-tests and the post–tests and also to find out the relationship in the scores obtained at these tests levels. The sample for the study consists of 300 hundred undergraduate ESL Students in the Federal College of Education, Kano. The seventy-five lexical items for acquisition belong to the lexical field category known as register, and they include Medical, Agriculture and Photography registers (MAP). These were divided in the ratio twenty-five (25) lexical items in each lexical field. The testing technique was used to collect the data while the descriptive and inferential statistics were employed for data analysis. For the purpose of testing, the two kinds of tests administered at each test level include the WARRT (Word Acquisition, Recognition, and Retention Test) and the CCPT (Cloze Comprehension Passage Test). The results of the study revealed that there are significant differences in the scores obtained between the pre-tests, and the post–tests and there are no correlations in the scores obtained as well. This implies that the keyword strategy has effectively enhanced the acquisition of the lexical items studied.

Keywords: keyword, lexical, semantics, strategy

Procedia PDF Downloads 299
37 Efficiency of a Semantic Approach in Teaching Foreign Languages

Authors: Genady Shlomper

Abstract:

During the process of language teaching, each teacher faces some general and some specific problems. Some of these problems are mutual to all languages because they yield to the rules of cognition, conscience, perception, understanding and memory; to the physiological and psychological principles pertaining to the human race irrespective of origin and nationality. Still, every language is a distinctive system, possessing individual properties and an obvious identity, as a result of a development in specific natural, geographical, cultural and historical conditions. The individual properties emerge in the script, in the phonetics, morphology and syntax. All these problems can and should be a subject of a detailed research and scientific analysis, mainly from practical considerations and language teaching requirements. There are some formidable obstacles in the language acquisition process. Among the first to be mentioned is the existence of concepts and entire categories in foreign languages, which are absent in the language of the students. Such phenomena reflect specific ways of thinking and the world-outlook, which were shaped during the evolution. Hindi is the national language of India, which belongs to the group of Indo-Iranian languages from the Indo-European family of languages. The lecturer has gained experience in teaching Hindi language to native speakers of Uzbek, Russian and Hebrew languages. He will show the difficulties in the field of phonetics, morphology and syntax, which the students have to deal with during the acquisition of the language. In the proposed lecture the lecturer will share his experience in making the process of language teaching more efficient by using non-formal semantic approach.

Keywords: applied linguistics, foreign language teaching, language teaching methodology, semantics

Procedia PDF Downloads 342
36 Intonation Salience as an Underframe to Text Intonation Models

Authors: Tatiana Stanchuliak

Abstract:

It is common knowledge that intonation is not laid over a ready text. On the contrary, intonation forms and accompanies the text on the level of its birth in the speaker’s mind. As a result, intonation plays one of the fundamental roles in the process of transferring a thought into external speech. Intonation structure can highlight the semantic significance of textual elements and become a ranging mark in understanding the information structure of the text. Intonation functions by means of prosodic characteristics, one of which is intonation salience, whose function in texts results in making some textual elements more prominent than others. This function of intonation, therefore, performs as organizing. It helps to form the frame of key elements of the text. The study under consideration made an attempt to look into the inner nature of salience and create a sort of a text intonation model. This general goal brought to some more specific intermediate results. First, there were established degrees of salience on the level of the smallest semantic element - intonation group, as well as prosodic means of creating salience, were examined. Second, the most frequent combinations of prosodic means made it possible to distinguish patterns of salience, which then became constituent elements of a text intonation model. Third, the analysis of the predicate structure allowed to divide the whole text into smaller parts, or units, which performed a specific function in the developing of the general communicative intention. It appeared that such units can be found in any text and they have common characteristics of their intonation arrangement. These findings are certainly very important both for the theory of intonation and their practical application.

Keywords: accentuation , inner speech, intention, intonation, intonation functions, models, patterns, predicate, salience, semantics, sentence stress, text

Procedia PDF Downloads 252
35 A Blending Analysis of Metaphors and Metonymies Used to Depict the Deal of the Century by Jordanian Cartoonists

Authors: Aseel Zibin, Abdel Rahman Altakhaineh

Abstract:

This study analyses 30 cartoons depicting THE DEAL OF THE CENTURY as envisaged by two Jordanian cartoonists, namely, EmadHajjaj and Osama Hajjaj. Conceptual Blending Theory (CBT) and Multimodal Metaphor Theory (MMT) are adopted as a theoretical framework to interpret the metaphors and metonymies used in the target cartoons. The results reveal that the target domain THE DEAL OF THE CENTURY was conceptualized mainly through layered metaphors that have metonymic basis and event metaphors\allegories. Specifically, 6 groups were identified: OBJECT or a situation involving OBJECTS, situations involving HUMANS\HYBRIDS of HUMANS and OBJECTS, an ANIMAL OR situation involving an ANIMAL, hybrids of WEAPONS and humans, and event metaphors used to build a story\allegory. The target domain was also depicted via event metaphors used to build a story; some of which are embedded in the Jordanian culture, while others could be perceivable cross-culturally. The results also demonstrate that the most widely used configurations to construe the metaphors was the pictorial source–verbal target in line with Lan and Zuo (2016); the motivation was probably the greater conceptual density and concreteness of visual representation since the target is better captured verbally because of its abstractness. The use of cross-modal mappings of this type was attributed to the abstractness of the target domain, THE DEAL OF THE CENTURY, which makes it more construable via verbal cues rather than visual ones. In contrast, the source domains used were mainly concrete and thus perceivable pictorially rather than verbally.

Keywords: semiotics, cognitive semantics, metaphor, culture, blending, cartoon

Procedia PDF Downloads 149
34 True and False Cognates of Japanese, Chinese and Philippine Languages: A Contrastive Analysis

Authors: Jose Marie E. Ocdenaria, Riceli C. Mendoza

Abstract:

Culturally, languages meet, merge, share, exchange, appropriate, donate, and divide in and to and from each other. Further, this type of recurrence manifests in East Asian cultures, where language influence diffuses across geographical proximities. Historically, China has notable impacts on Japan’s culture. For instance, Japanese borrowed words from China and their way of reading and writing. This qualitative and descriptive employing contrastive analysis study addressed the true and false cognates of Japanese-Philippine languages and Chinese-Philippine languages. It involved a rich collection of data from various sources like textual pieces of evidence or corpora to gain a deeper understanding of true and false cognates between L1 and L2. Cognates of Japanese-Philippine languages and Chinese-Philippine languages were analyzed contrastively according to orthography, phonology, and semantics. The words presented were the roots; however, derivatives, reduplications, and variants of stress were included when they shed emphases on the comparison. The basis of grouping the cognates was its phonetic-semantic resemblance. Based on the analysis, it revealed that there are words which may have several types of lexical relationship. Further, the study revealed that the Japanese language has more false cognates in the Philippine languages, particularly in Tagalog and Cebuano. On the other hand, there are more true cognates of Chinese in Tagalog. It is the hope of this study to provide a significant contribution to a diverse audience. These include the teachers and learners of foreign languages such as Japanese and Chinese, future researchers and investigators, applied linguists, curricular theorists, community, and publishers.

Keywords: Contrastive Analysis, Japanese, Chinese and Philippine languages, Qualitative and descriptive study, True and False Cognates

Procedia PDF Downloads 125
33 An Event-Related Potentials Study on the Processing of English Subjunctive Mood by Chinese ESL Learners

Authors: Yan Huang

Abstract:

Event-related potentials (ERPs) technique helps researchers to make continuous measures on the whole process of language comprehension, with an excellent temporal resolution at the level of milliseconds. The research on sentence processing has developed from the behavioral level to the neuropsychological level, which brings about a variety of sentence processing theories and models. However, the applicability of these models to L2 learners is still under debate. Therefore, the present study aims to investigate the neural mechanisms underlying English subjunctive mood processing by Chinese ESL learners. To this end, English subject clauses with subjunctive moods are used as the stimuli, all of which follow the same syntactic structure, “It is + adjective + that … + (should) do + …” Besides, in order to examine the role that language proficiency plays on L2 processing, this research deals with two groups of Chinese ESL learners (18 males and 22 females, mean age=21.68), namely, high proficiency group (Group H) and low proficiency group (Group L). Finally, the behavioral and neurophysiological data analysis reveals the following findings: 1) Syntax and semantics interact with each other on the SECOND phase (300-500ms) of sentence processing, which is partially in line with the Three-phase Sentence Model; 2) Language proficiency does affect L2 processing. Specifically, for Group H, it is the syntactic processing that plays the dominant role in sentence processing while for Group L, semantic processing also affects the syntactic parsing during the THIRD phase of sentence processing (500-700ms). Besides, Group H, compared to Group L, demonstrates a richer native-like ERPs pattern, which further demonstrates the role of language proficiency in L2 processing. Based on the research findings, this paper also provides some enlightenment for the L2 pedagogy as well as the L2 proficiency assessment.

Keywords: Chinese ESL learners, English subjunctive mood, ERPs, L2 processing

Procedia PDF Downloads 119
32 Code Embedding for Software Vulnerability Discovery Based on Semantic Information

Authors: Joseph Gear, Yue Xu, Ernest Foo, Praveen Gauravaran, Zahra Jadidi, Leonie Simpson

Abstract:

Deep learning methods have been seeing an increasing application to the long-standing security research goal of automatic vulnerability detection for source code. Attention, however, must still be paid to the task of producing vector representations for source code (code embeddings) as input for these deep learning models. Graphical representations of code, most predominantly Abstract Syntax Trees and Code Property Graphs, have received some use in this task of late; however, for very large graphs representing very large code snip- pets, learning becomes prohibitively computationally expensive. This expense may be reduced by intelligently pruning this input to only vulnerability-relevant information; however, little research in this area has been performed. Additionally, most existing work comprehends code based solely on the structure of the graph at the expense of the information contained by the node in the graph. This paper proposes Semantic-enhanced Code Embedding for Vulnerability Discovery (SCEVD), a deep learning model which uses semantic-based feature selection for its vulnerability classification model. It uses information from the nodes as well as the structure of the code graph in order to select features which are most indicative of the presence or absence of vulnerabilities. This model is implemented and experimentally tested using the SARD Juliet vulnerability test suite to determine its efficacy. It is able to improve on existing code graph feature selection methods, as demonstrated by its improved ability to discover vulnerabilities.

Keywords: code representation, deep learning, source code semantics, vulnerability discovery

Procedia PDF Downloads 143
31 Incorporating Lexical-Semantic Knowledge into Convolutional Neural Network Framework for Pediatric Disease Diagnosis

Authors: Xiaocong Liu, Huazhen Wang, Ting He, Xiaozheng Li, Weihan Zhang, Jian Chen

Abstract:

The utilization of electronic medical record (EMR) data to establish the disease diagnosis model has become an important research content of biomedical informatics. Deep learning can automatically extract features from the massive data, which brings about breakthroughs in the study of EMR data. The challenge is that deep learning lacks semantic knowledge, which leads to impracticability in medical science. This research proposes a method of incorporating lexical-semantic knowledge from abundant entities into a convolutional neural network (CNN) framework for pediatric disease diagnosis. Firstly, medical terms are vectorized into Lexical Semantic Vectors (LSV), which are concatenated with the embedded word vectors of word2vec to enrich the feature representation. Secondly, the semantic distribution of medical terms serves as Semantic Decision Guide (SDG) for the optimization of deep learning models. The study evaluate the performance of LSV-SDG-CNN model on four kinds of Chinese EMR datasets. Additionally, CNN, LSV-CNN, and SDG-CNN are designed as baseline models for comparison. The experimental results show that LSV-SDG-CNN model outperforms baseline models on four kinds of Chinese EMR datasets. The best configuration of the model yielded an F1 score of 86.20%. The results clearly demonstrate that CNN has been effectively guided and optimized by lexical-semantic knowledge, and LSV-SDG-CNN model improves the disease classification accuracy with a clear margin.

Keywords: convolutional neural network, electronic medical record, feature representation, lexical semantics, semantic decision

Procedia PDF Downloads 119
30 Self-Supervised Attributed Graph Clustering with Dual Contrastive Loss Constraints

Authors: Lijuan Zhou, Mengqi Wu, Changyong Niu

Abstract:

Attributed graph clustering can utilize the graph topology and node attributes to uncover hidden community structures and patterns in complex networks, aiding in the understanding and analysis of complex systems. Utilizing contrastive learning for attributed graph clustering can effectively exploit meaningful implicit relationships between data. However, existing attributed graph clustering methods based on contrastive learning suffer from the following drawbacks: 1) Complex data augmentation increases computational cost, and inappropriate data augmentation may lead to semantic drift. 2) The selection of positive and negative samples neglects the intrinsic cluster structure learned from graph topology and node attributes. Therefore, this paper proposes a method called self-supervised Attributed Graph Clustering with Dual Contrastive Loss constraints (AGC-DCL). Firstly, Siamese Multilayer Perceptron (MLP) encoders are employed to generate two views separately to avoid complex data augmentation. Secondly, the neighborhood contrastive loss is introduced to constrain node representation using local topological structure while effectively embedding attribute information through attribute reconstruction. Additionally, clustering-oriented contrastive loss is applied to fully utilize clustering information in global semantics for discriminative node representations, regarding the cluster centers from two views as negative samples to fully leverage effective clustering information from different views. Comparative clustering results with existing attributed graph clustering algorithms on six datasets demonstrate the superiority of the proposed method.

Keywords: attributed graph clustering, contrastive learning, clustering-oriented, self-supervised learning

Procedia PDF Downloads 31
29 Metaphors Underlying Idiomatic Expressions in Trilingual Perspective: Contributions to the Teaching of Lexicon and to Materials Development

Authors: Marilei Amadeu Sabino

Abstract:

Idiomatic expressions are linguistic phraseologisms present in natural languages. Known to be metaphorical linguistic combinations, a good majority of them provide elements that reveal important cultural aspects of their linguistic community through their metaphors. With the advent of Cognitive Linguistics (more specifically of Cognitive Semantics), the metaphor ceased to be related to poetic language and rhetorical embellishment and came to be seen as part of simple everyday language, reflecting the way human beings think, act and conceive reality, i. e., a fundamental mechanism of human conceptualizations of the world. In this sense, it came to be conceived as an inevitable mechanism for representing the nature of thought and language. The speakers, in conceptualizing reality, often use metaphorically parts of the body in expressions known as somatic. Several conceptual metaphors appear to be potentially universal or near-universal, because people across the world share certain bodily experiences. In these terms, many linguistic metaphors may be identical or very similar in several languages. These similarities, according to the Theory of Conceptual Metaphor, derive from universal aspects of the human body. Thus, this research aims to investigate the nature of some metaphors underlying somatic idiomatic expressions of Portuguese, Italian and English languages, establishing a pattern of similarities and differences among them from a trilingual perspective. The analysis shows that much of the studied expressions are really structurally, semantically and metaphorically identical or similar in the three languages. These findings incite relevant discussions concerning mother and foreign language learning and aim to contribute to the teaching of phraseological Lexicon as well as to materials development in mono and multilingual perspectives.

Keywords: idiomatic expressions, materials development, metaphors, phraseological lexicon, teaching and learning

Procedia PDF Downloads 176
28 Reading High Rise Residential Development in Istanbul on the Theory of Globalization

Authors: Tuba Sari

Abstract:

One of the major transformations caused by the industrial revolution, technological developments and globalization is undoubtedly acceleration of urbanization process. Globalization, in particular, is one of the major factors that trigger this transformation. In this context, as a result of the global metropolitan city system, multifunctional rising structure forms are becoming undeniable fact of the world’s leading metropolises as the manifestation of prestige and power with different life choices, easy accessibility to services related to the era of technology. The scope of research deals with five different urban centers in İstanbul where high-rise housing is increasing dramatically after 2000’s. Therefore, the research regards multi-centered urban residential pattern being created by high-rise housing structures in the city. The methodology of the research is based on two main issue, one of them is related to sampling method of high-rise housing projects in İstanbul, while the other method of the research is based on the model of Semantics. In the framework of research hypothesis, it is aimed to prove that the character of vertical intensive structuring in Istanbul is based on seeking of different forms and images in the expressive quality, considering the production of existing high-rise buildings in residential areas in recent years. In respect to rising discourse of 'World City' in the globalizing world, it is very important to state the place of Istanbul in other developing world metropolises. In the perspective of 'World City' discourse, Istanbul has different projects concerning with globalization, international finance companies, cultural activities, mega projects, etc. In brief, the aim of this research is examining transformation forms of high-rise housing development in Istanbul within the frame of developing world cities, searching and analyzing discourse and image related to these projects.

Keywords: globalization, high-rise, housing, image

Procedia PDF Downloads 268
27 Analysis of Travel Behavior Patterns of Frequent Passengers after the Section Shutdown of Urban Rail Transit - Taking the Huaqiao Section of Shanghai Metro Line 11 Shutdown During the COVID-19 Epidemic as an Example

Authors: Hongyun Li, Zhibin Jiang

Abstract:

The travel of passengers in the urban rail transit network is influenced by changes in network structure and operational status, and the response of individual travel preferences to these changes also varies. Firstly, the influence of the suspension of urban rail transit line sections on passenger travel along the line is analyzed. Secondly, passenger travel trajectories containing multi-dimensional semantics are described based on network UD data. Next, passenger panel data based on spatio-temporal sequences is constructed to achieve frequent passenger clustering. Then, the Graph Convolutional Network (GCN) is used to model and identify the changes in travel modes of different types of frequent passengers. Finally, taking Shanghai Metro Line 11 as an example, the travel behavior patterns of frequent passengers after the Huaqiao section shutdown during the COVID-19 epidemic are analyzed. The results showed that after the section shutdown, most passengers would transfer to the nearest Anting station for boarding, while some passengers would transfer to other stations for boarding or cancel their travels directly. Among the passengers who transferred to Anting station for boarding, most of passengers maintained the original normalized travel mode, a small number of passengers waited for a few days before transferring to Anting station for boarding, and only a few number of passengers stopped traveling at Anting station or transferred to other stations after a few days of boarding on Anting station. The results can provide a basis for understanding urban rail transit passenger travel patterns and improving the accuracy of passenger flow prediction in abnormal operation scenarios.

Keywords: urban rail transit, section shutdown, frequent passenger, travel behavior pattern

Procedia PDF Downloads 63
26 Multi-source Question Answering Framework Using Transformers for Attribute Extraction

Authors: Prashanth Pillai, Purnaprajna Mangsuli

Abstract:

Oil exploration and production companies invest considerable time and efforts to extract essential well attributes (like well status, surface, and target coordinates, wellbore depths, event timelines, etc.) from unstructured data sources like technical reports, which are often non-standardized, multimodal, and highly domain-specific by nature. It is also important to consider the context when extracting attribute values from reports that contain information on multiple wells/wellbores. Moreover, semantically similar information may often be depicted in different data syntax representations across multiple pages and document sources. We propose a hierarchical multi-source fact extraction workflow based on a deep learning framework to extract essential well attributes at scale. An information retrieval module based on the transformer architecture was used to rank relevant pages in a document source utilizing the page image embeddings and semantic text embeddings. A question answering framework utilizingLayoutLM transformer was used to extract attribute-value pairs incorporating the text semantics and layout information from top relevant pages in a document. To better handle context while dealing with multi-well reports, we incorporate a dynamic query generation module to resolve ambiguities. The extracted attribute information from various pages and documents are standardized to a common representation using a parser module to facilitate information comparison and aggregation. Finally, we use a probabilistic approach to fuse information extracted from multiple sources into a coherent well record. The applicability of the proposed approach and related performance was studied on several real-life well technical reports.

Keywords: natural language processing, deep learning, transformers, information retrieval

Procedia PDF Downloads 183
25 Linguistic Misinterpretation and the Dialogue of Civilizations

Authors: Oleg Redkin, Olga Bernikova

Abstract:

Globalization and migrations have made cross-cultural contacts more frequent and intensive. Sometimes, these contacts may lead to misunderstanding between partners of communication and misinterpretations of the verbal messages that some researchers tend to consider as the 'clash of civilizations'. In most cases, reasons for that may be found in cultural and linguistic differences and hence misinterpretations of intentions and behavior. The current research examines factors of verbal and non-verbal communication that should be taken into consideration in verbal and non-verbal contacts. Language is one of the most important manifestations of the cultural code, and it is often considered as one of the special features of a civilization. The Arabic language, in particular, is commonly associated with Islam and the language and the Arab-Muslim civilization. It is one of the most important markers of self-identification for more than 200 million of native speakers. Arabic is the language of the Quran and hence the symbol of religious affiliation for more than one billion Muslims around the globe. Adequate interpretation of Arabic texts requires profound knowledge of its grammar, semantics of its vocabulary. Communicating sides who belong to different cultural groups are guided by different models of behavior and hierarchy of values, besides that the vocabulary each of them uses in the dialogue may convey different semantic realities and vary in connotations. In this context direct, literal translation in most cases cannot adequately convey the original meaning of the original message. Besides that peculiarities and diversities of the extralinguistic information, such as the body language, communicative etiquette, cultural background and religious affiliations may make the dialogue even more difficult. It is very likely that the so called 'clash of civilizations' in most cases is due to misinterpretation of counterpart's means of discourse such as language, cultural codes, and models of behavior rather than lies in basic contradictions between partners of communication. In the process of communication, one has to rely on universal values rather than focus on cultural or religious peculiarities, to take into account current linguistic and extralinguistic context.

Keywords: Arabic, civilization, discourse, language, linguistic

Procedia PDF Downloads 211
24 Semantic Search Engine Based on Query Expansion with Google Ranking and Similarity Measures

Authors: Ahmad Shahin, Fadi Chakik, Walid Moudani

Abstract:

Our study is about elaborating a potential solution for a search engine that involves semantic technology to retrieve information and display it significantly. Semantic search engines are not used widely over the web as the majorities are still in Beta stage or under construction. Many problems face the current applications in semantic search, the major problem is to analyze and calculate the meaning of query in order to retrieve relevant information. Another problem is the ontology based index and its updates. Ranking results according to concept meaning and its relation with query is another challenge. In this paper, we are offering a light meta-engine (QESM) which uses Google search, and therefore Google’s index, with some adaptations to its returned results by adding multi-query expansion. The mission was to find a reliable ranking algorithm that involves semantics and uses concepts and meanings to rank results. At the beginning, the engine finds synonyms of each query term entered by the user based on a lexical database. Then, query expansion is applied to generate different semantically analogous sentences. These are generated randomly by combining the found synonyms and the original query terms. Our model suggests the use of semantic similarity measures between two sentences. Practically, we used this method to calculate semantic similarity between each query and the description of each page’s content generated by Google. The generated sentences are sent to Google engine one by one, and ranked again all together with the adapted ranking method (QESM). Finally, our system will place Google pages with higher similarities on the top of the results. We have conducted experimentations with 6 different queries. We have observed that most ranked results with QESM were altered with Google’s original generated pages. With our experimented queries, QESM generates frequently better accuracy than Google. In some worst cases, it behaves like Google.

Keywords: semantic search engine, Google indexing, query expansion, similarity measures

Procedia PDF Downloads 415
23 Enhancing Learners' Metacognitive, Cultural and Linguistic Proficiency through Egyptian Series

Authors: Hanan Eltayeb, Reem Al Refaie

Abstract:

To be able to connect and relate to shows spoken in a foreign language, advanced learners must understand not only linguistics inferences but also cultural, metacognitive, and pragmatic connotations in colloquial Egyptian TV series. These connotations are needed to both understand the different facets of the dramas put before them, and they’re also consistently grown and formulated through watching these shows. The inferences have become a staple in the Egyptian colloquial culture over the years, making their way into day-to-day conversations as Egyptians use them to speak, relate, joke, and connect with each other, without having known one another from previous times. As for advanced learners, they need to understand these inferences not only to watch these shows, but also to be able to converse with Egyptians on a level that surpasses the formal, or standard. When faced with some of the somewhat recent shows on the Egyptian screens, learners faced challenges in understanding pragmatics, cultural, and religious background of the target language and consequently not able to interact effectively with a native speaker in real-life situations. This study aims to enhance the linguistic and cultural proficiency of learners through studying two genres of TV Colloquial Egyptian series. Study samples derived from two recent comedian and social Egyptian series ('The Seventh Neighbor' سابع جار, and 'Nelly and Sherihan' نيللي و شريهان). When learners watch such series, they are usually faced with a problem understanding inferences that have to do with social, religious, and political events that are addressed in the series. Using discourse analysis of the sematic, semantic, pragmatic, cultural, and linguistic characteristics of the target language, some major deductions were highlighted and repeated, showing a pattern in both. The research paper concludes that there are many sets of lingual and para-lingual phrases, idioms, and proverbs to be acquired and used effectively by teaching these series. The strategies adopted in the study can be applied to different types of media, like movies, TV shows, and even cartoons, to enhance student proficiency.

Keywords: Egyptian series, culture, linguistic competence, pragmatics, semantics, social

Procedia PDF Downloads 130
22 Dido: An Automatic Code Generation and Optimization Framework for Stencil Computations on Distributed Memory Architectures

Authors: Mariem Saied, Jens Gustedt, Gilles Muller

Abstract:

We present Dido, a source-to-source auto-generation and optimization framework for multi-dimensional stencil computations. It enables a large programmer community to easily and safely implement stencil codes on distributed-memory parallel architectures with Ordered Read-Write Locks (ORWL) as an execution and communication back-end. ORWL provides inter-task synchronization for data-oriented parallel and distributed computations. It has been proven to guarantee equity, liveness, and efficiency for a wide range of applications, particularly for iterative computations. Dido consists mainly of an implicitly parallel domain-specific language (DSL) implemented as a source-level transformer. It captures domain semantics at a high level of abstraction and generates parallel stencil code that leverages all ORWL features. The generated code is well-structured and lends itself to different possible optimizations. In this paper, we enhance Dido to handle both Jacobi and Gauss-Seidel grid traversals. We integrate temporal blocking to the Dido code generator in order to reduce the communication overhead and minimize data transfers. To increase data locality and improve intra-node data reuse, we coupled the code generation technique with the polyhedral parallelizer Pluto. The accuracy and portability of the generated code are guaranteed thanks to a parametrized solution. The combination of ORWL features, the code generation pattern and the suggested optimizations, make of Dido a powerful code generation framework for stencil computations in general, and for distributed-memory architectures in particular. We present a wide range of experiments over a number of stencil benchmarks.

Keywords: stencil computations, ordered read-write locks, domain-specific language, polyhedral model, experiments

Procedia PDF Downloads 117
21 Lexical-Semantic Deficits in Sinhala Speaking Persons with Post Stroke Aphasia: Evidence from Single Word Auditory Comprehension Task

Authors: D. W. M. S. Samarathunga, Isuru Dharmarathne

Abstract:

In aphasia, various levels of symbolic language processing (semantics) are affected. It is shown that Persons with Aphasia (PWA) often experience more problems comprehending some categories of words than others. The study aimed to determine lexical semantic deficits seen in Auditory Comprehension (AC) and to describe lexical-semantic deficits across six selected word categories. Thirteen (n =13) persons diagnosed with post-stroke aphasia (PSA) were recruited to perform an AC task. Foods, objects, clothes, vehicles, body parts and animals were selected as the six categories. As the test stimuli, black and white line drawings were adapted from a picture set developed for semantic studies by Snodgrass and Vanderwart. A pilot study was conducted with five (n=5) healthy nonbrain damaged Sinhala speaking adults to decide familiarity and applicability of the test material. In the main study, participants were scored based on the accuracy and number of errors shown. The results indicate similar trends of lexical semantic deficits identified in the literature confirming ‘animals’ to be the easiest category to comprehend. Mann-Whitney U test was performed to determine the association between the selected variables and the participants’ performance on AC task. No statistical significance was found between the errors and the type of aphasia reflecting similar patterns described in aphasia literature in other languages. The current study indicates the presence of selectivity of lexical semantic deficits in AC and a hierarchy was developed based on the complexity of the categories to comprehend by Sinhala speaking PWA, which might be clinically beneficial when improving language skills of Sinhala speaking persons with post-stroke aphasia. However, further studies on aphasia should be conducted with larger samples for a longer period to study deficits in Sinhala and other Sri Lankan languages (Tamil and Malay).

Keywords: aphasia, auditory comprehension, selective lexical-semantic deficits, semantic categories

Procedia PDF Downloads 244
20 Meaning Interpretation of Persian Noun-Noun Compounds: A Conceptual Blending Approach

Authors: Bahareh Yousefian, Laurel Smith Stvan

Abstract:

Linguistic structures have two facades: form and meaning. These structures could have either literal meaning or figurative meaning (although it could also depend on the context in which that structure appears). The literal meaning is understandable more easily, but for the figurative meaning, a word or concept is understood from a different word or concept. In linguistic structures with a figurative meaning, it’s more difficult to relate their forms to the meanings than structures with literal meaning. In these cases, the relationship between form and figurative meaning could be studied from different perspectives. Various linguists have been curious about what happens in someone’s mind to understand figurative meaning through the forms; they have used different perspectives and theories to explain this process. It has been studied through cognitive linguistics as well, in which mind and mental activities are really important. In this viewpoint, meaning (in other words, conceptualization) is considered a mental process. In this descriptive-analytic study, 20 Persian compound nouns with figurative meanings have been collected from the Persian-language Moeen Encyclopedic Dictionary and other sources. Examples include [“Sofreh Xaneh”] (traditional restaurant) and [“Dast Yar”] (Assistant). These were studied in a cognitive semantics framework using “Conceptual Blending Theory” which hasn’t been tested on Persian compound nouns before. It was noted that “Conceptual Blending Theory” could lead to the process of understanding the figurative meanings of Persian compound nouns. Many cognitive linguists believe that “Conceptual Blending” is not only a linguistic theory but it’s also a basic human cognitive ability that plays important roles in thought, imagination, and even everyday life as well (though unconsciously). The ability to use mental spaces and conceptual blending (which is exclusive to humankind) is such a basic but unconscious ability that we are unaware of its existence and importance. What differentiates Conceptual Blending Theory from other ways of understanding figurative meaning, are arising new semantic aspects (emergent structure) that lead to a more comprehensive and precise meaning. In this study, it was found that Conceptual Blending Theory could explain reaching the figurative meanings of Persian compound nouns from their forms, such as [talkative for compound word of “Bolbol + Zabani” (nightingale + tongue)] and [wage for compound word of “Dast + Ranj” (hand + suffering)].

Keywords: cognitive linguistics, conceptual blending, figurative meaning, Persian compound nouns

Procedia PDF Downloads 59
19 Embedded Hybrid Intuition: A Deep Learning and Fuzzy Logic Approach to Collective Creation and Computational Assisted Narratives

Authors: Roberto Cabezas H

Abstract:

The current work shows the methodology developed to create narrative lighting spaces for the multimedia performance piece 'cluster: the vanished paradise.' This empirical research is focused on exploring unconventional roles for machines in subjective creative processes, by delving into the semantics of data and machine intelligence algorithms in hybrid technological, creative contexts to expand epistemic domains trough human-machine cooperation. The creative process in scenic and performing arts is guided mostly by intuition; from that idea, we developed an approach to embed collective intuition in computational creative systems, by joining the properties of Generative Adversarial Networks (GAN’s) and Fuzzy Clustering based on a semi-supervised data creation and analysis pipeline. The model makes use of GAN’s to learn from phenomenological data (data generated from experience with lighting scenography) and algorithmic design data (augmented data by procedural design methods), fuzzy logic clustering is then applied to artificially created data from GAN’s to define narrative transitions built on membership index; this process allowed for the creation of simple and complex spaces with expressive capabilities based on position and light intensity as the parameters to guide the narrative. Hybridization comes not only from the human-machine symbiosis but also on the integration of different techniques for the implementation of the aided design system. Machine intelligence tools as proposed in this work are well suited to redefine collaborative creation by learning to express and expand a conglomerate of ideas and a wide range of opinions for the creation of sensory experiences. We found in GAN’s and Fuzzy Logic an ideal tool to develop new computational models based on interaction, learning, emotion and imagination to expand the traditional algorithmic model of computation.

Keywords: fuzzy clustering, generative adversarial networks, human-machine cooperation, hybrid collective data, multimedia performance

Procedia PDF Downloads 131
18 Recursion, Merge and Event Sequence: A Bio-Mathematical Perspective

Authors: Noury Bakrim

Abstract:

Formalization is indeed a foundational Mathematical Linguistics as demonstrated by the pioneering works. While dialoguing with this frame, we nonetheless propone, in our approach of language as a real object, a mathematical linguistics/biosemiotics defined as a dialectical synthesis between induction and computational deduction. Therefore, relying on the parametric interaction of cycles, rules, and features giving way to a sub-hypothetic biological point of view, we first hypothesize a factorial equation as an explanatory principle within Category Mathematics of the Ergobrain: our computation proposal of Universal Grammar rules per cycle or a scalar determination (multiplying right/left columns of the determinant matrix and right/left columns of the logarithmic matrix) of the transformable matrix for rule addition/deletion and cycles within representational mapping/cycle heredity basing on the factorial example, being the logarithmic exponent or power of rule deletion/addition. It enables us to propone an extension of minimalist merge/label notions to a Language Merge (as a computing principle) within cycle recursion relying on combinatorial mapping of rules hierarchies on external Entax of the Event Sequence. Therefore, to define combinatorial maps as language merge of features and combinatorial hierarchical restrictions (governing, commanding, and other rules), we secondly hypothesize from our results feature/hierarchy exponentiation on graph representation deriving from Gromov's Symbolic Dynamics where combinatorial vertices from Fe are set to combinatorial vertices of Hie and edges from Fe to Hie such as for all combinatorial group, there are restriction maps representing different derivational levels that are subgraphs: the intersection on I defines pullbacks and deletion rules (under restriction maps) then under disjunction edges H such that for the combinatorial map P belonging to Hie exponentiation by intersection there are pullbacks and projections that are equal to restriction maps RM₁ and RM₂. The model will draw on experimental biomathematics as well as structural frames with focus on Amazigh and English (cases from phonology/micro-semantics, Syntax) shift from Structure to event (especially Amazigh formant principle resolving its morphological heterogeneity).

Keywords: rule/cycle addition/deletion, bio-mathematical methodology, general merge calculation, feature exponentiation, combinatorial maps, event sequence

Procedia PDF Downloads 115
17 From Context to Text and Back Again: Teaching Toni Morrison Overseas

Authors: Helena Maragou

Abstract:

Introducing Toni Morrison’s fiction to a classroom overseas entails a significant pedagogical investment, from monitoring students’ uncertain journey through Morrison’s shifty semantics to filling in the gaps of cultural knowledge and understanding for the students to be able to relate text to context. A rewarding process, as Morrison’s works present a tremendous opportunity for transnational dialogue, an opportunity that hinges upon Toni Morrison’s bringing to the fore the untold and unspeakable lives of racial ‘Others’, but also, crucially, upon her broader critique of Western ideological hegemony. This critique is a fundamental aspect of Toni Morrison’s politics and one that appeals to young readers of Toni Morrison in Greece at a time when the questioning of institutions and ideological traditions is precipitated by regional and global change. It is more or less self-evident that to help a class of international students get aboard a Morrison novel, an instructor should begin by providing them with cultural context. These days, students’ exposure to Hollywood representations of the African American past and present, as well as the use of documentaries, photography, music videos, etc., as supplementary class material, provide a starting point, a workable historical and cultural framework for textual comprehension. The true challenge, however, lies ahead: it is one thing for students to intellectually grasp the historical hardships and traumas of Morrison’s characters and to even engage in aesthetic appreciation of Morrison’s writing; quite another to relate to her works as articulations of experiences akin to their own. The great challenge, then, is in facilitating students’ discovery of the universal Morrison, the author who speaks across cultures while voicing the untold tales of her own people; this process of discovery entails, on a pedagogical level, that students be guided through the works’ historical context, to plunge into the intricacies of Morrison’s discourse, itself an elaborate linguistic booby trap, so as to be finally brought to reconsider their own historical experiences using the lens of Morrison’s fiction. The paper will be based on experience of teaching a Toni Morrison seminar to a class of Greek students at the American College of Greece and will draw from students’ exposure and responses to Toni Morrison’s “Nobel Prize Lecture,” as well as her novels Song of Solomon and Home.

Keywords: toni morrison, international classroom, pedagogy, African American literature

Procedia PDF Downloads 71
16 The Co-Existence of Multidominance and Movement in the Syntax of Chinese Bi-Comparatives

Authors: Yaqing Hu

Abstract:

This paper puts forward a syntactic analysis involving multidominance and rightward movement in Chinese bi-comparatives, as in 'Yuehan bi Mali gao (John is taller than Mary).' It is argued here that the predicate of comparison is a shared constituent in two small clauses, namely one for the target and one for the standard; and then it moves rightward to form a degree phrase with the comparative morpheme. This proposal comes from four aspects. First, the example above can also be expressed in this way, 'A: Yuehan he Mali, shui gao? (John and Mary, who is taller?) B: Yuehan gao./Yuehan geng gao. (John is taller).' This shows that the gradable adjective is predicated of the target. In addition, according to a constraint on Chinese bi-comparatives, namely the target and the standard must be arguments of the predicate simultaneously, it is not unreasonable to assume that the gradable adjective may also be predicated of the standard. Second, subcomparatives are totally disallowed in Chinese, as in '*zhe-zhang zhuozi bi zhe-zhang yizi kuan chang. (This table is longer than this chair is wide.)' In order to save it from ungrammaticality, the target and the standard should be compared along the same dimension denoted by the gradable adjective. It may follow that in Chinese comparatives, having equal roles in the same eventuality, the target and the standard bear the same thematic relationship with the predicate of comparison. Third, verb-copy can appear in Chinese bi-comparatives, as in 'Yuehan qi ma bi Mali qi ma qi de kuai. (John rides horses faster than Mary does.)' The predicate qi seems to form a small clause with both the target and the standard. This might be supporting evidence that both the target and the standard share the predicate of comparison. Fourth, Chinese comparatives do have comparative morphemes, as in 'Yuehan bi Mali geng gao. (John is taller than Mary)', which is semantically equivalent to the first example above. Thus, it follows that one feature of Chinese comparative morphemes is that they can remain overt or covert in the syntax, which will not affect semantics. This further shows that comparative morphemes in bi-comparatives may not be able to saturate the degree argument denoted by the predicate of comparison due to its optionality in the structure. These four aspects present a challenge to the Direct Analysis used in Chinese comparatives since this approach would presume that the target and the standard somehow show independency with the predicate in the syntax. Meanwhile, this study also rejects the previous analysis of multidomiance in bi-comparatives in which the degree phrase comprised of the comparative morpheme and the gradable adjective may be shared by the standard when the comparative morpheme is covert. This syntactic analysis proposed in this study will therefore offer a different perspective of how to treat degree phrase in Chinese comparatives and may offer evidence to argue whether there is degree phrase movement in bi-comparatives as in its English counterparts.

Keywords: Chinese comparatives, degree phrase, movement, multidominance, syntactic analysis

Procedia PDF Downloads 317
15 A Critical Discourse Analysis of Corporate Annual Reports in a Cross-Cultural Perspective: Views from Grammatical Metaphor and Systemic Functional Linguistics

Authors: Antonio Piga

Abstract:

The study of language strategies in financial and corporate discourse has always been vital for understanding how companies manage to communicate effectively with a wider customer base and offers new perspectives on how companies interact with key stakeholders, not only to convey transparency and an image of trustworthiness, but also to create affiliation and attract investment. In the light of Systemic Functional Linguistics, the purpose of this study is to examine and analyse the annual reports of Asian and Western joint-stock companies involved in oil refining and power generation from the point of view of the functions and frequency of grammatical metaphors. More specifically, grammatical metaphor - through the lens of Critical Discourse Analysis (CDA) - is used as a theoretical tool for analysing a synchronic cross-cultural study of the communicative strategies adopted by Asian and Western companies to communicate social and environmental sustainability and showcase their ethical values, performance and competitiveness to local and global communities and key stakeholders. According to Systemic Functional Linguistics, grammatical metaphor can be divided into two broad areas: ideational and interpersonal. This study focuses on the first type, ideational grammatical metaphor (IGM), which includes de-adjectival and de-verbal nominalisation. The dominant and more effective grammatical tropes used by Asian and Western corporations in their annual reports were examined from both a qualitative and quantitative perspective. The aim was to categorise and explain how ideational grammatical metaphor is constructed cross-culturally and presented through structural language patterns involving re-mapping between semantics and lexico-grammatical features. The results show that although there seem to be more differences than similarities in terms of the categorisation of the ideational grammatical metaphors conceptualised in the two case studies analysed, there are more similarities than differences in terms of the occurrence, the congruence of process types and the role and function of IGM. Through the immediacy and essentialism of compacting and condensing information, IGM seems to be an important linguistic strategy adopted in the rhetoric of corporate annual reports, contributing to the ideologies and actions of companies to report and promote efficiency, profit and social and environmental sustainability, thus advocating the engagement and investment of key stakeholders.

Keywords: corporate annual reports, cross-cultural perspective, ideational grammatical metaphor, rhetoric, systemic functional linguistics

Procedia PDF Downloads 27
14 Articles, Delimitation of Speech and Perception

Authors: Nataliya L. Ogurechnikova

Abstract:

The paper aims to clarify the function of articles in the English speech and specify their place and role in the English language, taking into account the use of articles for delimitation of speech. A focus of the paper is the use of the definite and the indefinite articles with different types of noun phrases which comprise either one noun with or without attributes, such as the King, the Queen, the Lion, the Unicorn, a dimple, a smile, a new language, an unknown dialect, or several nouns with or without attributes, such as the King and Queen of Hearts, the Lion and Unicorn, a dimple or smile, a completely isolated language or dialect. It is stated that the function of delimitation is related to perception: the number of speech units in a text correlates with the way the speaker perceives and segments the denotation. The two following combinations of words the house and garden and the house and the garden contain different numbers of speech units, one and two respectively, and reveal two different perception modes which correspond to the use of the definite article in the examples given. Thus, the function of delimitation is twofold, it is related to perception and cognition, on the one hand, and, on the other hand, to grammar, if the subject of grammar is the structure of speech. Analysis of speech units in the paper is not limited by noun phrases and is amplified by discussion of peripheral phenomena which are nevertheless important because they enable to qualify articles as a syntactic phenomenon whereas they are not infrequently described in terms of noun morphology. With this regard attention is given to the history of linguistic studies, specifically to the description of English articles by Niels Haislund, a disciple of Otto Jespersen. A discrepancy is noted between the initial plan of Jespersen who intended to describe articles as a syntactic phenomenon in ‘A Modern English Grammar on Historical Principles’ and the interpretation of articles in terms of noun morphology, finally given by Haislund. Another issue of the paper is correlation between description and denotation, being a traditional aspect of linguistic studies focused on articles. An overview of relevant studies, given in the paper, goes back to the works of G. Frege, which gave rise to a series of scientific works where the meaning of articles was described within the scope of logical semantics. Correlation between denotation and description is treated in the paper as the meaning of article, i.e. a component in its semantic structure, which differs from the function of delimitation and is similar to the meaning of other quantifiers. The paper further explains why the relation between description and denotation, i.e. the meaning of English article, is irrelevant for noun morphology and has nothing to do with nominal categories of the English language.

Keywords: delimitation of speech, denotation, description, perception, speech units, syntax

Procedia PDF Downloads 230
13 Profiling Risky Code Using Machine Learning

Authors: Zunaira Zaman, David Bohannon

Abstract:

This study explores the application of machine learning (ML) for detecting security vulnerabilities in source code. The research aims to assist organizations with large application portfolios and limited security testing capabilities in prioritizing security activities. ML-based approaches offer benefits such as increased confidence scores, false positives and negatives tuning, and automated feedback. The initial approach using natural language processing techniques to extract features achieved 86% accuracy during the training phase but suffered from overfitting and performed poorly on unseen datasets during testing. To address these issues, the study proposes using the abstract syntax tree (AST) for Java and C++ codebases to capture code semantics and structure and generate path-context representations for each function. The Code2Vec model architecture is used to learn distributed representations of source code snippets for training a machine-learning classifier for vulnerability prediction. The study evaluates the performance of the proposed methodology using two datasets and compares the results with existing approaches. The Devign dataset yielded 60% accuracy in predicting vulnerable code snippets and helped resist overfitting, while the Juliet Test Suite predicted specific vulnerabilities such as OS-Command Injection, Cryptographic, and Cross-Site Scripting vulnerabilities. The Code2Vec model achieved 75% accuracy and a 98% recall rate in predicting OS-Command Injection vulnerabilities. The study concludes that even partial AST representations of source code can be useful for vulnerability prediction. The approach has the potential for automated intelligent analysis of source code, including vulnerability prediction on unseen source code. State-of-the-art models using natural language processing techniques and CNN models with ensemble modelling techniques did not generalize well on unseen data and faced overfitting issues. However, predicting vulnerabilities in source code using machine learning poses challenges such as high dimensionality and complexity of source code, imbalanced datasets, and identifying specific types of vulnerabilities. Future work will address these challenges and expand the scope of the research.

Keywords: code embeddings, neural networks, natural language processing, OS command injection, software security, code properties

Procedia PDF Downloads 94