Search results for: text and data mining
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 25334

Search results for: text and data mining

25034 Sentiment Analysis: Comparative Analysis of Multilingual Sentiment and Opinion Classification Techniques

Authors: Sannikumar Patel, Brian Nolan, Markus Hofmann, Philip Owende, Kunjan Patel

Abstract:

Sentiment analysis and opinion mining have become emerging topics of research in recent years but most of the work is focused on data in the English language. A comprehensive research and analysis are essential which considers multiple languages, machine translation techniques, and different classifiers. This paper presents, a comparative analysis of different approaches for multilingual sentiment analysis. These approaches are divided into two parts: one using classification of text without language translation and second using the translation of testing data to a target language, such as English, before classification. The presented research and results are useful for understanding whether machine translation should be used for multilingual sentiment analysis or building language specific sentiment classification systems is a better approach. The effects of language translation techniques, features, and accuracy of various classifiers for multilingual sentiment analysis is also discussed in this study.

Keywords: cross-language analysis, machine learning, machine translation, sentiment analysis

Procedia PDF Downloads 682
25033 Polycode Texts in Communication of Antisocial Groups: Functional and Pragmatic Aspects

Authors: Ivan Potapov

Abstract:

Background: The aim of this paper is to investigate poly code texts in the communication of youth antisocial groups. Nowadays, the notion of a text has numerous interpretations. Besides all the approaches to defining a text, we must take into account semiotic and cultural-semiotic ones. Rapidly developing IT, world globalization, and new ways of coding of information increase the role of the cultural-semiotic approach. However, the development of computer technologies leads also to changes in the text itself. Polycode texts play a more and more important role in the everyday communication of the younger generation. Therefore, the research of functional and pragmatic aspects of both verbal and non-verbal content is actually quite important. Methods and Material: For this survey, we applied the combination of four methods of text investigation: not only intention and content analysis but also semantic and syntactic analysis. Using these methods provided us with information on general text properties, the content of transmitted messages, and each communicants’ intentions. Besides, during our research, we figured out the social background; therefore, we could distinguish intertextual connections between certain types of polycode texts. As the sources of the research material, we used 20 public channels in the popular messenger Telegram and data extracted from smartphones, which belonged to arrested members of antisocial groups. Findings: This investigation let us assert that polycode texts can be characterized as highly intertextual language unit. Moreover, we could outline the classification of these texts based on communicants’ intentions. The most common types of antisocial polycode texts are a call to illegal actions and agitation. What is more, each type has its own semantic core: it depends on the sphere of communication. However, syntactic structure is universal for most of the polycode texts. Conclusion: Polycode texts play important role in online communication. The results of this investigation demonstrate that in some social groups using these texts has a destructive influence on the younger generation and obviously needs further researches.

Keywords: text, polycode text, internet linguistics, text analysis, context, semiotics, sociolinguistics

Procedia PDF Downloads 107
25032 Managing Data from One Hundred Thousand Internet of Things Devices Globally for Mining Insights

Authors: Julian Wise

Abstract:

Newcrest Mining is one of the world’s top five gold and rare earth mining organizations by production, reserves and market capitalization in the world. This paper elaborates on the data acquisition processes employed by Newcrest in collaboration with Fortune 500 listed organization, Insight Enterprises, to standardize machine learning solutions which process data from over a hundred thousand distributed Internet of Things (IoT) devices located at mine sites globally. Through the utilization of software architecture cloud technologies and edge computing, the technological developments enable for standardized processes of machine learning applications to influence the strategic optimization of mineral processing. Target objectives of the machine learning optimizations include time savings on mineral processing, production efficiencies, risk identification, and increased production throughput. The data acquired and utilized for predictive modelling is processed through edge computing by resources collectively stored within a data lake. Being involved in the digital transformation has necessitated the standardization software architecture to manage the machine learning models submitted by vendors, to ensure effective automation and continuous improvements to the mineral process models. Operating at scale, the system processes hundreds of gigabytes of data per day from distributed mine sites across the globe, for the purposes of increased improved worker safety, and production efficiency through big data applications.

Keywords: mineral technology, big data, machine learning operations, data lake

Procedia PDF Downloads 86
25031 Resource Creation Using Natural Language Processing Techniques for Malay Translated Qur'an

Authors: Nor Diana Ahmad, Eric Atwell, Brandon Bennett

Abstract:

Text processing techniques for English have been developed for several decades. But for the Malay language, text processing methods are still far behind. Moreover, there are limited resources, tools for computational linguistic analysis available for the Malay language. Therefore, this research presents the use of natural language processing (NLP) in processing Malay translated Qur’an text. As the result, a new language resource for Malay translated Qur’an was created. This resource will help other researchers to build the necessary processing tools for the Malay language. This research also develops a simple question-answer prototype to demonstrate the use of the Malay Qur’an resource for text processing. This prototype has been developed using Python. The prototype pre-processes the Malay Qur’an and an input query using a stemming algorithm and then searches for occurrences of the query word stem. The result produced shows improved matching likelihood between user query and its answer. A POS-tagging algorithm has also been produced. The stemming and tagging algorithms can be used as tools for research related to other Malay texts and can be used to support applications such as information retrieval, question answering systems, ontology-based search and other text analysis tasks.

Keywords: language resource, Malay translated Qur'an, natural language processing (NLP), text processing

Procedia PDF Downloads 287
25030 Method of Complex Estimation of Text Perusal and Indicators of Reading Quality in Different Types of Commercials

Authors: Victor N. Anisimov, Lyubov A. Boyko, Yazgul R. Almukhametova, Natalia V. Galkina, Alexander V. Latanov

Abstract:

Modern commercials presented on billboards, TV and on the Internet contain a lot of information about the product or service in text form. However, this information cannot always be perceived and understood by consumers. Typical sociological focus group studies often cannot reveal important features of the interpretation and understanding information that has been read in text messages. In addition, there is no reliable method to determine the degree of understanding of the information contained in a text. Only the fact of viewing a text does not mean that consumer has perceived and understood the meaning of this text. At the same time, the tools based on marketing analysis allow only to indirectly estimate the process of reading and understanding a text. Therefore, the aim of this work is to develop a valid method of recording objective indicators in real time for assessing the fact of reading and the degree of text comprehension. Psychophysiological parameters recorded during text reading can form the basis for this objective method. We studied the relationship between multimodal psychophysiological parameters and the process of text comprehension during reading using the method of correlation analysis. We used eye-tracking technology to record eye movements parameters to estimate visual attention, electroencephalography (EEG) to assess cognitive load and polygraphic indicators (skin-galvanic reaction, SGR) that reflect the emotional state of the respondent during text reading. We revealed reliable interrelations between perceiving the information and the dynamics of psychophysiological parameters during reading the text in commercials. Eye movement parameters reflected the difficulties arising in respondents during perceiving ambiguous parts of text. EEG dynamics in rate of alpha band were related with cumulative effect of cognitive load. SGR dynamics were related with emotional state of the respondent and with the meaning of text and type of commercial. EEG and polygraph parameters together also reflected the mental difficulties of respondents in understanding text and showed significant differences in cases of low and high text comprehension. We also revealed differences in psychophysiological parameters for different type of commercials (static vs. video, financial vs. cinema vs. pharmaceutics vs. mobile communication, etc.). Conclusions: Our methodology allows to perform multimodal evaluation of text perusal and the quality of text reading in commercials. In general, our results indicate the possibility of designing an integral model to estimate the comprehension of reading the commercial text in percent scale based on all noticed markers.

Keywords: reading, commercials, eye movements, EEG, polygraphic indicators

Procedia PDF Downloads 142
25029 Text Based Shuffling Algorithm on Graphics Processing Unit for Digital Watermarking

Authors: Zayar Phyo, Ei Chaw Htoon

Abstract:

In a New-LSB based Steganography method, the Fisher-Yates algorithm is used to permute an existing array randomly. However, that algorithm performance became slower and occurred memory overflow problem while processing the large dimension of images. Therefore, the Text-Based Shuffling algorithm aimed to select only necessary pixels as hiding characters at the specific position of an image according to the length of the input text. In this paper, the enhanced text-based shuffling algorithm is presented with the powered of GPU to improve more excellent performance. The proposed algorithm employs the OpenCL Aparapi framework, along with XORShift Kernel including the Pseudo-Random Number Generator (PRNG) Kernel. PRNG is applied to produce random numbers inside the kernel of OpenCL. The experiment of the proposed algorithm is carried out by practicing GPU that it can perform faster-processing speed and better efficiency without getting the disruption of unnecessary operating system tasks.

Keywords: LSB based steganography, Fisher-Yates algorithm, text-based shuffling algorithm, OpenCL, XORShiftKernel

Procedia PDF Downloads 122
25028 The Application of Lesson Study Model in Writing Review Text in Junior High School

Authors: Sulastriningsih Djumingin

Abstract:

This study has some objectives. It aims at describing the ability of the second-grade students to write review text without applying the Lesson Study model at SMPN 18 Makassar. Second, it seeks to describe the ability of the second-grade students to write review text by applying the Lesson Study model at SMPN 18 Makassar. Third, it aims at testing the effectiveness of the Lesson Study model in writing review text at SMPN 18 Makassar. This research was true experimental design with posttest Only group design involving two groups consisting of one class of the control group and one class of the experimental group. The research populations were all the second-grade students at SMPN 18 Makassar amounted to 250 students consisting of 8 classes. The sampling technique was purposive sampling technique. The control class was VIII2 consisting of 30 students, while the experimental class was VIII8 consisting of 30 students. The research instruments were in the form of observation and tests. The collected data were analyzed using descriptive statistical techniques and inferential statistical techniques with t-test types processed using SPSS 21 for windows. The results shows that: (1) of 30 students in control class, there are only 14 (47%) students who get the score more than 7.5, categorized as inadequate; (2) in the experimental class, there are 26 (87%) students who obtain the score of 7.5, categorized as adequate; (3) the Lesson Study models is effective to be applied in writing review text. Based on the comparison of the ability of the control class and experimental class, it indicates that the value of t-count is greater than the value of t-table (2.411> 1.667). It means that the alternative hypothesis (H1) proposed by the researcher is accepted.

Keywords: application, lesson study, review text, writing

Procedia PDF Downloads 179
25027 Application of Knowledge Discovery in Database Techniques in Cost Overruns of Construction Projects

Authors: Mai Ghazal, Ahmed Hammad

Abstract:

Cost overruns in construction projects are considered as worldwide challenges since the cost performance is one of the main measures of success along with schedule performance. To overcome this problem, studies were conducted to investigate the cost overruns' factors, also projects' historical data were analyzed to extract new and useful knowledge from it. This research is studying and analyzing the effect of some factors causing cost overruns using the historical data from completed construction projects. Then, using these factors to estimate the probability of cost overrun occurrence and predict its percentage for future projects. First, an intensive literature review was done to study all the factors that cause cost overrun in construction projects, then another review was done for previous researcher papers about mining process in dealing with cost overruns. Second, a proposed data warehouse was structured which can be used by organizations to store their future data in a well-organized way so it can be easily analyzed later. Third twelve quantitative factors which their data are frequently available at construction projects were selected to be the analyzed factors and suggested predictors for the proposed model.

Keywords: construction management, construction projects, cost overrun, cost performance, data mining, data warehousing, knowledge discovery, knowledge management

Procedia PDF Downloads 336
25026 Mood Recognition Using Indian Music

Authors: Vishwa Joshi

Abstract:

The study of mood recognition in the field of music has gained a lot of momentum in the recent years with machine learning and data mining techniques and many audio features contributing considerably to analyze and identify the relation of mood plus music. In this paper we consider the same idea forward and come up with making an effort to build a system for automatic recognition of mood underlying the audio song’s clips by mining their audio features and have evaluated several data classification algorithms in order to learn, train and test the model describing the moods of these audio songs and developed an open source framework. Before classification, Preprocessing and Feature Extraction phase is necessary for removing noise and gathering features respectively.

Keywords: music, mood, features, classification

Procedia PDF Downloads 475
25025 Heart Failure Identification and Progression by Classifying Cardiac Patients

Authors: Muhammad Saqlain, Nazar Abbas Saqib, Muazzam A. Khan

Abstract:

Heart Failure (HF) has become the major health problem in our society. The prevalence of HF has increased as the patient’s ages and it is the major cause of the high mortality rate in adults. A successful identification and progression of HF can be helpful to reduce the individual and social burden from this syndrome. In this study, we use a real data set of cardiac patients to propose a classification model for the identification and progression of HF. The data set has divided into three age groups, namely young, adult, and old and then each age group have further classified into four classes according to patient’s current physical condition. Contemporary Data Mining classification algorithms have been applied to each individual class of every age group to identify the HF. Decision Tree (DT) gives the highest accuracy of 90% and outperform all other algorithms. Our model accurately diagnoses different stages of HF for each age group and it can be very useful for the early prediction of HF.

Keywords: decision tree, heart failure, data mining, classification model

Procedia PDF Downloads 382
25024 Role of Gender in Apparel Stores' Consumer Review: A Sentiment Analysis

Authors: Sarif Ullah Patwary, Matthew Heinrich, Brandon Payne

Abstract:

The ubiquity of web 2.0 platforms, in the form of wikis, social media (e.g., Facebook, Twitter, etc.) and online review portals (e.g., Yelp), helps shape today’s apparel consumers’ purchasing decision. Online reviews play important role towards consumers’ apparel purchase decision. Each of the consumer reviews carries a sentiment (positive, negative or neutral) towards products. Commercially, apparel brands and retailers analyze sentiment of this massive amount of consumer review data to update their inventory and bring new products in the market. The purpose of this study is to analyze consumer reviews of selected apparel stores with a view to understand, 1) the difference of sentiment expressed through men’s and woman’s text reviews, 2) the difference of sentiment expressed through men’s and woman’s star-based reviews, and 3) the difference of sentiment between star-based reviews and text-based reviews. A total of 9,363 reviews (1,713 men and 7,650 women) were collected using Yelp Dataset Challenge. Sentiment analysis of collected reviews was carried out in two dimensions: star-based reviews and text-based reviews. Sentiment towards apparel stores expressed through star-based reviews was deemed: 1) positive for 3 or 4 stars 2) negative for 1 or 2 stars and 3) neutral for 3 stars. Sentiment analysis of text-based reviews was carried out using Bing Liu dictionary. The analysis was conducted in IPyhton 5.0. Space. The sentiment analysis results revealed the percentage of positive text reviews by men (80%) and women (80%) were identical. Women reviewers (12%) provided more neutral (e.g., 3 out of 5 stars) star reviews than men (6%). Star-based reviews were more negative than the text-based reviews. In other words, while 80% men and women wrote positive reviews for the stores, less than 70% ended up giving 4 or 5 stars in those reviews. One of the key takeaways of the study is that star reviews provide slightly negative sentiment of the consumer reviews. Therefore, in order to understand sentiment towards apparel products, one might need to combine both star and text aspects of consumer reviews. This study used a specific dataset consisting of selected apparel stores from particular geographical locations (the information was not given for privacy concern). Future studies need to include more data from more stores and locations to generalize the findings of the study.

Keywords: apparel, consumer review, sentiment analysis, gender

Procedia PDF Downloads 139
25023 Intelligent Process Data Mining for Monitoring for Fault-Free Operation of Industrial Processes

Authors: Hyun-Woo Cho

Abstract:

The real-time fault monitoring and diagnosis of large scale production processes is helpful and necessary in order to operate industrial process safely and efficiently producing good final product quality. Unusual and abnormal events of the process may have a serious impact on the process such as malfunctions or breakdowns. This work try to utilize process measurement data obtained in an on-line basis for the safe and some fault-free operation of industrial processes. To this end, this work evaluated the proposed intelligent process data monitoring framework based on a simulation process. The monitoring scheme extracts the fault pattern in the reduced space for the reliable data representation. Moreover, this work shows the results of using linear and nonlinear techniques for the monitoring purpose. It has shown that the nonlinear technique produced more reliable monitoring results and outperforms linear methods. The adoption of the qualitative monitoring model helps to reduce the sensitivity of the fault pattern to noise.

Keywords: process data, data mining, process operation, real-time monitoring

Procedia PDF Downloads 610
25022 The Crisis of Turkey's Downing the Russian Warplane within the Concept of Country Branding: The Examples of BBC World, and Al Jazeera English

Authors: Derya Gül Ünlü, Oguz Kuş

Abstract:

The branding of a country means that the country has its own position different from other countries in its region and thus it is perceived more specifically. It is made possible by the branding efforts of a country and the uniqueness of all the national structures, by presenting it in a specific way, by creating the desired image and attracting tourists and foreign investors. Establishing a national brand involves, in a sense, the process of managing the perceptions of the citizens of the other country about the target country, by structuring the image of the country permanently and holistically. By this means, countries are not easily affected by their crisis of international relations. Therefore, within the scope of the research that will be carried out from this point, it is aimed to show how the warplane downing crisis between Turkey and Russia is perceived on social media. The Russian warplane was downed by Turkey on November 24, 2015, on the grounds that Turkey violated the airspace on the Syrian border. Whereupon the relations between the two countries have been tensed, and Russia has called on its citizens not to go to Turkey and citizens in Turkey to return to their countries. Moreover, relations between two countries have been weakened, for example, tourism tours organized in Russia to Turkey and visa-free travel were canceled and all military dialogue was cut off. After the event, various news sites on social media published plenty of news related to topic and the readers made various comments about the event and Turkey. In this context, an investigation into the perception of Turkey's national brand before and after the warplane downing crisis has been conducted. through comments fetched from the reports on the BBC World, and from Al Jazeera English news sites on Facebook accounts, which takes place widely in the social media. In order to realize study, user comments were fetched from jet downing-related news which are published on Facebook fan-page of BBC World Service, and Al Jazeera English. Regarding this, all the news published between 24.10.2015-24.12.2015 and containing Turk and Turkey keyword in its title composed data set of our study. Afterwards, comments written to these news were analyzed via text mining technique. Furthermore, by sentiment analysis, it was intended to reveal reader’s emotions before and after the crisis.

Keywords: Al Jazeera English, BBC World, country branding, social media, text mining

Procedia PDF Downloads 195
25021 Facilitating Written Biology Assessment in Large-Enrollment Courses Using Machine Learning

Authors: Luanna B. Prevost, Kelli Carter, Margaurete Romero, Kirsti Martinez

Abstract:

Writing is an essential scientific practice, yet, in several countries, the increasing university science class-size limits the use of written assessments. Written assessments allow students to demonstrate their learning in their own words and permit the faculty to evaluate students’ understanding. However, the time and resources required to grade written assessments prohibit their use in large-enrollment science courses. This study examined the use of machine learning algorithms to automatically analyze student writing and provide timely feedback to the faculty about students' writing in biology. Written responses to questions about matter and energy transformation were collected from large-enrollment undergraduate introductory biology classrooms. Responses were analyzed using the LightSide text mining and classification software. Cohen’s Kappa was used to measure agreement between the LightSide models and human raters. Predictive models achieved agreement with human coding of 0.7 Cohen’s Kappa or greater. Models captured that when writing about matter-energy transformation at the ecosystem level, students focused on primarily on the concepts of heat loss, recycling of matter, and conservation of matter and energy. Models were also produced to capture writing about processes such as decomposition and biochemical cycling. The models created in this study can be used to provide automatic feedback about students understanding of these concepts to biology faculty who desire to use formative written assessments in larger enrollment biology classes, but do not have the time or personnel for manual grading.

Keywords: machine learning, written assessment, biology education, text mining

Procedia PDF Downloads 249
25020 Improved FP-Growth Algorithm with Multiple Minimum Supports Using Maximum Constraints

Authors: Elsayeda M. Elgaml, Dina M. Ibrahim, Elsayed A. Sallam

Abstract:

Association rule mining is one of the most important fields of data mining and knowledge discovery. In this paper, we propose an efficient multiple support frequent pattern growth algorithm which we called “MSFP-growth” that enhancing the FP-growth algorithm by making infrequent child node pruning step with multiple minimum support using maximum constrains. The algorithm is implemented, and it is compared with other common algorithms: Apriori-multiple minimum supports using maximum constraints and FP-growth. The experimental results show that the rule mining from the proposed algorithm are interesting and our algorithm achieved better performance than other algorithms without scarifying the accuracy.

Keywords: association rules, FP-growth, multiple minimum supports, Weka tool

Procedia PDF Downloads 452
25019 Analysis of Changes Being Done of the Mine Legislation of Turkey: Mining Operation Activity Process

Authors: Taşkın Deniz Yıldız, Mustafa Topaloğlu, Orhan Kural

Abstract:

The right to operate a fairly long periods of prior periods and after the 3213 Mining Law has been observed to be shortened in Turkey. Permit the realization of business activities (or concession) requested the purchase of the mine operated "found mine" position, as well as the financial and technical capability to have the owner of the right to operate the mines as well as the principle of equality is important in terms of assessing the best way be. In particular, in this context, license fields "negligence" (downsizing) have noted that the current arrangement for all periods. However, in the period after 3213 Mining Act and a permit to operate more effectively within the framework of implementation of negligence is laid down.

Keywords: mining legislation, operation, permit, Turkey

Procedia PDF Downloads 378
25018 Improved Classification Procedure for Imbalanced and Overlapped Situations

Authors: Hankyu Lee, Seoung Bum Kim

Abstract:

The issue with imbalance and overlapping in the class distribution becomes important in various applications of data mining. The imbalanced dataset is a special case in classification problems in which the number of observations of one class (i.e., major class) heavily exceeds the number of observations of the other class (i.e., minor class). Overlapped dataset is the case where many observations are shared together between the two classes. Imbalanced and overlapped data can be frequently found in many real examples including fraud and abuse patients in healthcare, quality prediction in manufacturing, text classification, oil spill detection, remote sensing, and so on. The class imbalance and overlap problem is the challenging issue because this situation degrades the performance of most of the standard classification algorithms. In this study, we propose a classification procedure that can effectively handle imbalanced and overlapped datasets by splitting data space into three parts: nonoverlapping, light overlapping, and severe overlapping and applying the classification algorithm in each part. These three parts were determined based on the Hausdorff distance and the margin of the modified support vector machine. An experiments study was conducted to examine the properties of the proposed method and compared it with other classification algorithms. The results showed that the proposed method outperformed the competitors under various imbalanced and overlapped situations. Moreover, the applicability of the proposed method was demonstrated through the experiment with real data.

Keywords: classification, imbalanced data with class overlap, split data space, support vector machine

Procedia PDF Downloads 279
25017 Text Analysis to Support Structuring and Modelling a Public Policy Problem-Outline of an Algorithm to Extract Inferences from Textual Data

Authors: Claudia Ehrentraut, Osama Ibrahim, Hercules Dalianis

Abstract:

Policy making situations are real-world problems that exhibit complexity in that they are composed of many interrelated problems and issues. To be effective, policies must holistically address the complexity of the situation rather than propose solutions to single problems. Formulating and understanding the situation and its complex dynamics, therefore, is a key to finding holistic solutions. Analysis of text based information on the policy problem, using Natural Language Processing (NLP) and Text analysis techniques, can support modelling of public policy problem situations in a more objective way based on domain experts knowledge and scientific evidence. The objective behind this study is to support modelling of public policy problem situations, using text analysis of verbal descriptions of the problem. We propose a formal methodology for analysis of qualitative data from multiple information sources on a policy problem to construct a causal diagram of the problem. The analysis process aims at identifying key variables, linking them by cause-effect relationships and mapping that structure into a graphical representation that is adequate for designing action alternatives, i.e., policy options. This study describes the outline of an algorithm used to automate the initial step of a larger methodological approach, which is so far done manually. In this initial step, inferences about key variables and their interrelationships are extracted from textual data to support a better problem structuring. A small prototype for this step is also presented.

Keywords: public policy, problem structuring, qualitative analysis, natural language processing, algorithm, inference extraction

Procedia PDF Downloads 562
25016 Automatic Lead Qualification with Opinion Mining in Customer Relationship Management Projects

Authors: Victor Radich, Tania Basso, Regina Moraes

Abstract:

Lead qualification is one of the main procedures in Customer Relationship Management (CRM) projects. Its main goal is to identify potential consumers who have the ideal characteristics to establish a profitable and long-term relationship with a certain organization. Social networks can be an important source of data for identifying and qualifying leads since interest in specific products or services can be identified from the users’ expressed feelings of (dis)satisfaction. In this context, this work proposes the use of machine learning techniques and sentiment analysis as an extra step in the lead qualification process in order to improve it. In addition to machine learning models, sentiment analysis or opinion mining can be used to understand the evaluation that the user makes of a particular service, product, or brand. The results obtained so far have shown that it is possible to extract data from social networks and combine the techniques for a more complete classification.

Keywords: lead qualification, sentiment analysis, opinion mining, machine learning, CRM, lead scoring

Procedia PDF Downloads 49
25015 Semantic Textual Similarity on Contracts: Exploring Multiple Negative Ranking Losses for Sentence Transformers

Authors: Yogendra Sisodia

Abstract:

Researchers are becoming more interested in extracting useful information from legal documents thanks to the development of large-scale language models in natural language processing (NLP), and deep learning has accelerated the creation of powerful text mining models. Legal fields like contracts benefit greatly from semantic text search since it makes it quick and easy to find related clauses. After collecting sentence embeddings, it is relatively simple to locate sentences with a comparable meaning throughout the entire legal corpus. The author of this research investigated two pre-trained language models for this task: MiniLM and Roberta, and further fine-tuned them on Legal Contracts. The author used Multiple Negative Ranking Loss for the creation of sentence transformers. The fine-tuned language models and sentence transformers showed promising results.

Keywords: legal contracts, multiple negative ranking loss, natural language inference, sentence transformers, semantic textual similarity

Procedia PDF Downloads 72
25014 An Experimental Study for Assessing Email Classification Attributes Using Feature Selection Methods

Authors: Issa Qabaja, Fadi Thabtah

Abstract:

Email phishing classification is one of the vital problems in the online security research domain that have attracted several scholars due to its impact on the users payments performed daily online. One aspect to reach a good performance by the detection algorithms in the email phishing problem is to identify the minimal set of features that significantly have an impact on raising the phishing detection rate. This paper investigate three known feature selection methods named Information Gain (IG), Chi-square and Correlation Features Set (CFS) on the email phishing problem to separate high influential features from low influential ones in phishing detection. We measure the degree of influentially by applying four data mining algorithms on a large set of features. We compare the accuracy of these algorithms on the complete features set before feature selection has been applied and after feature selection has been applied. After conducting experiments, the results show 12 common significant features have been chosen among the considered features by the feature selection methods. Further, the average detection accuracy derived by the data mining algorithms on the reduced 12-features set was very slight affected when compared with the one derived from the 47-features set.

Keywords: data mining, email classification, phishing, online security

Procedia PDF Downloads 407
25013 Investigating Data Normalization Techniques in Swarm Intelligence Forecasting for Energy Commodity Spot Price

Authors: Yuhanis Yusof, Zuriani Mustaffa, Siti Sakira Kamaruddin

Abstract:

Data mining is a fundamental technique in identifying patterns from large data sets. The extracted facts and patterns contribute in various domains such as marketing, forecasting, and medical. Prior to that, data are consolidated so that the resulting mining process may be more efficient. This study investigates the effect of different data normalization techniques, which are Min-max, Z-score, and decimal scaling, on Swarm-based forecasting models. Recent swarm intelligence algorithms employed includes the Grey Wolf Optimizer (GWO) and Artificial Bee Colony (ABC). Forecasting models are later developed to predict the daily spot price of crude oil and gasoline. Results showed that GWO works better with Z-score normalization technique while ABC produces better accuracy with the Min-Max. Nevertheless, the GWO is more superior that ABC as its model generates the highest accuracy for both crude oil and gasoline price. Such a result indicates that GWO is a promising competitor in the family of swarm intelligence algorithms.

Keywords: artificial bee colony, data normalization, forecasting, Grey Wolf optimizer

Procedia PDF Downloads 449
25012 Artificial Intelligence as a User of Copyrighted Work: Descriptive Study

Authors: Dominika Collett

Abstract:

AI applications, such as machine learning, require access to a vast amount of data in the training phase, which can often be the subject of copyright protection. During later usage, the various content with which the application works can be recorded or made available on the basis of which it produces the resulting output. The EU has recently adopted new legislation to secure machine access to protected works under the DSM Directive; but, the issue of machine use of copyright works is not clearly addressed. However, such clarity is needed regarding the increasing importance of AI and its development. Therefore, this paper provides a basic background of the technology used in the development of applications in the field of computer creativity. The second part of the paper then will focus on a legal analysis of machine use of the authors' works from the perspective of existing European and Czech legislation. The main results of the paper discuss the potential collision of existing legislation in regards to machine use of works with special focus on exceptions and limitations. The legal regulation of machine use of copyright work will impact the development of AI technology.

Keywords: copyright, artificial intelligence, legal use, infringement, Czech law, EU law, text and data mining

Procedia PDF Downloads 102
25011 Leveraging Power BI for Advanced Geotechnical Data Analysis and Visualization in Mining Projects

Authors: Elaheh Talebi, Fariba Yavari, Lucy Philip, Lesley Town

Abstract:

The mining industry generates vast amounts of data, necessitating robust data management systems and advanced analytics tools to achieve better decision-making processes in the development of mining production and maintaining safety. This paper highlights the advantages of Power BI, a powerful intelligence tool, over traditional Excel-based approaches for effectively managing and harnessing mining data. Power BI enables professionals to connect and integrate multiple data sources, ensuring real-time access to up-to-date information. Its interactive visualizations and dashboards offer an intuitive interface for exploring and analyzing geotechnical data. Advanced analytics is a collection of data analysis techniques to improve decision-making. Leveraging some of the most complex techniques in data science, advanced analytics is used to do everything from detecting data errors and ensuring data accuracy to directing the development of future project phases. However, while Power BI is a robust tool, specific visualizations required by geotechnical engineers may have limitations. This paper studies the capability to use Python or R programming within the Power BI dashboard to enable advanced analytics, additional functionalities, and customized visualizations. This dashboard provides comprehensive tools for analyzing and visualizing key geotechnical data metrics, including spatial representation on maps, field and lab test results, and subsurface rock and soil characteristics. Advanced visualizations like borehole logs and Stereonet were implemented using Python programming within the Power BI dashboard, enhancing the understanding and communication of geotechnical information. Moreover, the dashboard's flexibility allows for the incorporation of additional data and visualizations based on the project scope and available data, such as pit design, rock fall analyses, rock mass characterization, and drone data. This further enhances the dashboard's usefulness in future projects, including operation, development, closure, and rehabilitation phases. Additionally, this helps in minimizing the necessity of utilizing multiple software programs in projects. This geotechnical dashboard in Power BI serves as a user-friendly solution for analyzing, visualizing, and communicating both new and historical geotechnical data, aiding in informed decision-making and efficient project management throughout various project stages. Its ability to generate dynamic reports and share them with clients in a collaborative manner further enhances decision-making processes and facilitates effective communication within geotechnical projects in the mining industry.

Keywords: geotechnical data analysis, power BI, visualization, decision-making, mining industry

Procedia PDF Downloads 59
25010 Road Traffic Accidents Analysis in Mexico City through Crowdsourcing Data and Data Mining Techniques

Authors: Gabriela V. Angeles Perez, Jose Castillejos Lopez, Araceli L. Reyes Cabello, Emilio Bravo Grajales, Adriana Perez Espinosa, Jose L. Quiroz Fabian

Abstract:

Road traffic accidents are among the principal causes of traffic congestion, causing human losses, damages to health and the environment, economic losses and material damages. Studies about traditional road traffic accidents in urban zones represents very high inversion of time and money, additionally, the result are not current. However, nowadays in many countries, the crowdsourced GPS based traffic and navigation apps have emerged as an important source of information to low cost to studies of road traffic accidents and urban congestion caused by them. In this article we identified the zones, roads and specific time in the CDMX in which the largest number of road traffic accidents are concentrated during 2016. We built a database compiling information obtained from the social network known as Waze. The methodology employed was Discovery of knowledge in the database (KDD) for the discovery of patterns in the accidents reports. Furthermore, using data mining techniques with the help of Weka. The selected algorithms was the Maximization of Expectations (EM) to obtain the number ideal of clusters for the data and k-means as a grouping method. Finally, the results were visualized with the Geographic Information System QGIS.

Keywords: data mining, k-means, road traffic accidents, Waze, Weka

Procedia PDF Downloads 380
25009 The Effects of Three Pre-Reading Activities (Text Summary, Vocabulary Definition, and Pre-Passage Questions) on the Reading Comprehension of Iranian EFL Learners

Authors: Leila Anjomshoa, Firooz Sadighi

Abstract:

This study investigated the effects of three types of pre-reading activities (vocabulary definitions, text summary and pre-passage questions) on EFL learners’ English reading comprehension. On the basis of the results of a placement test administered to two hundred and thirty English students at Kerman Azad University, 200 subjects (one hundred intermediate and one hundred advanced) were selected.Four texts, two of them at intermediate level and two of them at advanced level were chosen. The data gathered was subjected to the statistical procedures of ANOVA. A close examination of the results through Tukey’s HSD showed the fact that the experimental groups performed better than the control group, highlighting the effect of the treatment on them. Also, the experimental group C (text summary), performed remarkably better than the other three groups (both experimental & control). Group B subjects, vocabulary definitions, performed better than groups A and D. The pre-passage questions group’s (D) performance showed higher scores than the control condition.

Keywords: pre-reading activities, text summary, vocabulary definition, and pre-passage questions, reading comprehension

Procedia PDF Downloads 322
25008 Recognition of Cursive Arabic Handwritten Text Using Embedded Training Based on Hidden Markov Models (HMMs)

Authors: Rabi Mouhcine, Amrouch Mustapha, Mahani Zouhir, Mammass Driss

Abstract:

In this paper, we present a system for offline recognition cursive Arabic handwritten text based on Hidden Markov Models (HMMs). The system is analytical without explicit segmentation used embedded training to perform and enhance the character models. Extraction features preceded by baseline estimation are statistical and geometric to integrate both the peculiarities of the text and the pixel distribution characteristics in the word image. These features are modelled using hidden Markov models and trained by embedded training. The experiments on images of the benchmark IFN/ENIT database show that the proposed system improves recognition.

Keywords: recognition, handwriting, Arabic text, HMMs, embedded training

Procedia PDF Downloads 325
25007 Cirrhosis Mortality Prediction as Classification using Frequent Subgraph Mining

Authors: Abdolghani Ebrahimi, Diego Klabjan, Chenxi Ge, Daniela Ladner, Parker Stride

Abstract:

In this work, we use machine learning and novel data analysis techniques to predict the one-year mortality of cirrhotic patients. Data from 2,322 patients with liver cirrhosis are collected at a single medical center. Different machine learning models are applied to predict one-year mortality. A comprehensive feature space including demographic information, comorbidity, clinical procedure and laboratory tests is being analyzed. A temporal pattern mining technic called Frequent Subgraph Mining (FSM) is being used. Model for End-stage liver disease (MELD) prediction of mortality is used as a comparator. All of our models statistically significantly outperform the MELD-score model and show an average 10% improvement of the area under the curve (AUC). The FSM technic itself does not improve the model significantly, but FSM, together with a machine learning technique called an ensemble, further improves the model performance. With the abundance of data available in healthcare through electronic health records (EHR), existing predictive models can be refined to identify and treat patients at risk for higher mortality. However, due to the sparsity of the temporal information needed by FSM, the FSM model does not yield significant improvements. To the best of our knowledge, this is the first work to apply modern machine learning algorithms and data analysis methods on predicting one-year mortality of cirrhotic patients and builds a model that predicts one-year mortality significantly more accurate than the MELD score. We have also tested the potential of FSM and provided a new perspective of the importance of clinical features.

Keywords: machine learning, liver cirrhosis, subgraph mining, supervised learning

Procedia PDF Downloads 109
25006 Development of Innovative Islamic Web Applications

Authors: Farrukh Shahzad

Abstract:

The rich Islamic resources related to religious text, Islamic sciences, and history are widely available in print and in electronic format online. However, most of these works are only available in Arabic language. In this research, an attempt is made to utilize these resources to create interactive web applications in Arabic, English and other languages. The system utilizes the Pattern Recognition, Knowledge Management, Data Mining, Information Retrieval and Management, Indexing, storage and data-analysis techniques to parse, store, convert and manage the information from authentic Arabic resources. These interactive web Apps provide smart multi-lingual search, tree based search, on-demand information matching and linking. In this paper, we provide details of application architecture, design, implementation and technologies employed. We also presented the summary of web applications already developed. We have also included some screen shots from the corresponding web sites. These web applications provide an Innovative On-line Learning Systems (eLearning and computer based education).

Keywords: Islamic resources, Muslim scholars, hadith, narrators, history, fiqh

Procedia PDF Downloads 259
25005 Poetics of the Connecting ha’: A Textual Study in the Poetry of Al-Husari Al-Qayrawani

Authors: Mahmoud al-Ashiriy

Abstract:

This paper begins from the idea that the real history of literature is the history of its style. And since the rhyme –as known- is not merely the last letter, that have received a lot of analysis and investigation, but it is a collection of other values in addition to its different markings. This paper will explore the work of the connecting ha’ and its effectiveness in shaping the text of poetry, since it establishes vocal rhythms in addition to its role in indicating references through the pronoun, vertically through the poem through the sequence of its verses, also horizontally through what environs the one verse of sentences. If the scientific formation of prosody stopped at the possibilities and prohibitions; literary criticism and poetry studies should explore what is above the rule of aesthetic horizon of poetic effectiveness that varies from a text to another, a poet to another, a literary period to another, or from a poetic taste to another. Then the paper will explore this poetic essence in the texts of the famous Andalusian Poet Al-Husari Al-Qayrawani through his well-known Daliyya (a poem that its verses end with the letter D), and the role of the connecting ha’ in fulfilling its text and the accomplishment of its poetics, departing from this to the diwan (the big collection of poems) also as a higher text that surpasses the text/poem, and through what it represents of effectiveness the work of the phenomenon in accomplishing the poetics of the poem of Al-Husari Al-Qayrawani who is one of the pillars of Arabic poetics in Andalusia.

Keywords: Al-Husari Al-Qayrawni, poetics, rhyme, stylistics, science of the text

Procedia PDF Downloads 538