Search results for: document classifier
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1069

Search results for: document classifier

949 Improving Subjective Bias Detection Using Bidirectional Encoder Representations from Transformers and Bidirectional Long Short-Term Memory

Authors: Ebipatei Victoria Tunyan, T. A. Cao, Cheol Young Ock

Abstract:

Detecting subjectively biased statements is a vital task. This is because this kind of bias, when present in the text or other forms of information dissemination media such as news, social media, scientific texts, and encyclopedias, can weaken trust in the information and stir conflicts amongst consumers. Subjective bias detection is also critical for many Natural Language Processing (NLP) tasks like sentiment analysis, opinion identification, and bias neutralization. Having a system that can adequately detect subjectivity in text will boost research in the above-mentioned areas significantly. It can also come in handy for platforms like Wikipedia, where the use of neutral language is of importance. The goal of this work is to identify the subjectively biased language in text on a sentence level. With machine learning, we can solve complex AI problems, making it a good fit for the problem of subjective bias detection. A key step in this approach is to train a classifier based on BERT (Bidirectional Encoder Representations from Transformers) as upstream model. BERT by itself can be used as a classifier; however, in this study, we use BERT as data preprocessor as well as an embedding generator for a Bi-LSTM (Bidirectional Long Short-Term Memory) network incorporated with attention mechanism. This approach produces a deeper and better classifier. We evaluate the effectiveness of our model using the Wiki Neutrality Corpus (WNC), which was compiled from Wikipedia edits that removed various biased instances from sentences as a benchmark dataset, with which we also compare our model to existing approaches. Experimental analysis indicates an improved performance, as our model achieved state-of-the-art accuracy in detecting subjective bias. This study focuses on the English language, but the model can be fine-tuned to accommodate other languages.

Keywords: subjective bias detection, machine learning, BERT–BiLSTM–Attention, text classification, natural language processing

Procedia PDF Downloads 100
948 A Quantitative Evaluation of Text Feature Selection Methods

Authors: B. S. Harish, M. B. Revanasiddappa

Abstract:

Due to rapid growth of text documents in digital form, automated text classification has become an important research in the last two decades. The major challenge of text document representations are high dimension, sparsity, volume and semantics. Since the terms are only features that can be found in documents, selection of good terms (features) plays an very important role. In text classification, feature selection is a strategy that can be used to improve classification effectiveness, computational efficiency and accuracy. In this paper, we present a quantitative analysis of most widely used feature selection (FS) methods, viz. Term Frequency-Inverse Document Frequency (tfidf ), Mutual Information (MI), Information Gain (IG), CHISquare (x2), Term Frequency-Relevance Frequency (tfrf ), Term Strength (TS), Ambiguity Measure (AM) and Symbolic Feature Selection (SFS) to classify text documents. We evaluated all the feature selection methods on standard datasets like 20 Newsgroups, 4 University dataset and Reuters-21578.

Keywords: classifiers, feature selection, text classification

Procedia PDF Downloads 427
947 Control Configuration System as a Key Element in Distributed Control System

Authors: Goodarz Sabetian, Sajjad Moshfe

Abstract:

Control system for hi-tech industries could be realized generally and deeply by a special document. Vast heavy industries such as power plants with a large number of I/O signals are controlled by a distributed control system (DCS). This system comprises of so many parts from field level to high control level, and junior instrument engineers may be confused by this enormous information. The key document which can solve this problem is “control configuration system diagram” for each type of DCS. This is a road map that covers all of activities respect to control system in each industrial plant and inevitable to be studied by whom corresponded. It plays an important role from designing control system start point until the end; deliver the system to operate. This should be inserted in bid documents, contracts, purchasing specification and used in different periods of project EPC (engineering, procurement, and construction). Separate parts of DCS are categorized here in order of importance and a brief description and some practical plan is offered. This article could be useful for all instrument and control engineers who worked is EPC projects.

Keywords: control, configuration, DCS, power plant, bus

Procedia PDF Downloads 468
946 Applications of Visual Ethnography in Public Anthropology

Authors: Subramaniam Panneerselvam, Gunanithi Perumal, KP Subin

Abstract:

The Visual Ethnography is used to document the culture of a community through a visual means. It could be either photography or audio-visual documentation. The visual ethnographic techniques are widely used in visual anthropology. The visual anthropologists use the camera to capture the cultural image of the studied community. There is a scope for subjectivity while the culture is documented by an external person. But the upcoming of the public anthropology provides an opportunity for the participants to document their own culture. There is a need to equip the participants with the skill of doing visual ethnography. The mobile phone technology provides visual documentation facility to everyone to capture the moments instantly. The visual ethnography facilitates the multiple-interpretation for the audiences. This study explores the effectiveness of visual ethnography among the tribal youth through public anthropology perspective. The case study was conducted to equip the tribal youth of Nilgiris in visual ethnography and the outcome of the experiment shared in this paper.

Keywords: visual ethnography, visual anthropology, public anthropology, multiple-interpretation, case study

Procedia PDF Downloads 143
945 Development of Distance Training Packages for Teacher on Education Management for Learners with Special Needs

Authors: Jareeluk Ratanaphan

Abstract:

The purposed of this research were; 1. To survey the teacher’s needs on knowledge about special education management for special needs student 2. Development of distance training packages for teacher on special education management for special needs student 3. to study the effects of using the packages on trainee’s achievement 4. to study the effects of using the packages on trainee’s opinion on the distance training packages. The design of the experiment was research and development. The research sample for survey were 86 teachers, and 22 teachers for study the effects of using the packages on achievement and opinion. The research instrument comprised: 1) training packages on special education management for special needs student 2) achievement test 3) questionnaire. Mean, percentage, standard deviation, t-test and content analysis were used for data analysis. The findings of the research were as follows: 1. The teacher’s needs on knowledge about teaching for a learner with learning disability, mental retardation, autism, physical and health impairment and research in special education. 2. The package composed of special education management for special needs student document and manual of distance training packages. The document consisted by the name of packages, the explanation for the educator, content’s structure, concept, objectives, content and activities. Manual of distance training packages consisted by the explanation about a document, objectives, explanation about using the package, training schedule, and evaluation. The efficiency of packages was established at 79.50/81.35. 3. The results of using the packages were the posttest average scores of trainee’s achievement were higher than the pretest. 4. The trainee’s opinion on the package was at the highest level.

Keywords: distance training package, teacher, learner with special needs

Procedia PDF Downloads 465
944 Semantic Indexing Improvement for Textual Documents: Contribution of Classification by Fuzzy Association Rules

Authors: Mohsen Maraoui

Abstract:

In the aim of natural language processing applications improvement, such as information retrieval, machine translation, lexical disambiguation, we focus on statistical approach to semantic indexing for multilingual text documents based on conceptual network formalism. We propose to use this formalism as an indexing language to represent the descriptive concepts and their weighting. These concepts represent the content of the document. Our contribution is based on two steps. In the first step, we propose the extraction of index terms using the multilingual lexical resource Euro WordNet (EWN). In the second step, we pass from the representation of index terms to the representation of index concepts through conceptual network formalism. This network is generated using the EWN resource and pass by a classification step based on association rules model (in attempt to discover the non-taxonomic relations or contextual relations between the concepts of a document). These relations are latent relations buried in the text and carried by the semantic context of the co-occurrence of concepts in the document. Our proposed indexing approach can be applied to text documents in various languages because it is based on a linguistic method adapted to the language through a multilingual thesaurus. Next, we apply the same statistical process regardless of the language in order to extract the significant concepts and their associated weights. We prove that the proposed indexing approach provides encouraging results.

Keywords: concept extraction, conceptual network formalism, fuzzy association rules, multilingual thesaurus, semantic indexing

Procedia PDF Downloads 122
943 Understanding Embryology in Promoting Peace Leadership: A Document Review

Authors: Vasudev Das

Abstract:

The specific problem is that many leaders of the 21st century do not understand that the extermination of embryos wreaks havoc on peace leadership. The purpose of the document review is to understand embryology in facilitating peace leadership. Extermination of human embryos generates a requital wave of violence which later falls on human society in the form of disturbances, considering that violence breeds further violence as a consequentiality. The study results reveal that a deep understanding of embryology facilitates peace leadership, given that minimizing embryo extermination enhances non-violence in the global village. Neo-Newtonians subscribe to the idea that every action has an equal and opposite reaction. The US Federal Government recognizes the embryo or fetus as a member of Homo sapiens. The social change implications of this study are that understanding human embryology promotes peace leadership, considering that the consequentiality of embryo extermination can serve as a deterrent for violence on embryos.

Keywords: consequentiality, Homo sapiens, neo-Newtonians, violence

Procedia PDF Downloads 112
942 Adaptation of Projection Profile Algorithm for Skewed Handwritten Text Line Detection

Authors: Kayode A. Olaniyi, Tola. M. Osifeko, Adeola A. Ogunleye

Abstract:

Text line segmentation is an important step in document image processing. It represents a labeling process that assigns the same label using distance metric probability to spatially aligned units. Text line detection techniques have successfully been implemented mainly in printed documents. However, processing of the handwritten texts especially unconstrained documents has remained a key problem. This is because the unconstrained hand-written text lines are often not uniformly skewed. The spaces between text lines may not be obvious, complicated by the nature of handwriting and, overlapping ascenders and/or descenders of some characters. Hence, text lines detection and segmentation represents a leading challenge in handwritten document image processing. Text line detection methods that rely on the traditional global projection profile of the text document cannot efficiently confront with the problem of variable skew angles between different text lines. Hence, the formulation of a horizontal line as a separator is often not efficient. This paper presents a technique to segment a handwritten document into distinct lines of text. The proposed algorithm starts, by partitioning the initial text image into columns, across its width into chunks of about 5% each. At each vertical strip of 5%, the histogram of horizontal runs is projected. We have worked with the assumption that text appearing in a single strip is almost parallel to each other. The algorithm developed provides a sliding window through the first vertical strip on the left side of the page. It runs through to identify the new minimum corresponding to a valley in the projection profile. Each valley would represent the starting point of the orientation line and the ending point is the minimum point on the projection profile of the next vertical strip. The derived text-lines traverse around any obstructing handwritten vertical strips of connected component by associating it to either the line above or below. A decision of associating such connected component is made by the probability obtained from a distance metric decision. The technique outperforms the global projection profile for text line segmentation and it is robust to handle skewed documents and those with lines running into each other.

Keywords: connected-component, projection-profile, segmentation, text-line

Procedia PDF Downloads 96
941 Offline Signature Verification Using Minutiae and Curvature Orientation

Authors: Khaled Nagaty, Heba Nagaty, Gerard McKee

Abstract:

A signature is a behavioral biometric that is used for authenticating users in most financial and legal transactions. Signatures can be easily forged by skilled forgers. Therefore, it is essential to verify whether a signature is genuine or forged. The aim of any signature verification algorithm is to accommodate the differences between signatures of the same person and increase the ability to discriminate between signatures of different persons. This work presented in this paper proposes an automatic signature verification system to indicate whether a signature is genuine or not. The system comprises four phases: (1) The pre-processing phase in which image scaling, binarization, image rotation, dilation, thinning, and connecting ridge breaks are applied. (2) The feature extraction phase in which global and local features are extracted. The local features are minutiae points, curvature orientation, and curve plateau. The global features are signature area, signature aspect ratio, and Hu moments. (3) The post-processing phase, in which false minutiae are removed. (4) The classification phase in which features are enhanced before feeding it into the classifier. k-nearest neighbors and support vector machines are used. The classifier was trained on a benchmark dataset to compare the performance of the proposed offline signature verification system against the state-of-the-art. The accuracy of the proposed system is 92.3%.

Keywords: signature, ridge breaks, minutiae, orientation

Procedia PDF Downloads 123
940 Detecting and Thwarting Interest Flooding Attack in Information Centric Network

Authors: Vimala Rani P, Narasimha Malikarjunan, Mercy Shalinie S

Abstract:

Data Networking was brought forth as an instantiation of information-centric networking. The attackers can send a colossal number of spoofs to take hold of the Pending Interest Table (PIT) named an Interest Flooding attack (IFA) since the in- interests are recorded in the PITs of the intermediate routers until they receive corresponding Data Packets are go beyond the time limit. These attacks can be detrimental to network performance. PIT expiration rate or the Interest satisfaction rate, which cannot differentiate the IFA from attacks, is the criterion Traditional IFA detection techniques are concerned with. Threshold values can casually affect Threshold-based traditional methods. This article proposes an accurate IFA detection mechanism based on a Multiple Feature-based Extreme Learning Machine (MF-ELM). Accuracy of the attack detection can be increased by presenting the entropy of Internet names, Interest satisfaction rate and PIT usage as features extracted in the MF-ELM classifier. Furthermore, we deploy a queue-based hostile Interest prefix mitigation mechanism. The inference of this real-time test bed is that the mechanism can help the network to resist IFA with higher accuracy and efficiency.

Keywords: information-centric network, pending interest table, interest flooding attack, MF-ELM classifier, queue-based mitigation strategy

Procedia PDF Downloads 181
939 Predicting Success and Failure in Drug Development Using Text Analysis

Authors: Zhi Hao Chow, Cian Mulligan, Jack Walsh, Antonio Garzon Vico, Dimitar Krastev

Abstract:

Drug development is resource-intensive, time-consuming, and increasingly expensive with each developmental stage. The success rates of drug development are also relatively low, and the resources committed are wasted with each failed candidate. As such, a reliable method of predicting the success of drug development is in demand. The hypothesis was that some examples of failed drug candidates are pushed through developmental pipelines based on false confidence and may possess common linguistic features identifiable through sentiment analysis. Here, the concept of using text analysis to discover such features in research publications and investor reports as predictors of success was explored. R studios were used to perform text mining and lexicon-based sentiment analysis to identify affective phrases and determine their frequency in each document, then using SPSS to determine the relationship between our defined variables and the accuracy of predicting outcomes. A total of 161 publications were collected and categorised into 4 groups: (i) Cancer treatment, (ii) Neurodegenerative disease treatment, (iii) Vaccines, and (iv) Others (containing all other drugs that do not fit into the 3 categories). Text analysis was then performed on each document using 2 separate datasets (BING and AFINN) in R within the category of drugs to determine the frequency of positive or negative phrases in each document. A relative positivity and negativity value were then calculated by dividing the frequency of phrases with the word count of each document. Regression analysis was then performed with SPSS statistical software on each dataset (values from using BING or AFINN dataset during text analysis) using a random selection of 61 documents to construct a model. The remaining documents were then used to determine the predictive power of the models. Model constructed from BING predicts the outcome of drug performance in clinical trials with an overall percentage of 65.3%. AFINN model had a lower accuracy at predicting outcomes compared to the BING model at 62.5% but was not effective at predicting the failure of drugs in clinical trials. Overall, the study did not show significant efficacy of the model at predicting outcomes of drugs in development. Many improvements may need to be made to later iterations of the model to sufficiently increase the accuracy.

Keywords: data analysis, drug development, sentiment analysis, text-mining

Procedia PDF Downloads 126
938 Efficient Layout-Aware Pretraining for Multimodal Form Understanding

Authors: Armineh Nourbakhsh, Sameena Shah, Carolyn Rose

Abstract:

Layout-aware language models have been used to create multimodal representations for documents that are in image form, achieving relatively high accuracy in document understanding tasks. However, the large number of parameters in the resulting models makes building and using them prohibitive without access to high-performing processing units with large memory capacity. We propose an alternative approach that can create efficient representations without the need for a neural visual backbone. This leads to an 80% reduction in the number of parameters compared to the smallest SOTA model, widely expanding applicability. In addition, our layout embeddings are pre-trained on spatial and visual cues alone and only fused with text embeddings in downstream tasks, which can facilitate applicability to low-resource of multi-lingual domains. Despite using 2.5% of training data, we show competitive performance on two form understanding tasks: semantic labeling and link prediction.

Keywords: layout understanding, form understanding, multimodal document understanding, bias-augmented attention

Procedia PDF Downloads 124
937 Roughness Discrimination Using Bioinspired Tactile Sensors

Authors: Zhengkun Yi

Abstract:

Surface texture discrimination using artificial tactile sensors has attracted increasing attentions in the past decade as it can endow technical and robot systems with a key missing ability. However, as a major component of texture, roughness has rarely been explored. This paper presents an approach for tactile surface roughness discrimination, which includes two parts: (1) design and fabrication of a bioinspired artificial fingertip, and (2) tactile signal processing for tactile surface roughness discrimination. The bioinspired fingertip is comprised of two polydimethylsiloxane (PDMS) layers, a polymethyl methacrylate (PMMA) bar, and two perpendicular polyvinylidene difluoride (PVDF) film sensors. This artificial fingertip mimics human fingertips in three aspects: (1) Elastic properties of epidermis and dermis in human skin are replicated by the two PDMS layers with different stiffness, (2) The PMMA bar serves the role analogous to that of a bone, and (3) PVDF film sensors emulate Meissner’s corpuscles in terms of both location and response to the vibratory stimuli. Various extracted features and classification algorithms including support vector machines (SVM) and k-nearest neighbors (kNN) are examined for tactile surface roughness discrimination. Eight standard rough surfaces with roughness values (Ra) of 50 μm, 25 μm, 12.5 μm, 6.3 μm 3.2 μm, 1.6 μm, 0.8 μm, and 0.4 μm are explored. The highest classification accuracy of (82.6 ± 10.8) % can be achieved using solely one PVDF film sensor with kNN (k = 9) classifier and the standard deviation feature.

Keywords: bioinspired fingertip, classifier, feature extraction, roughness discrimination

Procedia PDF Downloads 283
936 Teachers' Beliefs and Practices in Designing Negotiated English Lesson Plans

Authors: Joko Nurkamto

Abstract:

A lesson plan is a part of the planning phase in a learning and teaching system framing the scenario of pedagogical activities in the classroom. It informs a decision on what to teach and how to landscape classroom interaction. Regardless of these benefits, the writer has witnessed the fact that lesson plans are viewed merely as a teaching document. Therefore, this paper will explore teachers’ beliefs and practices in designing lesson plans. It focuses primarily on how both teachers and students negotiate lesson plans in which the students are deemed to be the agents of instructional innovations. Additionally, the paper will talk about how such lesson plans are enacted. To investigate these issues, document analysis, in-depth interviews, participant classroom observation, and focus group discussion will be deployed as data collection methods in this explorative case study. The benefits of the paper are to show different roles of lesson plans and to discover different ways to design and enact such plans from a socio-interactional perspective.

Keywords: instructional innovation, learning and teaching system, lesson plan, pedagogical activities, teachers' beliefs and practices

Procedia PDF Downloads 132
935 DocPro: A Framework for Processing Semantic and Layout Information in Business Documents

Authors: Ming-Jen Huang, Chun-Fang Huang, Chiching Wei

Abstract:

With the recent advance of the deep neural network, we observe new applications of NLP (natural language processing) and CV (computer vision) powered by deep neural networks for processing business documents. However, creating a real-world document processing system needs to integrate several NLP and CV tasks, rather than treating them separately. There is a need to have a unified approach for processing documents containing textual and graphical elements with rich formats, diverse layout arrangement, and distinct semantics. In this paper, a framework that fulfills this unified approach is presented. The framework includes a representation model definition for holding the information generated by various tasks and specifications defining the coordination between these tasks. The framework is a blueprint for building a system that can process documents with rich formats, styles, and multiple types of elements. The flexible and lightweight design of the framework can help build a system for diverse business scenarios, such as contract monitoring and reviewing.

Keywords: document processing, framework, formal definition, machine learning

Procedia PDF Downloads 187
934 An Approach for Vocal Register Recognition Based on Spectral Analysis of Singing

Authors: Aleksandra Zysk, Pawel Badura

Abstract:

Recognizing and controlling vocal registers during singing is a difficult task for beginner vocalist. It requires among others identifying which part of natural resonators is being used when a sound propagates through the body. Thus, an application has been designed allowing for sound recording, automatic vocal register recognition (VRR), and a graphical user interface providing real-time visualization of the signal and recognition results. Six spectral features are determined for each time frame and passed to the support vector machine classifier yielding a binary decision on the head or chest register assignment of the segment. The classification training and testing data have been recorded by ten professional female singers (soprano, aged 19-29) performing sounds for both chest and head register. The classification accuracy exceeded 93% in each of various validation schemes. Apart from a hard two-class clustering, the support vector classifier returns also information on the distance between particular feature vector and the discrimination hyperplane in a feature space. Such an information reflects the level of certainty of the vocal register classification in a fuzzy way. Thus, the designed recognition and training application is able to assess and visualize the continuous trend in singing in a user-friendly graphical mode providing an easy way to control the vocal emission.

Keywords: classification, singing, spectral analysis, vocal emission, vocal register

Procedia PDF Downloads 282
933 Intelligent Recognition of Diabetes Disease via FCM Based Attribute Weighting

Authors: Kemal Polat

Abstract:

In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) for classification of Diabetes disease dataset has been used. The aims of this study are to reduce the variance within attributes of diabetes dataset and to improve the classification accuracy of classifier algorithm transforming from non-linear separable datasets to linearly separable datasets. Pima Indians Diabetes dataset has two classes including normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of K-means clustering method and is one of most used clustering methods in data mining and machine learning applications. In this study, as the first stage, fuzzy C-means clustering process has been used for finding the centers of attributes in Pima Indians diabetes dataset and then weighted the dataset according to the ratios of the means of attributes to centers of theirs. Secondly, after weighting process, the classifier algorithms including support vector machine (SVM) and k-NN (k- nearest neighbor) classifiers have been used for classifying weighted Pima Indians diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) has obtained very promising results in the classification of Pima Indians diabetes dataset.

Keywords: fuzzy C-means clustering, fuzzy C-means clustering based attribute weighting, Pima Indians diabetes, SVM

Procedia PDF Downloads 386
932 Classification of Potential Biomarkers in Breast Cancer Using Artificial Intelligence Algorithms and Anthropometric Datasets

Authors: Aref Aasi, Sahar Ebrahimi Bajgani, Erfan Aasi

Abstract:

Breast cancer (BC) continues to be the most frequent cancer in females and causes the highest number of cancer-related deaths in women worldwide. Inspired by recent advances in studying the relationship between different patient attributes and features and the disease, in this paper, we have tried to investigate the different classification methods for better diagnosis of BC in the early stages. In this regard, datasets from the University Hospital Centre of Coimbra were chosen, and different machine learning (ML)-based and neural network (NN) classifiers have been studied. For this purpose, we have selected favorable features among the nine provided attributes from the clinical dataset by using a random forest algorithm. This dataset consists of both healthy controls and BC patients, and it was noted that glucose, BMI, resistin, and age have the most importance, respectively. Moreover, we have analyzed these features with various ML-based classifier methods, including Decision Tree (DT), K-Nearest Neighbors (KNN), eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM) along with NN-based Multi-Layer Perceptron (MLP) classifier. The results revealed that among different techniques, the SVM and MLP classifiers have the most accuracy, with amounts of 96% and 92%, respectively. These results divulged that the adopted procedure could be used effectively for the classification of cancer cells, and also it encourages further experimental investigations with more collected data for other types of cancers.

Keywords: breast cancer, diagnosis, machine learning, biomarker classification, neural network

Procedia PDF Downloads 102
931 Document-level Sentiment Analysis: An Exploratory Case Study of Low-resource Language Urdu

Authors: Ammarah Irum, Muhammad Ali Tahir

Abstract:

Document-level sentiment analysis in Urdu is a challenging Natural Language Processing (NLP) task due to the difficulty of working with lengthy texts in a language with constrained resources. Deep learning models, which are complex neural network architectures, are well-suited to text-based applications in addition to data formats like audio, image, and video. To investigate the potential of deep learning for Urdu sentiment analysis, we implemented five different deep learning models, including Bidirectional Long Short Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short Term Memory (CNN-BiLSTM), and Bidirectional Encoder Representation from Transformer (BERT). In this study, we developed a hybrid deep learning model called BiLSTM-Single Layer Multi Filter Convolutional Neural Network (BiLSTM-SLMFCNN) by fusing BiLSTM and CNN architecture. The proposed and baseline techniques are applied on Urdu Customer Support data set and IMDB Urdu movie review data set by using pre-trained Urdu word embedding that are suitable for sentiment analysis at the document level. Results of these techniques are evaluated and our proposed model outperforms all other deep learning techniques for Urdu sentiment analysis. BiLSTM-SLMFCNN outperformed the baseline deep learning models and achieved 83%, 79%, 83% and 94% accuracy on small, medium and large sized IMDB Urdu movie review data set and Urdu Customer Support data set respectively.

Keywords: urdu sentiment analysis, deep learning, natural language processing, opinion mining, low-resource language

Procedia PDF Downloads 41
930 Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis

Authors: Sangita Pokhrel, Nalinda Somasiri, Rebecca Jeyavadhanam, Swathi Ganesan

Abstract:

Tourism is a booming industry with huge future potential for global wealth and employment. There are countless data generated over social media sites every day, creating numerous opportunities to bring more insights to decision-makers. The integration of Big Data Technology into the tourism industry will allow companies to conclude where their customers have been and what they like. This information can then be used by businesses, such as those in charge of managing visitor centers or hotels, etc., and the tourist can get a clear idea of places before visiting. The technical perspective of natural language is processed by analysing the sentiment features of online reviews from tourists, and we then supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction of travel reviews. We have constructed a web review database using a crawler and web scraping technique for experimental validation to evaluate the effectiveness of our methodology. The text form of sentences was first classified through Vader and Roberta model to get the polarity of the reviews. In this paper, we have conducted study methods for feature extraction, such as Count Vectorization and TFIDF Vectorization, and implemented Convolutional Neural Network (CNN) classifier algorithm for the sentiment analysis to decide the tourist’s attitude towards the destinations is positive, negative, or simply neutral based on the review text that they posted online. The results demonstrated that from the CNN algorithm, after pre-processing and cleaning the dataset, we received an accuracy of 96.12% for the positive and negative sentiment analysis.

Keywords: counter vectorization, convolutional neural network, crawler, data technology, long short-term memory, web scraping, sentiment analysis

Procedia PDF Downloads 57
929 An Efficient Algorithm for Solving the Transmission Network Expansion Planning Problem Integrating Machine Learning with Mathematical Decomposition

Authors: Pablo Oteiza, Ricardo Alvarez, Mehrdad Pirnia, Fuat Can

Abstract:

To effectively combat climate change, many countries around the world have committed to a decarbonisation of their electricity, along with promoting a large-scale integration of renewable energy sources (RES). While this trend represents a unique opportunity to effectively combat climate change, achieving a sound and cost-efficient energy transition towards low-carbon power systems poses significant challenges for the multi-year Transmission Network Expansion Planning (TNEP) problem. The objective of the multi-year TNEP is to determine the necessary network infrastructure to supply the projected demand in a cost-efficient way, considering the evolution of the new generation mix, including the integration of RES. The rapid integration of large-scale RES increases the variability and uncertainty in the power system operation, which in turn increases short-term flexibility requirements. To meet these requirements, flexible generating technologies such as energy storage systems must be considered within the TNEP as well, along with proper models for capturing the operational challenges of future power systems. As a consequence, TNEP formulations are becoming more complex and difficult to solve, especially for its application in realistic-sized power system models. To meet these challenges, there is an increasing need for developing efficient algorithms capable of solving the TNEP problem with reasonable computational time and resources. In this regard, a promising research area is the use of artificial intelligence (AI) techniques for solving large-scale mixed-integer optimization problems, such as the TNEP. In particular, the use of AI along with mathematical optimization strategies based on decomposition has shown great potential. In this context, this paper presents an efficient algorithm for solving the multi-year TNEP problem. The algorithm combines AI techniques with Column Generation, a traditional decomposition-based mathematical optimization method. One of the challenges of using Column Generation for solving the TNEP problem is that the subproblems are of mixed-integer nature, and therefore solving them requires significant amounts of time and resources. Hence, in this proposal we solve a linearly relaxed version of the subproblems, and trained a binary classifier that determines the value of the binary variables, based on the results obtained from the linearized version. A key feature of the proposal is that we integrate the binary classifier into the optimization algorithm in such a way that the optimality of the solution can be guaranteed. The results of a study case based on the HRP 38-bus test system shows that the binary classifier has an accuracy above 97% for estimating the value of the binary variables. Since the linearly relaxed version of the subproblems can be solved with significantly less time than the integer programming counterpart, the integration of the binary classifier into the Column Generation algorithm allowed us to reduce the computational time required for solving the problem by 50%. The final version of this paper will contain a detailed description of the proposed algorithm, the AI-based binary classifier technique and its integration into the CG algorithm. To demonstrate the capabilities of the proposal, we evaluate the algorithm in case studies with different scenarios, as well as in other power system models.

Keywords: integer optimization, machine learning, mathematical decomposition, transmission planning

Procedia PDF Downloads 53
928 Recurrent Neural Networks with Deep Hierarchical Mixed Structures for Chinese Document Classification

Authors: Zhaoxin Luo, Michael Zhu

Abstract:

In natural languages, there are always complex semantic hierarchies. Obtaining the feature representation based on these complex semantic hierarchies becomes the key to the success of the model. Several RNN models have recently been proposed to use latent indicators to obtain the hierarchical structure of documents. However, the model that only uses a single-layer latent indicator cannot achieve the true hierarchical structure of the language, especially a complex language like Chinese. In this paper, we propose a deep layered model that stacks arbitrarily many RNN layers equipped with latent indicators. After using EM and training it hierarchically, our model solves the computational problem of stacking RNN layers and makes it possible to stack arbitrarily many RNN layers. Our deep hierarchical model not only achieves comparable results to large pre-trained models on the Chinese short text classification problem but also achieves state of art results on the Chinese long text classification problem.

Keywords: nature language processing, recurrent neural network, hierarchical structure, document classification, Chinese

Procedia PDF Downloads 40
927 Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance

Authors: Loai AbdAllah, Mahmoud Kaiyal

Abstract:

Missing values in real-world datasets are a common problem. Many algorithms were developed to deal with this problem, most of them replace the missing values with a fixed value that was computed based on the observed values. In our work, we used a distance function based on Bhattacharyya distance to measure the distance between objects with missing values. Bhattacharyya distance, which measures the similarity of two probability distributions. The proposed distance distinguishes between known and unknown values. Where the distance between two known values is the Mahalanobis distance. When, on the other hand, one of them is missing the distance is computed based on the distribution of the known values, for the coordinate that contains the missing value. This method was integrated with Wikaya, a digital health company developing a platform that helps to improve prevention of chronic diseases such as diabetes and cancer. In order for Wikaya’s recommendation system to work distance between users need to be measured. Since there are missing values in the collected data, there is a need to develop a distance function distances between incomplete users profiles. To evaluate the accuracy of the proposed distance function in reflecting the actual similarity between different objects, when some of them contain missing values, we integrated it within the framework of k nearest neighbors (kNN) classifier, since its computation is based only on the similarity between objects. To validate this, we ran the algorithm over diabetes and breast cancer datasets, standard benchmark datasets from the UCI repository. Our experiments show that kNN classifier using our proposed distance function outperforms the kNN using other existing methods.

Keywords: missing values, incomplete data, distance, incomplete diabetes data

Procedia PDF Downloads 194
926 Multi-source Question Answering Framework Using Transformers for Attribute Extraction

Authors: Prashanth Pillai, Purnaprajna Mangsuli

Abstract:

Oil exploration and production companies invest considerable time and efforts to extract essential well attributes (like well status, surface, and target coordinates, wellbore depths, event timelines, etc.) from unstructured data sources like technical reports, which are often non-standardized, multimodal, and highly domain-specific by nature. It is also important to consider the context when extracting attribute values from reports that contain information on multiple wells/wellbores. Moreover, semantically similar information may often be depicted in different data syntax representations across multiple pages and document sources. We propose a hierarchical multi-source fact extraction workflow based on a deep learning framework to extract essential well attributes at scale. An information retrieval module based on the transformer architecture was used to rank relevant pages in a document source utilizing the page image embeddings and semantic text embeddings. A question answering framework utilizingLayoutLM transformer was used to extract attribute-value pairs incorporating the text semantics and layout information from top relevant pages in a document. To better handle context while dealing with multi-well reports, we incorporate a dynamic query generation module to resolve ambiguities. The extracted attribute information from various pages and documents are standardized to a common representation using a parser module to facilitate information comparison and aggregation. Finally, we use a probabilistic approach to fuse information extracted from multiple sources into a coherent well record. The applicability of the proposed approach and related performance was studied on several real-life well technical reports.

Keywords: natural language processing, deep learning, transformers, information retrieval

Procedia PDF Downloads 170
925 Detection of Powdery Mildew Disease in Strawberry Using Image Texture and Supervised Classifiers

Authors: Sultan Mahmud, Qamar Zaman, Travis Esau, Young Chang

Abstract:

Strawberry powdery mildew (PM) is a serious disease that has a significant impact on strawberry production. Field scouting is still a major way to find PM disease, which is not only labor intensive but also almost impossible to monitor disease severity. To reduce the loss caused by PM disease and achieve faster automatic detection of the disease, this paper proposes an approach for detection of the disease, based on image texture and classified with support vector machines (SVMs) and k-nearest neighbors (kNNs). The methodology of the proposed study is based on image processing which is composed of five main steps including image acquisition, pre-processing, segmentation, features extraction and classification. Two strawberry fields were used in this study. Images of healthy leaves and leaves infected with PM (Sphaerotheca macularis) disease under artificial cloud lighting condition. Colour thresholding was utilized to segment all images before textural analysis. Colour co-occurrence matrix (CCM) was introduced for extraction of textural features. Forty textural features, related to a physiological parameter of leaves were extracted from CCM of National television system committee (NTSC) luminance, hue, saturation and intensity (HSI) images. The normalized feature data were utilized for training and validation, respectively, using developed classifiers. The classifiers have experimented with internal, external and cross-validations. The best classifier was selected based on their performance and accuracy. Experimental results suggested that SVMs classifier showed 98.33%, 85.33%, 87.33%, 93.33% and 95.0% of accuracy on internal, external-I, external-II, 4-fold cross and 5-fold cross-validation, respectively. Whereas, kNNs results represented 90.0%, 72.00%, 74.66%, 89.33% and 90.3% of classification accuracy, respectively. The outcome of this study demonstrated that SVMs classified PM disease with a highest overall accuracy of 91.86% and 1.1211 seconds of processing time. Therefore, overall results concluded that the proposed study can significantly support an accurate and automatic identification and recognition of strawberry PM disease with SVMs classifier.

Keywords: powdery mildew, image processing, textural analysis, color co-occurrence matrix, support vector machines, k-nearest neighbors

Procedia PDF Downloads 100
924 An Evaluation of 6th Grade History Curriculum in Ghana

Authors: Abigail Amoako Kayser, Brian Kayser

Abstract:

This study aimed to examine Ghana's 6th-grade Basic School history curriculum to determine how Ghanaian history is taught. We used qualitative methods and document analysis. The document analysis served two primary purposes: (1) To gain insight into what the curriculum materials covered and from whom's perspectives, and (2) To triangulate with teacher interview data. Documents obtained included: (1) Textbooks used by 6th-grade students, (2) Teacher pacing guide provided by the Department of Education in Ghana, and (3) Student work samples. This study was guided through Post-colonial theory and criticisms to explore the remnants of colonial power and hegemony that persist in history curricula used in public schools in Ghana. We also applied African Feminist Thought and Black Feminist Thought to unpack the extent to which issues of patriarchy, race, traditions, underdevelopment, and sexuality impact how we see the experiences of people on the continent. The findings indicated that the remnant of colonial rule persisted in the contents of the history curriculum, and the atrocities of slavery were overlooked or eliminated from the curriculum. The findings also indicated that Ghana's history centered on men's experiences.

Keywords: history, curriculum, decolonialization, culturally relevant pedagogy

Procedia PDF Downloads 43
923 Secure Text Steganography for Microsoft Word Document

Authors: Khan Farhan Rafat, M. Junaid Hussain

Abstract:

Seamless modification of an entity for the purpose of hiding a message of significance inside its substance in a manner that the embedding remains oblivious to an observer is known as steganography. Together with today's pervasive registering frameworks, steganography has developed into a science that offers an assortment of strategies for stealth correspondence over the globe that must, however, need a critical appraisal from security breach standpoint. Microsoft Word is amongst the preferably used word processing software, which comes as a part of the Microsoft Office suite. With a user-friendly graphical interface, the richness of text editing, and formatting topographies, the documents produced through this software are also most suitable for stealth communication. This research aimed not only to epitomize the fundamental concepts of steganography but also to expound on the utilization of Microsoft Word document as a carrier for furtive message exchange. The exertion is to examine contemporary message hiding schemes from security aspect so as to present the explorative discoveries and suggest enhancements which may serve a wellspring of information to encourage such futuristic research endeavors.

Keywords: hiding information in plain sight, stealth communication, oblivious information exchange, conceal, steganography

Procedia PDF Downloads 218
922 Bias Prevention in Automated Diagnosis of Melanoma: Augmentation of a Convolutional Neural Network Classifier

Authors: Kemka Ihemelandu, Chukwuemeka Ihemelandu

Abstract:

Melanoma remains a public health crisis, with incidence rates increasing rapidly in the past decades. Improving diagnostic accuracy to decrease misdiagnosis using Artificial intelligence (AI) continues to be documented. Unfortunately, unintended racially biased outcomes, a product of lack of diversity in the dataset used, with a noted class imbalance favoring lighter vs. darker skin tone, have increasingly been recognized as a problem.Resulting in noted limitations of the accuracy of the Convolutional neural network (CNN)models. CNN models are prone to biased output due to biases in the dataset used to train them. Our aim in this study was the optimization of convolutional neural network algorithms to mitigate bias in the automated diagnosis of melanoma. We hypothesized that our proposed training algorithms based on a data augmentation method to optimize the diagnostic accuracy of a CNN classifier by generating new training samples from the original ones will reduce bias in the automated diagnosis of melanoma. We applied geometric transformation, including; rotations, translations, scale change, flipping, and shearing. Resulting in a CNN model that provided a modifiedinput data making for a model that could learn subtle racial features. Optimal selection of the momentum and batch hyperparameter increased our model accuracy. We show that our augmented model reduces bias while maintaining accuracy in the automated diagnosis of melanoma.

Keywords: bias, augmentation, melanoma, convolutional neural network

Procedia PDF Downloads 177
921 MIMIC: A Multi Input Micro-Influencers Classifier

Authors: Simone Leonardi, Luca Ardito

Abstract:

Micro-influencers are effective elements in the marketing strategies of companies and institutions because of their capability to create an hyper-engaged audience around a specific topic of interest. In recent years, many scientific approaches and commercial tools have handled the task of detecting this type of social media users. These strategies adopt solutions ranging from rule based machine learning models to deep neural networks and graph analysis on text, images, and account information. This work compares the existing solutions and proposes an ensemble method to generalize them with different input data and social media platforms. The deployed solution combines deep learning models on unstructured data with statistical machine learning models on structured data. We retrieve both social media accounts information and multimedia posts on Twitter and Instagram. These data are mapped into feature vectors for an eXtreme Gradient Boosting (XGBoost) classifier. Sixty different topics have been analyzed to build a rule based gold standard dataset and to compare the performances of our approach against baseline classifiers. We prove the effectiveness of our work by comparing the accuracy, precision, recall, and f1 score of our model with different configurations and architectures. We obtained an accuracy of 0.91 with our best performing model.

Keywords: deep learning, gradient boosting, image processing, micro-influencers, NLP, social media

Procedia PDF Downloads 146
920 Quantitative Texture Analysis of Shoulder Sonography for Rotator Cuff Lesion Classification

Authors: Chung-Ming Lo, Chung-Chien Lee

Abstract:

In many countries, the lifetime prevalence of shoulder pain is up to 70%. In America, the health care system spends 7 billion per year about the healthy issues of shoulder pain. With respect to the origin, up to 70% of shoulder pain is attributed to rotator cuff lesions This study proposed a computer-aided diagnosis (CAD) system to assist radiologists classifying rotator cuff lesions with less operator dependence. Quantitative features were extracted from the shoulder ultrasound images acquired using an ALOKA alpha-6 US scanner (Hitachi-Aloka Medical, Tokyo, Japan) with linear array probe (scan width: 36mm) ranging from 5 to 13 MHz. During examination, the postures of the examined patients are standard sitting position and are followed by the regular routine. After acquisition, the shoulder US images were drawn out from the scanner and stored as 8-bit images with pixel value ranging from 0 to 255. Upon the sonographic appearance, the boundary of each lesion was delineated by a physician to indicate the specific pattern for analysis. The three lesion categories for classification were composed of 20 cases of tendon inflammation, 18 cases of calcific tendonitis, and 18 cases of supraspinatus tear. For each lesion, second-order statistics were quantified in the feature extraction. The second-order statistics were the texture features describing the correlations between adjacent pixels in a lesion. Because echogenicity patterns were expressed via grey-scale. The grey-scale co-occurrence matrixes with four angles of adjacent pixels were used. The texture metrics included the mean and standard deviation of energy, entropy, correlation, inverse different moment, inertia, cluster shade, cluster prominence, and Haralick correlation. Then, the quantitative features were combined in a multinomial logistic regression classifier to generate a prediction model of rotator cuff lesions. Multinomial logistic regression classifier is widely used in the classification of more than two categories such as the three lesion types used in this study. In the classifier, backward elimination was used to select a feature subset which is the most relevant. They were selected from the trained classifier with the lowest error rate. Leave-one-out cross-validation was used to evaluate the performance of the classifier. Each case was left out of the total cases and used to test the trained result by the remaining cases. According to the physician’s assessment, the performance of the proposed CAD system was shown by the accuracy. As a result, the proposed system achieved an accuracy of 86%. A CAD system based on the statistical texture features to interpret echogenicity values in shoulder musculoskeletal ultrasound was established to generate a prediction model for rotator cuff lesions. Clinically, it is difficult to distinguish some kinds of rotator cuff lesions, especially partial-thickness tear of rotator cuff. The shoulder orthopaedic surgeon and musculoskeletal radiologist reported greater diagnostic test accuracy than general radiologist or ultrasonographers based on the available literature. Consequently, the proposed CAD system which was developed according to the experiment of the shoulder orthopaedic surgeon can provide reliable suggestions to general radiologists or ultrasonographers. More quantitative features related to the specific patterns of different lesion types would be investigated in the further study to improve the prediction.

Keywords: shoulder ultrasound, rotator cuff lesions, texture, computer-aided diagnosis

Procedia PDF Downloads 256