Search results for: term frequency-inverse document frequency
8567 A Quantitative Evaluation of Text Feature Selection Methods
Authors: B. S. Harish, M. B. Revanasiddappa
Abstract:
Due to rapid growth of text documents in digital form, automated text classification has become an important research in the last two decades. The major challenge of text document representations are high dimension, sparsity, volume and semantics. Since the terms are only features that can be found in documents, selection of good terms (features) plays an very important role. In text classification, feature selection is a strategy that can be used to improve classification effectiveness, computational efficiency and accuracy. In this paper, we present a quantitative analysis of most widely used feature selection (FS) methods, viz. Term Frequency-Inverse Document Frequency (tfidf ), Mutual Information (MI), Information Gain (IG), CHISquare (x2), Term Frequency-Relevance Frequency (tfrf ), Term Strength (TS), Ambiguity Measure (AM) and Symbolic Feature Selection (SFS) to classify text documents. We evaluated all the feature selection methods on standard datasets like 20 Newsgroups, 4 University dataset and Reuters-21578.Keywords: classifiers, feature selection, text classification
Procedia PDF Downloads 4588566 Investigation of Topic Modeling-Based Semi-Supervised Interpretable Document Classifier
Authors: Dasom Kim, William Xiu Shun Wong, Yoonjin Hyun, Donghoon Lee, Minji Paek, Sungho Byun, Namgyu Kim
Abstract:
There have been many researches on document classification for classifying voluminous documents automatically. Through document classification, we can assign a specific category to each unlabeled document on the basis of various machine learning algorithms. However, providing labeled documents manually requires considerable time and effort. To overcome the limitations, the semi-supervised learning which uses unlabeled document as well as labeled documents has been invented. However, traditional document classifiers, regardless of supervised or semi-supervised ones, cannot sufficiently explain the reason or the process of the classification. Thus, in this paper, we proposed a methodology to visualize major topics and class components of each document. We believe that our methodology for visualizing topics and classes of each document can enhance the reliability and explanatory power of document classifiers.Keywords: data mining, document classifier, text mining, topic modeling
Procedia PDF Downloads 4028565 Modified InVEST for Whatsapp Messages Forensic Triage and Search through Visualization
Authors: Agria Rhamdhan
Abstract:
WhatsApp as the most popular mobile messaging app has been used as evidence in many criminal cases. As the use of mobile messages generates large amounts of data, forensic investigation faces the challenge of large data problems. The hardest part of finding this important evidence is because current practice utilizes tools and technique that require manual analysis to check all messages. That way, analyze large sets of mobile messaging data will take a lot of time and effort. Our work offers methodologies based on forensic triage to reduce large data to manageable sets resulting easier to do detailed reviews, then show the results through interactive visualization to show important term, entities and relationship through intelligent ranking using Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA) Model. By implementing this methodology, investigators can improve investigation processing time and result's accuracy.Keywords: forensics, triage, visualization, WhatsApp
Procedia PDF Downloads 1668564 Incremental Learning of Independent Topic Analysis
Authors: Takahiro Nishigaki, Katsumi Nitta, Takashi Onoda
Abstract:
In this paper, we present a method of applying Independent Topic Analysis (ITA) to increasing the number of document data. The number of document data has been increasing since the spread of the Internet. ITA was presented as one method to analyze the document data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis (ICA). ICA is a technique in the signal processing; however, it is difficult to apply the ITA to increasing number of document data. Because ITA must use the all document data so temporal and spatial cost is very high. Therefore, we present Incremental ITA which extracts the independent topics from increasing number of document data. Incremental ITA is a method of updating the independent topics when the document data is added after extracted the independent topics from a just previous the data. In addition, Incremental ITA updates the independent topics when the document data is added. And we show the result applied Incremental ITA to benchmark datasets.Keywords: text mining, topic extraction, independent, incremental, independent component analysis
Procedia PDF Downloads 3078563 Distorted Document Images Dataset for Text Detection and Recognition
Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan
Abstract:
With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images become challenging tasks. However, according to our best knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of the DDI-100 in its current state. To create the dataset we collected 7000 unique document pages, and extend it by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks, text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 to improve accuracy and stability of the considered TD and OCR models.Keywords: document analysis, open dataset, optical character recognition, text detection
Procedia PDF Downloads 1708562 Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition
Authors: L. Hamsaveni, Navya Prakash, Suresha
Abstract:
Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document images to obtain an original document with complete information. In case, degraded document image captured is skewed, it has to be straightened (deskew) to perform further process. A special format of image storing known as YCbCr is used as a tool to convert the Grayscale image to RGB image format. The presented algorithm is tested on various types of degraded documents such as printed documents, handwritten documents, old script documents and handwritten image sketches in documents. The purpose of this research is to obtain an original document for a given set of degraded documents of the same source.Keywords: grayscale image format, image fusing, RGB image format, SURF detection, YCbCr image format
Procedia PDF Downloads 3768561 Analysing the Variables That Affect Digital Game-Based L2 Vocabulary Learning
Authors: Jose Ramon Calvo-Ferrer
Abstract:
Video games have been extensively employed in educational contexts to teach contents and skills, upon the premise that they engage students and provide instant feedback, which makes them adequate tools in the field of education and training. Term frequency, along with metacognition and implicit corrective feedback, has often been identified as powerful variables in the learning of vocabulary in a foreign language. This study analyses the learning of L2 mobile operating system terminology by a group of students and uses the data collected by the video game The Conference Interpreter to identify the predictive strength of term frequency (times a term is shown), positive metacognition (times a right answer is provided), and negative metacognition (times a term is shown as wrong) regarding L2 vocabulary learning and perceived learning outcomes. The regression analysis shows that the factor ‘positive metacognition’ is a positive predictor of both dependent variables, whereas the other factors seem to have no statistical effect on any of them.Keywords: digital game-based learning, feedback, metacognition, frequency, video games
Procedia PDF Downloads 1548560 Model-Based Field Extraction from Different Class of Administrative Documents
Authors: Jinen Daghrir, Anis Kricha, Karim Kalti
Abstract:
The amount of incoming administrative documents is massive and manually processing these documents is a costly task especially on the timescale. In fact, this problem has led an important amount of research and development in the context of automatically extracting fields from administrative documents, in order to reduce the charges and to increase the citizen satisfaction in administrations. In this matter, we introduce an administrative document understanding system. Given a document in which a user has to select fields that have to be retrieved from a document class, a document model is automatically built. A document model is represented by an attributed relational graph (ARG) where nodes represent fields to extract, and edges represent the relation between them. Both of vertices and edges are attached with some feature vectors. When another document arrives to the system, the layout objects are extracted and an ARG is generated. The fields extraction is translated into a problem of matching two ARGs which relies mainly on the comparison of the spatial relationships between layout objects. Experimental results yield accuracy rates from 75% to 100% tested on eight document classes. Our proposed method has a good performance knowing that the document model is constructed using only one single document.Keywords: administrative document understanding, logical labelling, logical layout analysis, fields extraction from administrative documents
Procedia PDF Downloads 2128559 Binarization and Recognition of Characters from Historical Degraded Documents
Authors: Bency Jacob, S.B. Waykar
Abstract:
Degradations in historical document images appear due to aging of the documents. It is very difficult to understand and retrieve text from badly degraded documents as there is variation between the document foreground and background. Thresholding of such document images either result in broken characters or detection of false texts. Numerous algorithms exist that can separate text and background efficiently in the textual regions of the document; but portions of background are mistaken as text in areas that hardly contain any text. This paper presents a way to overcome these problems by a robust binarization technique that recovers the text from a severely degraded document images and thereby increases the accuracy of optical character recognition systems. The proposed document recovery algorithm efficiently removes degradations from document images. Here we are using the ostus method ,local thresholding and global thresholding and after the binarization training and recognizing the characters in the degraded documents.Keywords: binarization, denoising, global thresholding, local thresholding, thresholding
Procedia PDF Downloads 3428558 Machine Learning Automatic Detection on Twitter Cyberbullying
Authors: Raghad A. Altowairgi
Abstract:
With the wide spread of social media platforms, young people tend to use them extensively as the first means of communication due to their ease and modernity. But these platforms often create a fertile ground for bullies to practice their aggressive behavior against their victims. Platform usage cannot be reduced, but intelligent mechanisms can be implemented to reduce the abuse. This is where machine learning comes in. Understanding and classifying text can be helpful in order to minimize the act of cyberbullying. Artificial intelligence techniques have expanded to formulate an applied tool to address the phenomenon of cyberbullying. In this research, machine learning models are built to classify text into two classes; cyberbullying and non-cyberbullying. After preprocessing the data in 4 stages; removing characters that do not provide meaningful information to the models, tokenization, removing stop words, and lowering text. BoW and TF-IDF are used as the main features for the five classifiers, which are; logistic regression, Naïve Bayes, Random Forest, XGboost, and Catboost classifiers. Each of them scores 92%, 90%, 92%, 91%, 86% respectively.Keywords: cyberbullying, machine learning, Bag-of-Words, term frequency-inverse document frequency, natural language processing, Catboost
Procedia PDF Downloads 1308557 Using the Smith-Waterman Algorithm to Extract Features in the Classification of Obesity Status
Authors: Rosa Figueroa, Christopher Flores
Abstract:
Text categorization is the problem of assigning a new document to a set of predetermined categories, on the basis of a training set of free-text data that contains documents whose category membership is known. To train a classification model, it is necessary to extract characteristics in the form of tokens that facilitate the learning and classification process. In text categorization, the feature extraction process involves the use of word sequences also known as N-grams. In general, it is expected that documents belonging to the same category share similar features. The Smith-Waterman (SW) algorithm is a dynamic programming algorithm that performs a local sequence alignment in order to determine similar regions between two strings or protein sequences. This work explores the use of SW algorithm as an alternative to feature extraction in text categorization. The dataset used for this purpose, contains 2,610 annotated documents with the classes Obese/Non-Obese. This dataset was represented in a matrix form using the Bag of Word approach. The score selected to represent the occurrence of the tokens in each document was the term frequency-inverse document frequency (TF-IDF). In order to extract features for classification, four experiments were conducted: the first experiment used SW to extract features, the second one used unigrams (single word), the third one used bigrams (two word sequence) and the last experiment used a combination of unigrams and bigrams to extract features for classification. To test the effectiveness of the extracted feature set for the four experiments, a Support Vector Machine (SVM) classifier was tuned using 20% of the dataset. The remaining 80% of the dataset together with 5-Fold Cross Validation were used to evaluate and compare the performance of the four experiments of feature extraction. Results from the tuning process suggest that SW performs better than the N-gram based feature extraction. These results were confirmed by using the remaining 80% of the dataset, where SW performed the best (accuracy = 97.10%, weighted average F-measure = 97.07%). The second best was obtained by the combination of unigrams-bigrams (accuracy = 96.04, weighted average F-measure = 95.97) closely followed by the bigrams (accuracy = 94.56%, weighted average F-measure = 94.46%) and finally unigrams (accuracy = 92.96%, weighted average F-measure = 92.90%).Keywords: comorbidities, machine learning, obesity, Smith-Waterman algorithm
Procedia PDF Downloads 2968556 Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis
Authors: Sangita Pokhrel, Nalinda Somasiri, Rebecca Jeyavadhanam, Swathi Ganesan
Abstract:
Tourism is a booming industry with huge future potential for global wealth and employment. There are countless data generated over social media sites every day, creating numerous opportunities to bring more insights to decision-makers. The integration of Big Data Technology into the tourism industry will allow companies to conclude where their customers have been and what they like. This information can then be used by businesses, such as those in charge of managing visitor centers or hotels, etc., and the tourist can get a clear idea of places before visiting. The technical perspective of natural language is processed by analysing the sentiment features of online reviews from tourists, and we then supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction of travel reviews. We have constructed a web review database using a crawler and web scraping technique for experimental validation to evaluate the effectiveness of our methodology. The text form of sentences was first classified through Vader and Roberta model to get the polarity of the reviews. In this paper, we have conducted study methods for feature extraction, such as Count Vectorization and TFIDF Vectorization, and implemented Convolutional Neural Network (CNN) classifier algorithm for the sentiment analysis to decide the tourist’s attitude towards the destinations is positive, negative, or simply neutral based on the review text that they posted online. The results demonstrated that from the CNN algorithm, after pre-processing and cleaning the dataset, we received an accuracy of 96.12% for the positive and negative sentiment analysis.Keywords: counter vectorization, convolutional neural network, crawler, data technology, long short-term memory, web scraping, sentiment analysis
Procedia PDF Downloads 868555 Comprehensive Analysis of Electrohysterography Signal Features in Term and Preterm Labor
Authors: Zhihui Liu, Dongmei Hao, Qian Qiu, Yang An, Lin Yang, Song Zhang, Yimin Yang, Xuwen Li, Dingchang Zheng
Abstract:
Premature birth, defined as birth before 37 completed weeks of gestation is a leading cause of neonatal morbidity and mortality and has long-term adverse consequences for health. It has recently been reported that the worldwide preterm birth rate is around 10%. The existing measurement techniques for diagnosing preterm delivery include tocodynamometer, ultrasound and fetal fibronectin. However, they are subjective, or suffer from high measurement variability and inaccurate diagnosis and prediction of preterm labor. Electrohysterography (EHG) method based on recording of uterine electrical activity by electrodes attached to maternal abdomen, is a promising method to assess uterine activity and diagnose preterm labor. The purpose of this study is to analyze the difference of EHG signal features between term labor and preterm labor. Free access database was used with 300 signals acquired in two groups of pregnant women who delivered at term (262 cases) and preterm (38 cases). Among them, EHG signals from 38 term labor and 38 preterm labor were preprocessed with band-pass Butterworth filters of 0.08–4Hz. Then, EHG signal features were extracted, which comprised classical time domain description including root mean square and zero-crossing number, spectral parameters including peak frequency, mean frequency and median frequency, wavelet packet coefficients, autoregression (AR) model coefficients, and nonlinear measures including maximal Lyapunov exponent, sample entropy and correlation dimension. Their statistical significance for recognition of two groups of recordings was provided. The results showed that mean frequency of preterm labor was significantly smaller than term labor (p < 0.05). 5 coefficients of AR model showed significant difference between term labor and preterm labor. The maximal Lyapunov exponent of early preterm (time of recording < the 26th week of gestation) was significantly smaller than early term. The sample entropy of late preterm (time of recording > the 26th week of gestation) was significantly smaller than late term. There was no significant difference for other features between the term labor and preterm labor groups. Any future work regarding classification should therefore focus on using multiple techniques, with the mean frequency, AR coefficients, maximal Lyapunov exponent and the sample entropy being among the prime candidates. Even if these methods are not yet useful for clinical practice, they do bring the most promising indicators for the preterm labor.Keywords: electrohysterogram, feature, preterm labor, term labor
Procedia PDF Downloads 5708554 Product Features Extraction from Opinions According to Time
Authors: Kamal Amarouche, Houda Benbrahim, Ismail Kassou
Abstract:
Nowadays, e-commerce shopping websites have experienced noticeable growth. These websites have gained consumers’ trust. After purchasing a product, many consumers share comments where opinions are usually embedded about the given product. Research on the automatic management of opinions that gives suggestions to potential consumers and portrays an image of the product to manufactures has been growing recently. After launching the product in the market, the reviews generated around it do not usually contain helpful information or generic opinions about this product (e.g. telephone: great phone...); in the sense that the product is still in the launching phase in the market. Within time, the product becomes old. Therefore, consumers perceive the advantages/ disadvantages about each specific product feature. Therefore, they will generate comments that contain their sentiments about these features. In this paper, we present an unsupervised method to extract different product features hidden in the opinions which influence its purchase, and that combines Time Weighting (TW) which depends on the time opinions were expressed with Term Frequency-Inverse Document Frequency (TF-IDF). We conduct several experiments using two different datasets about cell phones and hotels. The results show the effectiveness of our automatic feature extraction, as well as its domain independent characteristic.Keywords: opinion mining, product feature extraction, sentiment analysis, SentiWordNet
Procedia PDF Downloads 4098553 Multimedia Data Fusion for Event Detection in Twitter by Using Dempster-Shafer Evidence Theory
Authors: Samar M. Alqhtani, Suhuai Luo, Brian Regan
Abstract:
Data fusion technology can be the best way to extract useful information from multiple sources of data. It has been widely applied in various applications. This paper presents a data fusion approach in multimedia data for event detection in twitter by using Dempster-Shafer evidence theory. The methodology applies a mining algorithm to detect the event. There are two types of data in the fusion. The first is features extracted from text by using the bag-ofwords method which is calculated using the term frequency-inverse document frequency (TF-IDF). The second is the visual features extracted by applying scale-invariant feature transform (SIFT). The Dempster - Shafer theory of evidence is applied in order to fuse the information from these two sources. Our experiments have indicated that comparing to the approaches using individual data source, the proposed data fusion approach can increase the prediction accuracy for event detection. The experimental result showed that the proposed method achieved a high accuracy of 0.97, comparing with 0.93 with texts only, and 0.86 with images only.Keywords: data fusion, Dempster-Shafer theory, data mining, event detection
Procedia PDF Downloads 4108552 Estimation of Dynamic Characteristics of a Middle Rise Steel Reinforced Concrete Building Using Long-Term
Authors: Fumiya Sugino, Naohiro Nakamura, Yuji Miyazu
Abstract:
In earthquake resistant design of buildings, evaluation of vibration characteristics is important. In recent years, due to the increment of super high-rise buildings, the evaluation of response is important for not only the first mode but also higher modes. The knowledge of vibration characteristics in buildings is mostly limited to the first mode and the knowledge of higher modes is still insufficient. In this paper, using earthquake observation records of a SRC building by applying frequency filter to ARX model, characteristics of first and second modes were studied. First, we studied the change of the eigen frequency and the damping ratio during the 3.11 earthquake. The eigen frequency gradually decreases from the time of earthquake occurrence, and it is almost stable after about 150 seconds have passed. At this time, the decreasing rates of the 1st and 2nd eigen frequencies are both about 0.7. Although the damping ratio has more large error than the eigen frequency, both the 1st and 2nd damping ratio are 3 to 5%. Also, there is a strong correlation between the 1st and 2nd eigen frequency, and the regression line is y=3.17x. In the damping ratio, the regression line is y=0.90x. Therefore 1st and 2nd damping ratios are approximately the same degree. Next, we study the eigen frequency and damping ratio from 1998 after 3.11 earthquakes, the final year is 2014. In all the considered earthquakes, they are connected in order of occurrence respectively. The eigen frequency slowly declined from immediately after completion, and tend to stabilize after several years. Although it has declined greatly after the 3.11 earthquake. Both the decresing rate of the 1st and 2nd eigen frequencies until about 7 years later are about 0.8. For the damping ratio, both the 1st and 2nd are about 1 to 6%. After the 3.11 earthquake, the 1st increases by about 1% and the 2nd increases by less than 1%. For the eigen frequency, there is a strong correlation between the 1st and 2nd, and the regression line is y=3.17x. For the damping ratio, the regression line is y=1.01x. Therefore, it can be said that the 1st and 2nd damping ratio is approximately the same degree. Based on the above results, changes in eigen frequency and damping ratio are summarized as follows. In the long-term study of the eigen frequency, both the 1st and 2nd gradually declined from immediately after completion, and tended to stabilize after a few years. Further it declined after the 3.11 earthquake. In addition, there is a strong correlation between the 1st and 2nd, and the declining time and the decreasing rate are the same degree. In the long-term study of the damping ratio, both the 1st and 2nd are about 1 to 6%. After the 3.11 earthquake, the 1st increases by about 1%, the 2nd increases by less than 1%. Also, the 1st and 2nd are approximately the same degree.Keywords: eigenfrequency, damping ratio, ARX model, earthquake observation records
Procedia PDF Downloads 2168551 Predicting Success and Failure in Drug Development Using Text Analysis
Authors: Zhi Hao Chow, Cian Mulligan, Jack Walsh, Antonio Garzon Vico, Dimitar Krastev
Abstract:
Drug development is resource-intensive, time-consuming, and increasingly expensive with each developmental stage. The success rates of drug development are also relatively low, and the resources committed are wasted with each failed candidate. As such, a reliable method of predicting the success of drug development is in demand. The hypothesis was that some examples of failed drug candidates are pushed through developmental pipelines based on false confidence and may possess common linguistic features identifiable through sentiment analysis. Here, the concept of using text analysis to discover such features in research publications and investor reports as predictors of success was explored. R studios were used to perform text mining and lexicon-based sentiment analysis to identify affective phrases and determine their frequency in each document, then using SPSS to determine the relationship between our defined variables and the accuracy of predicting outcomes. A total of 161 publications were collected and categorised into 4 groups: (i) Cancer treatment, (ii) Neurodegenerative disease treatment, (iii) Vaccines, and (iv) Others (containing all other drugs that do not fit into the 3 categories). Text analysis was then performed on each document using 2 separate datasets (BING and AFINN) in R within the category of drugs to determine the frequency of positive or negative phrases in each document. A relative positivity and negativity value were then calculated by dividing the frequency of phrases with the word count of each document. Regression analysis was then performed with SPSS statistical software on each dataset (values from using BING or AFINN dataset during text analysis) using a random selection of 61 documents to construct a model. The remaining documents were then used to determine the predictive power of the models. Model constructed from BING predicts the outcome of drug performance in clinical trials with an overall percentage of 65.3%. AFINN model had a lower accuracy at predicting outcomes compared to the BING model at 62.5% but was not effective at predicting the failure of drugs in clinical trials. Overall, the study did not show significant efficacy of the model at predicting outcomes of drugs in development. Many improvements may need to be made to later iterations of the model to sufficiently increase the accuracy.Keywords: data analysis, drug development, sentiment analysis, text-mining
Procedia PDF Downloads 1578550 Using Closed Frequent Itemsets for Hierarchical Document Clustering
Authors: Cheng-Jhe Lee, Chiun-Chieh Hsu
Abstract:
Due to the rapid development of the Internet and the increased availability of digital documents, the excessive information on the Internet has led to information overflow problem. In order to solve these problems for effective information retrieval, document clustering in text mining becomes a popular research topic. Clustering is the unsupervised classification of data items into groups without the need of training data. Many conventional document clustering methods perform inefficiently for large document collections because they were originally designed for relational database. Therefore they are impractical in real-world document clustering and require special handling for high dimensionality and high volume. We propose the FIHC (Frequent Itemset-based Hierarchical Clustering) method, which is a hierarchical clustering method developed for document clustering, where the intuition of FIHC is that there exist some common words for each cluster. FIHC uses such words to cluster documents and builds hierarchical topic tree. In this paper, we combine FIHC algorithm with ontology to solve the semantic problem and mine the meaning behind the words in documents. Furthermore, we use the closed frequent itemsets instead of only use frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than those of well-known document clustering algorithms.Keywords: FIHC, documents clustering, ontology, closed frequent itemset
Procedia PDF Downloads 3988549 Analysis of Pollution Caused by the Animal Feed Industry and the Fertilizer Industry Using Rock Magnetic Method
Authors: Kharina Budiman, Adinda Syifa Azhari, Eleonora Agustine
Abstract:
Industrial activities get increase in this globalization era, one of the major impacts of industrial activities is a problem to the environment. This can happen because at the industrial production term will bring out pollutant in the shape of solid, liquid or gas. Normally this pollutant came from some dangerous materials for environment. However not every industry produces the same amount of pollutant, every industry produces different kind of pollution. To compare the pollution impact of industrial activities, soil sample has been taken around the animal feed industry and the fertilizer industry. This study applied the rock magnetic method and used Bartington MS2B to measured magnetic susceptibility (χ) as the physical parameter. This study tested soil samples using the value of susceptibility low frequency (χ lf) and Frequency Dependent (χ FD). Samples only taken in the soil surface with 0-5 cm depth and sampling interval was 20 cm. The animal feed factory has susceptibility low frequency (χ lf) = 111,9 – 325,7 and Frequency Dependent (χ FD) = 0,8 – 3,57 %. And the fertilizer factory has susceptibility low frequency (χ lf) = 187,1 – 494,8 and Frequency Dependent (χ FD) = 1,37 – 2,46 %. Based on the results, the highest value of susceptibility low frequency (χ lf) is the fertilizer factory, but the highest value of Frequency Dependent (FD) is the animal feed factory.Keywords: industrial, pollution, magnetic susceptibility, χlf, χfd, animal feed industry and fertilizer industry
Procedia PDF Downloads 4008548 A Proposed Approach for Emotion Lexicon Enrichment
Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees
Abstract:
Document Analysis is an important research field that aims to gather the information by analyzing the data in documents. As one of the important targets for many fields is to understand what people actually want, sentimental analysis field has been one of the vital fields that are tightly related to the document analysis. This research focuses on analyzing text documents to classify each document according to its opinion. The aim of this research is to detect the emotions from text documents based on enriching the lexicon with adapting their content based on semantic patterns extraction. The proposed approach has been presented, and different experiments are applied by different perspectives to reveal the positive impact of the proposed approach on the classification results.Keywords: document analysis, sentimental analysis, emotion detection, WEKA tool, NRC lexicon
Procedia PDF Downloads 4418547 Document-level Sentiment Analysis: An Exploratory Case Study of Low-resource Language Urdu
Authors: Ammarah Irum, Muhammad Ali Tahir
Abstract:
Document-level sentiment analysis in Urdu is a challenging Natural Language Processing (NLP) task due to the difficulty of working with lengthy texts in a language with constrained resources. Deep learning models, which are complex neural network architectures, are well-suited to text-based applications in addition to data formats like audio, image, and video. To investigate the potential of deep learning for Urdu sentiment analysis, we implemented five different deep learning models, including Bidirectional Long Short Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short Term Memory (CNN-BiLSTM), and Bidirectional Encoder Representation from Transformer (BERT). In this study, we developed a hybrid deep learning model called BiLSTM-Single Layer Multi Filter Convolutional Neural Network (BiLSTM-SLMFCNN) by fusing BiLSTM and CNN architecture. The proposed and baseline techniques are applied on Urdu Customer Support data set and IMDB Urdu movie review data set by using pre-trained Urdu word embedding that are suitable for sentiment analysis at the document level. Results of these techniques are evaluated and our proposed model outperforms all other deep learning techniques for Urdu sentiment analysis. BiLSTM-SLMFCNN outperformed the baseline deep learning models and achieved 83%, 79%, 83% and 94% accuracy on small, medium and large sized IMDB Urdu movie review data set and Urdu Customer Support data set respectively.Keywords: urdu sentiment analysis, deep learning, natural language processing, opinion mining, low-resource language
Procedia PDF Downloads 718546 Improving the Performance of Requisition Document Online System for Royal Thai Army by Using Time Series Model
Authors: D. Prangchumpol
Abstract:
This research presents a forecasting method of requisition document demands for Military units by using Exponential Smoothing methods to analyze data. The data used in the forecast is an actual data requisition document of The Adjutant General Department. The results of the forecasting model to forecast the requisition of the document found that Holt–Winters’ trend and seasonality method of α=0.1, β=0, γ=0 is appropriate and matches for requisition of documents. In addition, the researcher has developed a requisition online system to improve the performance of requisition documents of The Adjutant General Department, and also ensuring that the operation can be checked.Keywords: requisition, holt–winters, time series, royal thai army
Procedia PDF Downloads 3088545 Off-Topic Text Detection System Using a Hybrid Model
Authors: Usama Shahid
Abstract:
Be it written documents, news columns, or students' essays, verifying the content can be a time-consuming task. Apart from the spelling and grammar mistakes, the proofreader is also supposed to verify whether the content included in the essay or document is relevant or not. The irrelevant content in any document or essay is referred to as off-topic text and in this paper, we will address the problem of off-topic text detection from a document using machine learning techniques. Our study aims to identify the off-topic content from a document using Echo state network model and we will also compare data with other models. The previous study uses Convolutional Neural Networks and TFIDF to detect off-topic text. We will rearrange the existing datasets and take new classifiers along with new word embeddings and implement them on existing and new datasets in order to compare the results with the previously existing CNN model.Keywords: off topic, text detection, eco state network, machine learning
Procedia PDF Downloads 858544 A Methodology for Automatic Diversification of Document Categories
Authors: Dasom Kim, Chen Liu, Myungsu Lim, Su-Hyeon Jeon, ByeoungKug Jeon, Kee-Young Kwahk, Namgyu Kim
Abstract:
Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, in the case of manual categorization, not only can the accuracy of the categorization be not guaranteed but the categorization also requires a large amount of time and huge costs. Many studies have been conducted towards the automatic creation of categories to solve the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because the methods work by assuming that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training using a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To overcome the limitation of the requirement of a multi-categorized training set by traditional multi-categorization algorithms, we previously proposed a new methodology that can extend a category of a single-categorized document to multiple categorizes by analyzing relationships among categories, topics, and documents. In this paper, we design a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.Keywords: big data analysis, document classification, multi-category, text mining, topic analysis
Procedia PDF Downloads 2728543 Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison
Authors: Po-Fang Hsu, Chiching Wei
Abstract:
In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, it is a challenging task without recording or tracing the whole edited process. Under many temporal uncertainties, we explore the potentiality of our approach to proximate the accurate comparison to make sure which element blocks have a relation of edition with others. In the beginning, we apply a document layout analysis that combines traditional and modern technics to segment layouts in blocks of various types appropriately. Then we transform this issue into a problem of layout graph matching with textual awareness. Regarding graph matching, it is a long-studied problem with a broad range of applications. However, different from previous works focusing on visual images or structural layout, we also bring textual features into our model for adapting this domain. Specifically, based on the electronic document, we introduce an encoder to deal with the visual presentation decoding from PDF. Additionally, because the modifications can cause the inconsistency of document layout analysis between modified documents and the blocks can be merged and split, Sinkhorn divergence is adopted in our neural graph approach, which tries to overcome both these issues with many-to-many block matching. We demonstrate this on two categories of layouts, as follows., legal agreement and scientific articles, collected from our real-case datasets.Keywords: document comparison, graph matching, graph neural network, modification similarity, multi-modal
Procedia PDF Downloads 1798542 Frequency of Tube Feeding in Aboriginal and Non-aboriginal Head and Neck Cancer Patients and the Impact on Relapse and Survival Outcomes
Authors: Kim Kennedy, Daren Gibson, Stephanie Flukes, Chandra Diwakarla, Lisa Spalding, Leanne Pilkington, Andrew Redfern
Abstract:
Introduction: Head and neck cancer and treatments are known for their profound effect on nutrition and tube feeding is a common requirement to maintain nutrition. Aim: We aimed to evaluate the frequency of tube feeding in Aboriginal and non-Aboriginal patients, and to examine the relapse and survival outcomes in patients who require enteral tube feeding. Methods: We performed a retrospective cohort analysis of 320 head and neck cancer patients from a single centre in Western Australia, identifying 80 Aboriginal patients and 240 non-Aboriginal patients matched on a 1:3 ratio by site, histology, rurality, and age. Data collected included patient demographics, tumour features, treatment details, and cancer and survival outcomes. Results: Aboriginal and non-Aboriginal patients required feeding tubes at similar rates (42.5% vs 46.2% respectively), however Aboriginal patients were far more likely to fail to return to oral nutrition, with 26.3% requiring long-term tube feeding versus only 15% of non-Aboriginal patients. In the overall study population, 27.5% required short-term tube feeding, 17.8% required long-term enteral tube nutrition, and 45.3% of patients did not have a feeding tube at any point. Relapse was more common in patients who required tube feeding, with relapses in 42.1% of the patients requiring long-term tube feeding, 31.8% in those requiring a short-term tube, versus 18.9% in the ‘no tube’ group. Survival outcomes for patients who required a long-term tube were also significantly poorer when compared to patients who only required a short-term tube, or not at all. Long-term tube-requiring patients were half as likely to survive (29.8%) compared to patients requiring a short-term tube (62.5%) or no tube at all (63.5%). Patients requiring a long-term tube were twice as likely to die with active disease (59.6%) as patients with no tube (28%), or a short term tube (33%). This may suggest an increased relapse risk in patients who require long-term feeding, due to consequences of malnutrition on cancer and treatment outcomes, although may simply reflect that patients with recurrent disease were more likely to have longer-term swallowing dysfunction due to recurrent disease and salvage treatments. Interestingly long-term tube patients were also more likely to die with no active disease (10.5%) (compared with short-term tube requiring patients (4.6%), or patients with no tube (8%)), which is likely reflective of the increased mortality associated with long-term aspiration and malnutrition issues. Conclusions: Requirement for tube feeding was associated with a higher rate of cancer relapse, and in particular, long-term tube feeding was associated with a higher likelihood of dying from head and neck cancer, but also a higher risk of dying from other causes without cancer relapse. This data reflects the complex effect of head and neck cancer and its treatments on swallowing and nutrition, and ultimately, the effects of malnutrition, swallowing dysfunction, and aspiration on overall cancer and survival outcomes. Tube feeding was seen at similar rates in Aboriginal and non-Aboriginal patient, however failure to return to oral intake with a requirement for a long-term feeding tube was seen far more commonly in the Aboriginal population.Keywords: head and neck cancer, enteral tube feeding, malnutrition, survival, relapse, aboriginal patients
Procedia PDF Downloads 1018541 Use of Interpretable Evolved Search Query Classifiers for Sinhala Documents
Authors: Prasanna Haddela
Abstract:
Document analysis is a well matured yet still active research field, partly as a result of the intricate nature of building computational tools but also due to the inherent problems arising from the variety and complexity of human languages. Breaking down language barriers is vital in enabling access to a number of recent technologies. This paper investigates the application of document classification methods to new Sinhalese datasets. This language is geographically isolated and rich with many of its own unique features. We will examine the interpretability of the classification models with a particular focus on the use of evolved Lucene search queries generated using a Genetic Algorithm (GA) as a method of document classification. We will compare the accuracy and interpretability of these search queries with other popular classifiers. The results are promising and are roughly in line with previous work on English language datasets.Keywords: evolved search queries, Sinhala document classification, Lucene Sinhala analyzer, interpretable text classification, genetic algorithm
Procedia PDF Downloads 1128540 Symbol Synchronization and Resource Reuse Schemes for Layered Video Multicast Service in Long Term Evolution Networks
Authors: Chung-Nan Lee, Sheng-Wei Chu, You-Chiun Wang
Abstract:
LTE (Long Term Evolution) employs the eMBMS (evolved Multimedia Broadcast/Multicast Service) protocol to deliver video streams to a multicast group of users. However, it requires all multicast members to receive a video stream in the same transmission rate, which would degrade the overall service quality when some users encounter bad channel conditions. To overcome this problem, this paper provides two efficient resource allocation schemes in such LTE network: The symbol synchronization (S2) scheme assumes that the macro and pico eNodeBs use the same frequency channel to deliver the video stream to all users. It then adopts a multicast transmission index to guarantee the fairness among users. On the other hand, the resource reuse (R2) scheme allows eNodeBs to transmit data on different frequency channels. Then, by introducing the concept of frequency reuse, it can further improve the overall service quality. Extensive simulation results show that the S2 and R2 schemes can respectively improve around 50% of fairness and 14% of video quality as compared with the common maximum throughput method.Keywords: LTE networks, multicast, resource allocation, layered video
Procedia PDF Downloads 3898539 Sentiment Classification Using Enhanced Contextual Valence Shifters
Authors: Vo Ngoc Phu, Phan Thi Tuoi
Abstract:
We have explored different methods of improving the accuracy of sentiment classification. The sentiment orientation of a document can be positive (+), negative (-), or neutral (0). We combine five dictionaries from [2, 3, 4, 5, 6] into the new one with 21137 entries. The new dictionary has many verbs, adverbs, phrases and idioms, that are not in five ones before. The paper shows that our proposed method based on the combination of Term-Counting method and Enhanced Contextual Valence Shifters method has improved the accuracy of sentiment classification. The combined method has accuracy 68.984% on the testing dataset, and 69.224% on the training dataset. All of these methods are implemented to classify the reviews based on our new dictionary and the Internet Movie data set.Keywords: sentiment classification, sentiment orientation, valence shifters, contextual, valence shifters, term counting
Procedia PDF Downloads 5038538 Bidirectional Long Short-Term Memory-Based Signal Detection for Orthogonal Frequency Division Multiplexing With All Index Modulation
Authors: Mahmut Yildirim
Abstract:
This paper proposed the bidirectional long short-term memory (Bi-LSTM) network-aided deep learning (DL)-based signal detection for Orthogonal frequency division multiplexing with all index modulation (OFDM-AIM), namely Bi-DeepAIM. OFDM-AIM is developed to increase the spectral efficiency of OFDM with index modulation (OFDM-IM), a promising multi-carrier technique for communication systems beyond 5G. In this paper, due to its strong classification ability, Bi-LSTM is considered an alternative to the maximum likelihood (ML) algorithm, which is used for signal detection in the classical OFDM-AIM scheme. The performance of the Bi-DeepAIM is compared with LSTM network-aided DL-based OFDM-AIM (DeepAIM) and classic OFDM-AIM that uses (ML)-based signal detection via BER performance and computational time criteria. Simulation results show that Bi-DeepAIM obtains better bit error rate (BER) performance than DeepAIM and lower computation time in signal detection than ML-AIM.Keywords: bidirectional long short-term memory, deep learning, maximum likelihood, OFDM with all index modulation, signal detection
Procedia PDF Downloads 70