Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 6338

Search results for: small text extraction

6338 Small Text Extraction From Documents and Chart Images

Authors: Rominkumar Busa, Shahira K. C., Lijiya A.

Abstract:

Text recognition is an important area in computer vision which deals with detecting and recognising text from an image. The Optical Character Recognition (OCR) is a saturated area these days and with very good text recognition accuracy. However the same OCR methods when applied on text with small font sizes like the text data of chart images, the recognition rate is less than 30\%. In this work, aims to extract small text in images using the deep learning model, CRNN with CTC loss. The text recognition accuracy is found to improve by applying image enhancement by super resolution prior to CRNN model. We also observe the text recognition rate further increases by 18\% by applying the proposed method, which involves super resolution and character segmentation followed by CRNN with CTC loss. The efficiency of the proposed method shows that further pre-processing on chart image text and other small text images will improve the accuracy further, thereby helping text extraction from chart images.

Keywords: small text extraction, OCR, scene text recognition, CRNN

Procedia PDF Downloads 22
6337 Urdu Text Extraction Method from Images

Authors: Samabia Tehsin, Sumaira Kausar

Abstract:

Due to the vast increase in the multimedia data in recent years, efficient and robust retrieval techniques are needed to retrieve and index images/ videos. Text embedded in the images can serve as the strong retrieval tool for images. This is the reason that text extraction is an area of research with increasing attention. English text extraction is the focus of many researchers but very less work has been done on other languages like Urdu. This paper is focusing on Urdu text extraction from video frames. This paper presents a text detection feature set, which has the ability to deal up with most of the problems connected with the text extraction process. To test the validity of the method, it is tested on Urdu news dataset, which gives promising results.

Keywords: caption text, content-based image retrieval, document analysis, text extraction

Procedia PDF Downloads 421
6336 Extraction of Text Subtitles in Multimedia Systems

Authors: Amarjit Singh

Abstract:

In this paper, a method for extraction of text subtitles in large video is proposed. The video data needs to be annotated for many multimedia applications. Text is incorporated in digital video for the motive of providing useful information about that video. So need arises to detect text present in video to understanding and video indexing. This is achieved in two steps. First step is text localization and the second step is text verification. The method of text detection can be extended to text recognition which finds applications in automatic video indexing; video annotation and content based video retrieval. The method has been tested on various types of videos.

Keywords: video, subtitles, extraction, annotation, frames

Procedia PDF Downloads 518
6335 Optimal Classifying and Extracting Fuzzy Relationship from Query Using Text Mining Techniques

Authors: Faisal Alshuwaier, Ali Areshey

Abstract:

Text mining techniques are generally applied for classifying the text, finding fuzzy relations and structures in data sets. This research provides plenty text mining capabilities. One common application is text classification and event extraction, which encompass deducing specific knowledge concerning incidents referred to in texts. The main contribution of this paper is the clarification of a concept graph generation mechanism, which is based on a text classification and optimal fuzzy relationship extraction. Furthermore, the work presented in this paper explains the application of fuzzy relationship extraction and branch and bound method to simplify the texts.

Keywords: extraction, max-prod, fuzzy relations, text mining, memberships, classification, memberships, classification

Procedia PDF Downloads 491
6334 Video Text Information Detection and Localization in Lecture Videos Using Moments

Authors: Belkacem Soundes, Guezouli Larbi

Abstract:

This paper presents a robust and accurate method for text detection and localization over lecture videos. Frame regions are classified into text or background based on visual feature analysis. However, lecture video shows significant degradation mainly related to acquisition conditions, camera motion and environmental changes resulting in low quality videos. Hence, affecting feature extraction and description efficiency. Moreover, traditional text detection methods cannot be directly applied to lecture videos. Therefore, robust feature extraction methods dedicated to this specific video genre are required for robust and accurate text detection and extraction. Method consists of a three-step process: Slide region detection and segmentation; Feature extraction and non-text filtering. For robust and effective features extraction moment functions are used. Two distinct types of moments are used: orthogonal and non-orthogonal. For orthogonal Zernike Moments, both Pseudo Zernike moments are used, whereas for non-orthogonal ones Hu moments are used. Expressivity and description efficiency are given and discussed. Proposed approach shows that in general, orthogonal moments show high accuracy in comparison to the non-orthogonal one. Pseudo Zernike moments are more effective than Zernike with better computation time.

Keywords: text detection, text localization, lecture videos, pseudo zernike moments

Procedia PDF Downloads 64
6333 Text Data Preprocessing Library: Bilingual Approach

Authors: Kabil Boukhari

Abstract:

In the context of information retrieval, the selection of the most relevant words is a very important step. In fact, the text cleaning allows keeping only the most representative words for a better use. In this paper, we propose a library for the purpose text preprocessing within an implemented application to facilitate this task. This study has two purposes. The first, is to present the related work of the various steps involved in text preprocessing, presenting the segmentation, stemming and lemmatization algorithms that could be efficient in the rest of study. The second, is to implement a developed tool for text preprocessing in French and English. This library accepts unstructured text as input and provides the preprocessed text as output, based on a set of rules and on a base of stop words for both languages. The proposed library has been made on different corpora and gave an interesting result.

Keywords: text preprocessing, segmentation, knowledge extraction, normalization, text generation, information retrieval

Procedia PDF Downloads 20
6332 A Conglomerate of Multiple Optical Character Recognition Table Detection and Extraction

Authors: Smita Pallavi, Raj Ratn Pranesh, Sumit Kumar

Abstract:

Information representation as tables is compact and concise method that eases searching, indexing, and storage requirements. Extracting and cloning tables from parsable documents is easier and widely used; however, industry still faces challenges in detecting and extracting tables from OCR (Optical Character Recognition) documents or images. This paper proposes an algorithm that detects and extracts multiple tables from OCR document. The algorithm uses a combination of image processing techniques, text recognition, and procedural coding to identify distinct tables in the same image and map the text to appropriate the corresponding cell in dataframe, which can be stored as comma-separated values, database, excel, and multiple other usable formats.

Keywords: table extraction, optical character recognition, image processing, text extraction, morphological transformation

Procedia PDF Downloads 66
6331 Towards Logical Inference for the Arabic Question-Answering

Authors: Wided Bakari, Patrice Bellot, Omar Trigui, Mahmoud Neji

Abstract:

This article constitutes an opening to think of the modeling and analysis of Arabic texts in the context of a question-answer system. It is a question of exceeding the traditional approaches focused on morphosyntactic approaches. Furthermore, we present a new approach that analyze a text in order to extract correct answers then transform it to logical predicates. In addition, we would like to represent different levels of information within a text to answer a question and choose an answer among several proposed. To do so, we transform both the question and the text into logical forms. Then, we try to recognize all entailment between them. The results of recognizing the entailment are a set of text sentences that can implicate the user’s question. Our work is now concentrated on an implementation step in order to develop a system of question-answering in Arabic using techniques to recognize textual implications. In this context, the extraction of text features (keywords, named entities, and relationships that link them) is actually considered the first step in our process of text modeling. The second one is the use of techniques of textual implication that relies on the notion of inference and logic representation to extract candidate answers. The last step is the extraction and selection of the desired answer.

Keywords: NLP, Arabic language, question-answering, recognition text entailment, logic forms

Procedia PDF Downloads 258
6330 Anatomical Survey for Text Pattern Detection

Authors: S. Tehsin, S. Kausar

Abstract:

The ultimate aim of machine intelligence is to explore and materialize the human capabilities, one of which is the ability to detect various text objects within one or more images displayed on any canvas including prints, videos or electronic displays. Multimedia data has increased rapidly in past years. Textual information present in multimedia contains important information about the image/video content. However, it needs to technologically testify the commonly used human intelligence of detecting and differentiating the text within an image, for computers. Hence in this paper feature set based on anatomical study of human text detection system is proposed. Subsequent examination bears testimony to the fact that the features extracted proved instrumental to text detection.

Keywords: biologically inspired vision, content based retrieval, document analysis, text extraction

Procedia PDF Downloads 375
6329 Exploratory Analysis of A Review of Nonexistence Polarity in Native Speech

Authors: Deawan Rakin Ahamed Remal, Sinthia Chowdhury, Sharun Akter Khushbu, Sheak Rashed Haider Noori

Abstract:

Native Speech to text synthesis has its own leverage for the purpose of mankind. The extensive nature of art to speaking different accents is common but the purpose of communication between two different accent types of people is quite difficult. This problem will be motivated by the extraction of the wrong perception of language meaning. Thus, many existing automatic speech recognition has been placed to detect text. Overall study of this paper mentions a review of NSTTR (Native Speech Text to Text Recognition) synthesis compared with Text to Text recognition. Review has exposed many text to text recognition systems that are at a very early stage to comply with the system by native speech recognition. Many discussions started about the progression of chatbots, linguistic theory another is rule based approach. In the Recent years Deep learning is an overwhelming chapter for text to text learning to detect language nature. To the best of our knowledge, In the sub continent a huge number of people speak in Bangla language but they have different accents in different regions therefore study has been elaborate contradictory discussion achievement of existing works and findings of future needs in Bangla language acoustic accent.

Keywords: TTR, NSTTR, text to text recognition, deep learning, natural language processing

Procedia PDF Downloads 28
6328 A U-Net Based Architecture for Fast and Accurate Diagram Extraction

Authors: Revoti Prasad Bora, Saurabh Yadav, Nikita Katyal

Abstract:

In the context of educational data mining, the use case of extracting information from images containing both text and diagrams is of high importance. Hence, document analysis requires the extraction of diagrams from such images and processes the text and diagrams separately. To the author’s best knowledge, none among plenty of approaches for extracting tables, figures, etc., suffice the need for real-time processing with high accuracy as needed in multiple applications. In the education domain, diagrams can be of varied characteristics viz. line-based i.e. geometric diagrams, chemical bonds, mathematical formulas, etc. There are two broad categories of approaches that try to solve similar problems viz. traditional computer vision based approaches and deep learning approaches. The traditional computer vision based approaches mainly leverage connected components and distance transform based processing and hence perform well in very limited scenarios. The existing deep learning approaches either leverage YOLO or faster-RCNN architectures. These approaches suffer from a performance-accuracy tradeoff. This paper proposes a U-Net based architecture that formulates the diagram extraction as a segmentation problem. The proposed method provides similar accuracy with a much faster extraction time as compared to the mentioned state-of-the-art approaches. Further, the segmentation mask in this approach allows the extraction of diagrams of irregular shapes.

Keywords: computer vision, deep-learning, educational data mining, faster-RCNN, figure extraction, image segmentation, real-time document analysis, text extraction, U-Net, YOLO

Procedia PDF Downloads 21
6327 A Deep Learning Approach to Subsection Identification in Electronic Health Records

Authors: Nitin Shravan, Sudarsun Santhiappan, B. Sivaselvan

Abstract:

Subsection identification, in the context of Electronic Health Records (EHRs), is identifying the important sections for down-stream tasks like auto-coding. In this work, we classify the text present in EHRs according to their information, using machine learning and deep learning techniques. We initially describe briefly about the problem and formulate it as a text classification problem. Then, we discuss upon the methods from the literature. We try two approaches - traditional feature extraction based machine learning methods and deep learning methods. Through experiments on a private dataset, we establish that the deep learning methods perform better than the feature extraction based Machine Learning Models.

Keywords: deep learning, machine learning, semantic clinical classification, subsection identification, text classification

Procedia PDF Downloads 111
6326 Extracting Actions with Improved Part of Speech Tagging for Social Networking Texts

Authors: Yassine Jamoussi, Ameni Youssfi, Henda Ben Ghezala

Abstract:

With the growing interest in social networking, the interaction of social actors evolved to a source of knowledge in which it becomes possible to perform context aware-reasoning. The information extraction from social networking especially Twitter and Facebook is one of the problems in this area. To extract text from social networking, we need several lexical features and large scale word clustering. We attempt to expand existing tokenizer and to develop our own tagger in order to support the incorrect words currently in existence in Facebook and Twitter. Our goal in this work is to benefit from the lexical features developed for Twitter and online conversational text in previous works, and to develop an extraction model for constructing a huge knowledge based on actions

Keywords: social networking, information extraction, part-of-speech tagging, natural language processing

Procedia PDF Downloads 232
6325 Myanmar Character Recognition Using Eight Direction Chain Code Frequency Features

Authors: Kyi Pyar Zaw, Zin Mar Kyu

Abstract:

Character recognition is the process of converting a text image file into editable and searchable text file. Feature Extraction is the heart of any character recognition system. The character recognition rate may be low or high depending on the extracted features. In the proposed paper, 25 features for one character are used in character recognition. Basically, there are three steps of character recognition such as character segmentation, feature extraction and classification. In segmentation step, horizontal cropping method is used for line segmentation and vertical cropping method is used for character segmentation. In the Feature extraction step, features are extracted in two ways. The first way is that the 8 features are extracted from the entire input character using eight direction chain code frequency extraction. The second way is that the input character is divided into 16 blocks. For each block, although 8 feature values are obtained through eight-direction chain code frequency extraction method, we define the sum of these 8 feature values as a feature for one block. Therefore, 16 features are extracted from that 16 blocks in the second way. We use the number of holes feature to cluster the similar characters. We can recognize the almost Myanmar common characters with various font sizes by using these features. All these 25 features are used in both training part and testing part. In the classification step, the characters are classified by matching the all features of input character with already trained features of characters.

Keywords: chain code frequency, character recognition, feature extraction, features matching, segmentation

Procedia PDF Downloads 245
6324 Event Extraction, Analysis, and Event Linking

Authors: Anam Alam, Rahim Jamaluddin Kanji

Abstract:

With the rapid growth of event in everywhere, event extraction has now become an important matter to retrieve the information from the unstructured data. One of the challenging problems is to extract the event from it. An event is an observable occurrence of interaction among entities. The paper investigates the effectiveness of event extraction capabilities of three software tools that are Wandora, Nitro and SPSS. We performed standard text mining techniques of these tools on the data sets of (i) Afghan War Diaries (AWD collection), (ii) MUC4 and (iii) WebKB. Information retrieval measures such as precision and recall which are computed under extensive set of experiments for Event Extraction. The experimental study analyzes the difference between events extracted by the software and human. This approach helps to construct an algorithm that will be applied for different machine learning methods.

Keywords: event extraction, Wandora, nitro, SPSS, event analysis, extraction method, AFG, Afghan War Diaries, MUC4, 4 universities, dataset, algorithm, precision, recall, evaluation

Procedia PDF Downloads 470
6323 Visual Template Detection and Compositional Automatic Regular Expression Generation for Business Invoice Extraction

Authors: Anthony Proschka, Deepak Mishra, Merlyn Ramanan, Zurab Baratashvili

Abstract:

Small and medium-sized businesses receive over 160 billion invoices every year. Since these documents exhibit many subtle differences in layout and text, extracting structured fields such as sender name, amount, and VAT rate from them automatically is an open research question. In this paper, existing work in template-based document extraction is extended, and a system is devised that is able to reliably extract all required fields for up to 70% of all documents in the data set, more than any other previously reported method. The approaches are described for 1) detecting through visual features which template a given document belongs to, 2) automatically generating extraction rules for a given new template by composing regular expressions from multiple components, and 3) computing confidence scores that indicate the accuracy of the automatic extractions. The system can generate templates with as little as one training sample and only requires the ground truth field values instead of detailed annotations such as bounding boxes that are hard to obtain. The system is deployed and used inside a commercial accounting software.

Keywords: data mining, information retrieval, business, feature extraction, layout, business data processing, document handling, end-user trained information extraction, document archiving, scanned business documents, automated document processing, F1-measure, commercial accounting software

Procedia PDF Downloads 60
6322 A Clustering Algorithm for Massive Texts

Authors: Ming Liu, Chong Wu, Bingquan Liu, Lei Chen

Abstract:

Internet users have to face the massive amount of textual data every day. Organizing texts into categories can help users dig the useful information from large-scale text collection. Clustering, in fact, is one of the most promising tools for categorizing texts due to its unsupervised characteristic. Unfortunately, most of traditional clustering algorithms lose their high qualities on large-scale text collection. This situation mainly attributes to the high- dimensional vectors generated from texts. To effectively and efficiently cluster large-scale text collection, this paper proposes a vector reconstruction based clustering algorithm. Only the features that can represent the cluster are preserved in cluster’s representative vector. This algorithm alternately repeats two sub-processes until it converges. One process is partial tuning sub-process, where feature’s weight is fine-tuned by iterative process. To accelerate clustering velocity, an intersection based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The other process is overall tuning sub-process, where the features are reallocated among different clusters. In this sub-process, the features useless to represent the cluster are removed from cluster’s representative vector. Experimental results on the three text collections (including two small-scale and one large-scale text collections) demonstrate that our algorithm obtains high quality on both small-scale and large-scale text collections.

Keywords: vector reconstruction, large-scale text clustering, partial tuning sub-process, overall tuning sub-process

Procedia PDF Downloads 369
6321 Morphological Processing of Punjabi Text for Sentiment Analysis of Farmer Suicides

Authors: Jaspreet Singh, Gurvinder Singh, Prabhsimran Singh, Rajinder Singh, Prithvipal Singh, Karanjeet Singh Kahlon, Ravinder Singh Sawhney

Abstract:

Morphological evaluation of Indian languages is one of the burgeoning fields in the area of Natural Language Processing (NLP). The evaluation of a language is an eminent task in the era of information retrieval and text mining. The extraction and classification of knowledge from text can be exploited for sentiment analysis and morphological evaluation. This study coalesce morphological evaluation and sentiment analysis for the task of classification of farmer suicide cases reported in Punjab state of India. The pre-processing of Punjabi text involves morphological evaluation and normalization of Punjabi word tokens followed by the training of proposed model using deep learning classification on Punjabi language text extracted from online Punjabi news reports. The class-wise accuracies of sentiment prediction for four negatively oriented classes of farmer suicide cases are 93.85%, 88.53%, 83.3%, and 95.45% respectively. The overall accuracy of sentiment classification obtained using proposed framework on 275 Punjabi text documents is found to be 90.29%.

Keywords: deep neural network, farmer suicides, morphological processing, punjabi text, sentiment analysis

Procedia PDF Downloads 169
6320 Recognition of Cursive Arabic Handwritten Text Using Embedded Training Based on Hidden Markov Models (HMMs)

Authors: Rabi Mouhcine, Amrouch Mustapha, Mahani Zouhir, Mammass Driss

Abstract:

In this paper, we present a system for offline recognition cursive Arabic handwritten text based on Hidden Markov Models (HMMs). The system is analytical without explicit segmentation used embedded training to perform and enhance the character models. Extraction features preceded by baseline estimation are statistical and geometric to integrate both the peculiarities of the text and the pixel distribution characteristics in the word image. These features are modelled using hidden Markov models and trained by embedded training. The experiments on images of the benchmark IFN/ENIT database show that the proposed system improves recognition.

Keywords: recognition, handwriting, Arabic text, HMMs, embedded training

Procedia PDF Downloads 274
6319 Mechanisms of Ginger Bioactive Compounds Extract Using Soxhlet and Accelerated Water Extraction

Authors: M. N. Azian, A. N. Ilia Anisa, Y. Iwai

Abstract:

The mechanism for extraction bioactive compounds from plant matrix is essential for optimizing the extraction process. As a benchmark technique, a soxhlet extraction has been utilized for discussing the mechanism and compared with an accelerated water extraction. The trends of both techniques show that the process involves extraction and degradation. The highest yields of 6-, 8-, 10-gingerols and 6-shogaol in soxhlet extraction were 13.948, 7.12, 10.312 and 2.306 mg/g, respectively. The optimum 6-, 8-, 10-gingerols and 6-shogaol extracted by the accelerated water extraction at 140oC were 68.97±3.95 mg/g at 3min, 18.98±3.04 mg/g at 5min, 5.167±2.35 mg/g at 3min and 14.57±6.27 mg/g at 3min, respectively. The effect of temperature at 3mins shows that the concentration of 6-shogaol increased rapidly as decreasing the recovery of 6-gingerol.

Keywords: mechanism, ginger bioactive compounds, soxhlet extraction, accelerated water extraction

Procedia PDF Downloads 351
6318 Information Extraction for Short-Answer Question for the University of the Cordilleras

Authors: Thelma Palaoag, Melanie Basa, Jezreel Mark Panilo

Abstract:

Checking short-answer questions and essays, whether it may be paper or electronic in form, is a tiring and tedious task for teachers. Evaluating a student’s output require wide array of domains. Scoring the work is often a critical task. Several attempts in the past few years to create an automated writing assessment software but only have received negative results from teachers and students alike due to unreliability in scoring, does not provide feedback and others. The study aims to create an application that will be able to check short-answer questions which incorporate information extraction. Information extraction is a subfield of Natural Language Processing (NLP) where a chunk of text (technically known as unstructured text) is being broken down to gather necessary bits of data and/or keywords (structured text) to be further analyzed or rather be utilized by query tools. The proposed system shall be able to extract keywords or phrases from the individual’s answers to match it into a corpora of words (as defined by the instructor), which shall be the basis of evaluation of the individual’s answer. The proposed system shall also enable the teacher to provide feedback and re-evaluate the output of the student for some writing elements in which the computer cannot fully evaluate such as creativity and logic. Teachers can formulate, design, and check short answer questions efficiently by defining keywords or phrases as parameters by assigning weights for checking answers. With the proposed system, teacher’s time in checking and evaluating students output shall be lessened, thus, making the teacher more productive and easier.

Keywords: information extraction, short-answer question, natural language processing, application

Procedia PDF Downloads 376
6317 OCR/ICR Text Recognition Using ABBYY FineReader as an Example Text

Authors: A. R. Bagirzade, A. Sh. Najafova, S. M. Yessirkepova, E. S. Albert

Abstract:

This article describes a text recognition method based on Optical Character Recognition (OCR). The features of the OCR method were examined using the ABBYY FineReader program. It describes automatic text recognition in images. OCR is necessary because optical input devices can only transmit raster graphics as a result. Text recognition describes the task of recognizing letters shown as such, to identify and assign them an assigned numerical value in accordance with the usual text encoding (ASCII, Unicode). The peculiarity of this study conducted by the authors using the example of the ABBYY FineReader, was confirmed and shown in practice, the improvement of digital text recognition platforms developed by Electronic Publication.

Keywords: ABBYY FineReader system, algorithm symbol recognition, OCR/ICR techniques, recognition technologies

Procedia PDF Downloads 75
6316 Recognition of Grocery Products in Images Captured by Cellular Phones

Authors: Farshideh Einsele, Hassan Foroosh

Abstract:

In this paper, we present a robust algorithm to recognize extracted text from grocery product images captured by mobile phone cameras. Recognition of such text is challenging since text in grocery product images varies in its size, orientation, style, illumination, and can suffer from perspective distortion. Pre-processing is performed to make the characters scale and rotation invariant. Since text degradations can not be appropriately defined using wellknown geometric transformations such as translation, rotation, affine transformation and shearing, we use the whole character black pixels as our feature vector. Classification is performed with minimum distance classifier using the maximum likelihood criterion, which delivers very promising Character Recognition Rate (CRR) of 89%. We achieve considerably higher Word Recognition Rate (WRR) of 99% when using lower level linguistic knowledge about product words during the recognition process.

Keywords: camera-based OCR, feature extraction, document, image processing, grocery products

Procedia PDF Downloads 334
6315 Programmed Speech to Text Summarization Using Graph-Based Algorithm

Authors: Hamsini Pulugurtha, P. V. S. L. Jagadamba

Abstract:

Programmed Speech to Text and Text Summarization Using Graph-based Algorithms can be utilized in gatherings to get the short depiction of the gathering for future reference. This gives signature check utilizing Siamese neural organization to confirm the personality of the client and convert the client gave sound record which is in English into English text utilizing the discourse acknowledgment bundle given in python. At times just the outline of the gathering is required, the answer for this text rundown. Thus, the record is then summed up utilizing the regular language preparing approaches, for example, solo extractive text outline calculations

Keywords: Siamese neural network, English speech, English text, natural language processing, unsupervised extractive text summarization

Procedia PDF Downloads 95
6314 A Relationship Extraction Method from Literary Fiction Considering Korean Linguistic Features

Authors: Hee-Jeong Ahn, Kee-Won Kim, Seung-Hoon Kim

Abstract:

The knowledge of the relationship between characters can help readers to understand the overall story or plot of the literary fiction. In this paper, we present a method for extracting the specific relationship between characters from a Korean literary fiction. Generally, methods for extracting relationships between characters in text are statistical or computational methods based on the sentence distance between characters without considering Korean linguistic features. Furthermore, it is difficult to extract the relationship with direction from text, such as one-sided love, because they consider only the weight of relationship, without considering the direction of the relationship. Therefore, in order to identify specific relationships between characters, we propose a statistical method considering linguistic features, such as syntactic patterns and speech verbs in Korean. The result of our method is represented by a weighted directed graph of the relationship between the characters. Furthermore, we expect that proposed method could be applied to the relationship analysis between characters of other content like movie or TV drama.

Keywords: data mining, Korean linguistic feature, literary fiction, relationship extraction

Procedia PDF Downloads 276
6313 Reducing Accidents Using Text Stops

Authors: Benish Chaudhry

Abstract:

Most of the accidents these days are occurring because of the ‘text-and-drive’ concept. If we look at the structure of cities in UAE, there are great distances, because of which it is impossible to drive without using or merely checking the cellphone. Moreover, if we look at the road structure, it is almost impossible to stop at a point and text. With the introduction of TEXT STOPs, drivers will be able to stop different stops for a maximum of 1 and a half-minute in order to reply or write a message. They can be introduced at a distance of 10 minutes of driving on the average speed of the road, so the drivers can look forward to a stop and can reply to a text when needed. A user survey indicates that drivers are willing to NOT text-and-drive if they have such a facility available.

Keywords: transport, accidents, urban planning, road planning

Procedia PDF Downloads 279
6312 Evolving Knowledge Extraction from Online Resources

Authors: Zhibo Xiao, Tharini Nayanika de Silva, Kezhi Mao

Abstract:

In this paper, we present an evolving knowledge extraction system named AKEOS (Automatic Knowledge Extraction from Online Sources). AKEOS consists of two modules, including a one-time learning module and an evolving learning module. The one-time learning module takes in user input query, and automatically harvests knowledge from online unstructured resources in an unsupervised way. The output of the one-time learning is a structured vector representing the harvested knowledge. The evolving learning module automatically schedules and performs repeated one-time learning to extract the newest information and track the development of an event. In addition, the evolving learning module summarizes the knowledge learned at different time points to produce a final knowledge vector about the event. With the evolving learning, we are able to visualize the key information of the event, discover the trends, and track the development of an event.

Keywords: evolving learning, knowledge extraction, knowledge graph, text mining

Procedia PDF Downloads 390
6311 Using the Smith-Waterman Algorithm to Extract Features in the Classification of Obesity Status

Authors: Rosa Figueroa, Christopher Flores

Abstract:

Text categorization is the problem of assigning a new document to a set of predetermined categories, on the basis of a training set of free-text data that contains documents whose category membership is known. To train a classification model, it is necessary to extract characteristics in the form of tokens that facilitate the learning and classification process. In text categorization, the feature extraction process involves the use of word sequences also known as N-grams. In general, it is expected that documents belonging to the same category share similar features. The Smith-Waterman (SW) algorithm is a dynamic programming algorithm that performs a local sequence alignment in order to determine similar regions between two strings or protein sequences. This work explores the use of SW algorithm as an alternative to feature extraction in text categorization. The dataset used for this purpose, contains 2,610 annotated documents with the classes Obese/Non-Obese. This dataset was represented in a matrix form using the Bag of Word approach. The score selected to represent the occurrence of the tokens in each document was the term frequency-inverse document frequency (TF-IDF). In order to extract features for classification, four experiments were conducted: the first experiment used SW to extract features, the second one used unigrams (single word), the third one used bigrams (two word sequence) and the last experiment used a combination of unigrams and bigrams to extract features for classification. To test the effectiveness of the extracted feature set for the four experiments, a Support Vector Machine (SVM) classifier was tuned using 20% of the dataset. The remaining 80% of the dataset together with 5-Fold Cross Validation were used to evaluate and compare the performance of the four experiments of feature extraction. Results from the tuning process suggest that SW performs better than the N-gram based feature extraction. These results were confirmed by using the remaining 80% of the dataset, where SW performed the best (accuracy = 97.10%, weighted average F-measure = 97.07%). The second best was obtained by the combination of unigrams-bigrams (accuracy = 96.04, weighted average F-measure = 95.97) closely followed by the bigrams (accuracy = 94.56%, weighted average F-measure = 94.46%) and finally unigrams (accuracy = 92.96%, weighted average F-measure = 92.90%).

Keywords: comorbidities, machine learning, obesity, Smith-Waterman algorithm

Procedia PDF Downloads 235
6310 Text Analysis to Support Structuring and Modelling a Public Policy Problem-Outline of an Algorithm to Extract Inferences from Textual Data

Authors: Claudia Ehrentraut, Osama Ibrahim, Hercules Dalianis

Abstract:

Policy making situations are real-world problems that exhibit complexity in that they are composed of many interrelated problems and issues. To be effective, policies must holistically address the complexity of the situation rather than propose solutions to single problems. Formulating and understanding the situation and its complex dynamics, therefore, is a key to finding holistic solutions. Analysis of text based information on the policy problem, using Natural Language Processing (NLP) and Text analysis techniques, can support modelling of public policy problem situations in a more objective way based on domain experts knowledge and scientific evidence. The objective behind this study is to support modelling of public policy problem situations, using text analysis of verbal descriptions of the problem. We propose a formal methodology for analysis of qualitative data from multiple information sources on a policy problem to construct a causal diagram of the problem. The analysis process aims at identifying key variables, linking them by cause-effect relationships and mapping that structure into a graphical representation that is adequate for designing action alternatives, i.e., policy options. This study describes the outline of an algorithm used to automate the initial step of a larger methodological approach, which is so far done manually. In this initial step, inferences about key variables and their interrelationships are extracted from textual data to support a better problem structuring. A small prototype for this step is also presented.

Keywords: public policy, problem structuring, qualitative analysis, natural language processing, algorithm, inference extraction

Procedia PDF Downloads 514
6309 Structure Analysis of Text-Image Connection in Jalayrid Period Illustrated Manuscripts

Authors: Mahsa Khani Oushani

Abstract:

Text and image are two important elements in the field of Iranian art, the text component and the image component have always been manifested together. The image narrates the text and the text is the factor in the formation of the image and they are closely related to each other. The connection between text and image is an interactive and two-way connection in the tradition of Iranian manuscript arrangement. The interaction between the narrative description and the image scene is the result of a direct and close connection between the text and the image, which in addition to the decorative aspect, also has a descriptive aspect. In this article the connection between the text element and the image element and its adaptation to the theory of Roland Barthes, the structuralism theorist, in this regard will be discussed. This study tends to investigate the question of how the connection between text and image in illustrated manuscripts of the Jalayrid period is defined according to Barthes’ theory. And what kind of proportion has the artist created in the composition between text and image. Based on the results of reviewing the data of this study, it can be inferred that in the Jalayrid period, the image has a reference connection and although it is of major importance on the page, it also maintains a close connection with the text and is placed in a special proportion. It is not necessarily balanced and symmetrical and sometimes uses imbalance for composition. This research has been done by descriptive-analytical method, which has been done by library collection method.

Keywords: structure, text, image, Jalayrid, painter

Procedia PDF Downloads 132