Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 159

Search results for: unstructured text

159 Clustering Unstructured Text Documents Using Fading Function

Authors: Pallav Roxy, Durga Toshniwal

Abstract:

Clustering unstructured text documents is an important problem in the data mining community, with applications such as document archive filtering, document organization, topic detection, and subject tracing. In the real world, some already-clustered documents may lose importance while new, more significant documents emerge. Most work to date on clustering unstructured text documents overlooks this aspect. This paper addresses the issue by using a Fading Function. The unstructured text documents are clustered, and for each cluster a statistics structure called a Cluster Profile (CP) is maintained. The Cluster Profile incorporates the Fading Function, which tracks the time-dependent importance of the cluster. The work proposes a novel algorithm, the Clustering n-ary Merge Algorithm (CnMA), for unstructured text documents that uses the Cluster Profile and the Fading Function. Experimental results illustrating the effectiveness of the proposed technique are also included.

Keywords: Clustering, Text Mining, Unstructured Text Documents, Fading Function.
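
The abstract above does not give the form of the Fading Function or the fields of the Cluster Profile. As a minimal sketch, assuming an exponential decay of the form 2^(-decay_rate * age) and a profile that stores only a faded document count, the time-dependent importance of a cluster could be tracked as follows. The class name, field names, and `decay_rate` parameter are hypothetical and are not taken from the paper, and CnMA itself is not reproduced here.

```python
def fading_weight(age, decay_rate=0.1):
    """Assumed exponential fading: the weight halves every 1/decay_rate time units."""
    return 2.0 ** (-decay_rate * age)


class ClusterProfile:
    """Hypothetical cluster profile holding a time-faded document count."""

    def __init__(self, decay_rate=0.1):
        self.decay_rate = decay_rate
        self.weight = 0.0       # faded number of documents in the cluster
        self.last_update = 0.0  # timestamp of the most recent update

    def add_document(self, timestamp):
        # Fade the existing weight to the new timestamp, then count the new document.
        # Timestamps are assumed to be non-decreasing.
        self.weight *= fading_weight(timestamp - self.last_update, self.decay_rate)
        self.last_update = timestamp
        self.weight += 1.0

    def importance(self, now):
        # Importance of the cluster as seen at time `now`, without modifying the profile.
        return self.weight * fading_weight(now - self.last_update, self.decay_rate)
```

With such a profile, a cluster that stops receiving documents gradually loses importance and could be dropped or merged once it falls below a chosen threshold.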

158 Weighted-Distance Sliding Windows and Cooccurrence Graphs for Supporting Entity-Relationship Discovery in Unstructured Text

Authors: Paolo Fantozzi, Luigi Laura, Umberto Nanni

Abstract:

The problem of entity relation discovery in structured data, a well-covered topic in the literature, consists of searching within unstructured sources (typically text) to find connections among entities. These can be a whole dictionary or a specific collection of named items. In many cases, machine learning and/or text mining techniques are used for this goal. These approaches might be infeasible for computationally challenging problems, such as processing massive data streams. A faster approach consists of collecting the cooccurrences of any two words (entities) in order to create a graph of relations: a cooccurrence graph. Indeed, each cooccurrence indicates some degree of semantic correlation between the words, because related words are more likely to appear close to each other than at opposite ends of the text. Some authors have used sliding windows for this problem: they count all the occurrences within a sliding window running over the whole text. In this paper we generalise this technique into a Weighted-Distance Sliding Window, where each occurrence of two named items within the window is counted with a weight depending on the distance between the items: a shorter distance implies stronger evidence of a relationship. We develop an experiment to support this intuition, applying the technique to a data set consisting of the text of the Bible, split into verses.

Keywords: Cooccurrence graph, entity relation graph, unstructured text, weighted distance.
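
The weighting scheme is not specified in the abstract above. The following is a minimal sketch of a weighted-distance sliding window, assuming a weight of 1/distance (in tokens) for each pair of distinct entities that fall within the window. The function name, the `window` parameter, and the choice of weight are illustrative assumptions, not the authors' definition.

```python
from collections import defaultdict


def weighted_cooccurrence(tokens, entities, window=5):
    """Build an undirected cooccurrence graph as {(a, b): weight} with a < b.

    Each pair of distinct entities closer than `window` tokens adds
    1/distance to the edge weight (an assumed weighting).
    """
    # Keep only the positions of tokens that are entities of interest.
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    graph = defaultdict(float)
    for idx, (i, a) in enumerate(positions):
        for j, b in positions[idx + 1:]:
            distance = j - i
            if distance >= window:
                break  # positions are sorted, so later ones are even farther away
            if a != b:
                graph[tuple(sorted((a, b)))] += 1.0 / distance
    return dict(graph)
```

A plain sliding window corresponds to replacing `1.0 / distance` with `1.0`, so every cooccurrence inside the window counts equally.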

157 Mining Association Rules from Unstructured Documents

Authors: Hany Mahgoub

Abstract:

This paper presents a system called EART (Extract Association Rules from Text) for discovering association rules from collections of unstructured documents. The EART system handles text only, not images or figures. EART discovers association rules among the keywords labeling the collection of textual documents. The main characteristic of EART is that it integrates XML technology (to transform unstructured documents into structured documents) with an Information Retrieval scheme (TF-IDF) and a Data Mining technique for association rule extraction. EART relies on word features to extract association rules. It consists of four phases: a structure phase, an index phase, a text mining phase, and a visualization phase. Our work depends on analyzing the keywords in the extracted association rules through the co-occurrence of the keywords in one sentence of the original text and the existence of the keywords in one sentence without co-occurrence. Experiments were conducted on a collection of scientific documents selected from MEDLINE that relate to the outbreak of the H5N1 avian influenza virus.

Keywords: Association rules, information retrieval, knowledge discovery in text, text mining.
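
To illustrate the text mining phase mentioned above, here is a minimal sketch of extracting pairwise association rules from per-document keyword sets using the standard support and confidence measures. It assumes the keywords have already been selected (for example by TF-IDF, which is not shown), and it is not the EART implementation. The function name and thresholds are hypothetical.

```python
from itertools import combinations


def association_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between single keywords.

    `transactions` is a non-empty list of keyword collections, one per document.
    """
    sets = [set(t) for t in transactions]
    n = len(sets)

    def support(items):
        # Fraction of documents that contain every keyword in `items`.
        return sum(1 for s in sets if items <= s) / n

    keywords = sorted(set().union(*sets))
    rules = []
    for a, b in combinations(keywords, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Test both directions of the rule: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules
```

A rule such as `poultry -> outbreak` is kept only if the pair appears in enough documents (support) and `outbreak` appears in a large enough share of the documents containing `poultry` (confidence).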

156 Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge

Authors: Lu Zhang, Chunping Li, Jun Liu, Hui Wang

Abstract:

Text similarity measurement is a fundamental issue in many textual applications such as document clustering, classification, summarization and question answering. However, prevailing approaches based on Vector Space Model (VSM) more or less suffer from the limitation of Bag of Words (BOW), which ignores the semantic relationship among words. Enriching document representation with background knowledge from Wikipedia is proven to be an effective way to solve this problem, but most existing methods still cannot avoid similar flaws of BOW in a new vector space. In this paper, we propose a novel text similarity measurement which goes beyond VSM and can find semantic affinity between documents. Specifically, it is a unified graph model that exploits Wikipedia as background knowledge and synthesizes both document representation and similarity computation. The experimental results on two different datasets show that our approach significantly improves VSM-based methods in both text clustering and classification.

Keywords: Text classification, Text clustering, Text similarity, Wikipedia
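
The Bag of Words limitation described above can be seen in a few lines. This is a minimal sketch of cosine similarity over raw term counts, which is only the VSM baseline the abstract criticizes, not the proposed graph model. The function name and whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_bow(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words shared by both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because synonyms such as "car" and "automobile" occupy different dimensions, `cosine_bow("car", "automobile")` is zero even though the words are semantically close, which is the gap that background knowledge is meant to fill.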

155 A Novel Arabic Text Steganography Method Using Letter Points and Extensions

Authors: Adnan Abdul-Aziz Gutub, Manal Mohammad Fattani

Abstract:

This paper presents a new steganography approach suitable for Arabic texts. It can be classified under steganography feature coding methods. The approach hides secret information bits within the letters, taking advantage of their inherent points. To mark the specific letters holding secret bits, the scheme considers two features: the existence of points in the letters and the redundant Arabic extension character. We use pointed letters with an extension to hold the secret bit 'one' and un-pointed letters with an extension to hold 'zero'. This steganography technique is also attractive for other languages with scripts similar to Arabic, such as Persian and Urdu.

Keywords: Arabic text, Cryptography, Feature coding, Information security, Text steganography, Text watermarking.

154 Opinion Mining Framework in the Education Domain

Authors: A. M. H. Elyasir, K. S. M. Anbananthen

Abstract:

The internet is growing larger and becoming the most popular platform for people to share their opinions on different interests. We choose the education domain, specifically comparing some Malaysian universities against each other. This comparison produces a benchmark based on different criteria shared by online users in various online resources, including Twitter, Facebook and web pages. The comparison is accomplished using an opinion mining framework to extract and process the unstructured text and classify the result as positive, negative or neutral (polarity). Hence, we divide our framework into three main stages: opinion collection (extraction), unstructured text processing, and polarity classification. The extraction stage includes web crawling, HTML parsing, sentence segmentation for punctuation classification, and Part of Speech (POS) tagging. The second stage processes the unstructured text with stemming and stop-word removal, and finally prepares the raw text for classification using Named Entity Recognition (NER). The last phase classifies the polarity and presents the overall result of the comparison among the Malaysian universities. The final result is useful for those who are interested in studying in Malaysia, as our final output declares clear winners based on public opinion across the web.

Keywords: Entity Recognition, Education Domain, Opinion Mining, Unstructured Text.

153 Exploiting Query Feedback for Efficient Query Routing in Unstructured Peer-to-peer Networks

Authors: Iskandar Ishak, Naomie Salim

Abstract:

Unstructured peer-to-peer networks are popular due to their robustness and scalability. Query schemes used in unstructured peer-to-peer networks, such as flooding and interest-based shortcuts, suffer from various problems, such as large communication overhead and long response delays. The use of routing indices has been a popular approach for peer-to-peer query routing. It helps the query routing process learn routes based on the feedback collected. In an unstructured network where no global information is available, an efficient and low-cost routing approach is needed. In this paper, we propose a novel mechanism for query-feedback-oriented routing indices to achieve routing efficiency in unstructured networks at minimal cost. The approach also applies an information retrieval technique to make sure the content of the query is understandable, so that the routing process is based not just on query hits but also on the query content. Experiments have shown that the proposed mechanism performs more efficiently than flood-based routing.

Keywords: Unstructured peer-to-peer, Searching, Retrieval, Internet.

152 A Finite Volume Procedure on Unstructured Meshes for Fluid-Structure Interaction Problems

Authors: P I Jagad, B P Puranik, A W Date

Abstract:

Flow through micro and mini channels requires relatively high driving pressure due to the large fluid pressure drop through these channels. Consequently, the forces acting on the walls of the channel due to the fluid pressure are also large. These forces set up displacement fields in the solid substrate containing the channels. If the movement of the substrate is constrained at some points, then stress fields are established in the substrate. On the other hand, if the deformation of the channel shape is sufficiently large, then its effect on the fluid flow must be calculated. Such coupled fluid-solid systems form a class of problems known as fluid-structure interactions. In the present work, a co-located finite volume discretization procedure on unstructured meshes is described for solving fluid-structure interaction problems. A linear elastic solid is assumed, for which the effect of the channel deformation on the flow is neglected. Thus the governing equations for the fluid and the solid are decoupled and are solved separately. The procedure is validated by solving two benchmark problems, one from fluid mechanics and another from solid mechanics. A fluid-structure interaction problem of flow through a U-shaped channel embedded in a plate is solved.

Keywords: Finite volume method, flow induced stresses, fluid-structure interaction, unstructured meshes.

151 Discovery and Capture of Organizational Knowledge from Unstructured Information

Authors: J. Gu, W.B. Lee, C.F. Cheung, E. Tsui, W.M. Wang

Abstract:

Knowledge of an organization does not merely reside in structured form of information and data; it is also embedded in unstructured form. The discovery of such knowledge is particularly difficult as the characteristic is dynamic, scattered, massive and multiplying at high speed. Conventional methods of managing unstructured information are considered too resource demanding and time consuming to cope with the rapid information growth. In this paper, a Multi-faceted and Automatic Knowledge Elicitation System (MAKES) is introduced for the purpose of discovery and capture of organizational knowledge. A trial implementation has been conducted in a public organization to achieve the objective of decision capture and navigation from a number of meeting minutes which are autonomously organized, classified and presented in a multi-faceted taxonomy map in both document and content level. Key concepts such as critical decision made, key knowledge workers, knowledge flow and the relationship among them are elicited and displayed in predefined knowledge model and maps. Hence, the structured knowledge can be retained, shared and reused. Conducting Knowledge Management with MAKES reduces work in searching and retrieving the target decision, saves a great deal of time and manpower, and also enables an organization to keep pace with the knowledge life cycle. This is particularly important when the amount of unstructured information and data grows extremely quickly. This system approach of knowledge management can accelerate value extraction and creation cycles of organizations.

Keywords: Knowledge-Based System, Knowledge Elicitation, Knowledge Management, Taxonomy, Unstructured Information Management

150 Continuous Text Translation Using Text Modeling in the Thetos System

Authors: Nina Suszczanska, Przemyslaw Szmal, Slawomir Kulikow

Abstract:

In the paper a method of modeling text for Polish is discussed. The method is aimed at transforming continuous input text into a text consisting of sentences in so called canonical form, whose characteristic is, among others, a complete structure as well as no anaphora or ellipses. The transformation is lossless as to the content of text being transformed. The modeling method has been worked out for the needs of the Thetos system, which translates Polish written texts into the Polish sign language. We believe that the method can be also used in various applications that deal with the natural language, e.g. in a text summary generator for Polish.

Keywords: anaphora, machine translation, NLP, sign language, text syntax.

149 Dynamic Variational Multiscale LES of Bluff Body Flows on Unstructured Grids

Authors: Carine Moussaed, Stephen Wornom, Bruno Koobus, Maria Vittoria Salvetti, Alain Dervieux

Abstract:

The effects of dynamic subgrid scale (SGS) models are investigated in variational multiscale (VMS) LES simulations of bluff body flows. The spatial discretization is based on a mixed finite element/finite volume formulation on unstructured grids. In the VMS approach used in this work, the separation between the largest and the smallest resolved scales is obtained through a variational projection operator and a finite volume cell agglomeration. The dynamic versions of the Smagorinsky and WALE SGS models are used to account for the effects of the unresolved scales. In the VMS approach, these effects are only modeled in the smallest resolved scales. The dynamic VMS-LES approach is applied to the simulation of the flow around a circular cylinder at Reynolds numbers 3900 and 20000 and to the flow around a square cylinder at Reynolds numbers 22000 and 175000. It is observed, as in previous studies, that the dynamic SGS procedure has a smaller impact on the results within the VMS approach than in LES. However, improvements are demonstrated for important features such as the recirculating part of the flow. The global prediction is improved at a small additional computational cost.

Keywords: variational multiscale LES, dynamic SGS model, unstructured grids, circular cylinder, square cylinder.

148 Classifying Biomedical Text Abstracts based on Hierarchical 'Concept' Structure

Authors: Rozilawati Binti Dollah, Masaki Aono

Abstract:

Classifying biomedical literature is a difficult and challenging task, especially when a large number of biomedical articles must be organized into a hierarchical structure. In this paper, we present an approach for classifying a collection of biomedical text abstracts downloaded from the Medline database with the help of ontology alignment. To accomplish our goal, we construct two types of hierarchies, the OHSUMED disease hierarchy and the Medline abstract disease hierarchies, from the OHSUMED dataset and the Medline abstracts, respectively. Then, we enrich the OHSUMED disease hierarchy before adapting it to the ontology alignment process for finding probable concepts or categories. Subsequently, we compute the cosine similarity between the vector in probable concepts (in the "enriched" OHSUMED disease hierarchy) and the vector in Medline abstract disease hierarchies. Finally, we assign a category to the new Medline abstracts based on the similarity score. The results obtained from the experiments show that the performance of our proposed approach for hierarchical classification is slightly better than that of multi-class flat classification.

Keywords: Biomedical literature, hierarchical text classification, ontology alignment, text mining.

147 Finite Volume Method for Flow Prediction Using Unstructured Meshes

Authors: Juhee Lee, Yongjun Lee

Abstract:

In designing low-energy-consuming buildings, the heat transfer through a large glass pane or wall becomes critical. Multiple layers of window glass and walls are employed for high insulation. The gravity-driven air flow between window glasses or wall layers is a natural heat convection phenomenon that is key to the heat transfer. As a first step toward natural heat transfer analysis, this study presents the development and application of a finite volume method for the numerical computation of viscous incompressible flows. It will become part of a natural convection analysis with a high-order scheme, a multi-grid method, and dual time stepping in the future. A fully implicit, second-order finite volume method is used to discretize and solve the fluid flow on unstructured grids composed of arbitrarily shaped cells. The integrals of the governing equations are discretised in the finite volume manner using a collocated arrangement of variables. The convergence of the SIMPLE segregated algorithm for the solution of the coupled nonlinear algebraic equations is accelerated by using a sparse matrix solver such as BiCGSTAB. The method used in the present study is verified by applying it to flows for which either the numerical solution is known or the solution can be obtained using another numerical technique from other research. The accuracy of the method is assessed through grid refinement.

Keywords: Finite volume method, fluid flow, laminar flow, unstructured grid.

146 A Proposed Hybrid Approach for Feature Selection in Text Document Categorization

Authors: M. F. Zaiyadi, B. Baharudin

Abstract:

Text document categorization involves a large amount of data or features. The high dimensionality of features is troublesome and can affect classification performance. Therefore, feature selection is considered one of the crucial parts of text document categorization. Selecting the best features to represent documents can reduce the dimensionality of the feature space and hence increase performance. Many approaches have been implemented by various researchers to overcome this problem. This paper proposes a novel hybrid approach for feature selection in text document categorization based on Ant Colony Optimization (ACO) and Information Gain (IG). We also present state-of-the-art algorithms by several other researchers.

Keywords: Ant colony optimization, feature selection, information gain, text categorization, text representation.
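
As an illustration of the Information Gain half of the hybrid approach above, here is a minimal sketch that scores a term by how much knowing its presence reduces the entropy of the class labels. The Ant Colony Optimization part and the way the paper combines the two are not shown. Function names and the set-of-terms document representation are assumptions made for the example.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, term):
    """IG of `term`, where each document is a set of terms."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Weighted entropy of the labels after splitting on presence of the term.
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional
```

Ranking terms by this score and keeping the top ones is a common filter-style way to reduce the dimensionality of the feature space.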

145 An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation

Authors: Jagath Samarabandu, Xiaoqing Liu

Abstract:

Using bottom-up image processing algorithms to predict human eye fixations and extract the relevant embedded information in images has been widely applied in the design of active machine vision systems. Scene text is an important feature to be extracted, especially in vision-based mobile robot navigation as many potential landmarks such as nameplates and information signs contain text. This paper proposes an edge-based text region extraction algorithm, which is robust with respect to font sizes, styles, color/intensity, orientations, and effects of illumination, reflections, shadows, perspective distortion, and the complexity of image backgrounds. Performance of the proposed algorithm is compared against a number of widely used text localization algorithms and the results show that this method can quickly and effectively localize and extract text regions from real scenes and can be used in mobile robot navigation under an indoor environment to detect text based landmarks.

Keywords: Landmarks, mobile robot navigation, scene text, text localization and extraction.

144 Emotional Analysis for Text Search Queries on Internet

Authors: Gemma García López

Abstract:

The goal of this study is to analyze if search queries carried out in search engines such as Google, can offer emotional information about the user that performs them. Knowing the emotional state in which the Internet user is located can be a key to achieve the maximum personalization of content and the detection of worrying behaviors. For this, two studies were carried out using tools with advanced natural language processing techniques. The first study determines if a query can be classified as positive, negative or neutral, while the second study extracts emotional content from words and applies the categorical and dimensional models for the representation of emotions. In addition, we use search queries in Spanish and English to establish similarities and differences between two languages. The results revealed that text search queries performed by users on the Internet can be classified emotionally. This allows us to better understand the emotional state of the user at the time of the search, which could involve adapting the technology and personalizing the responses to different emotional states.

Keywords: Emotion classification, text search queries, emotional analysis, sentiment analysis in text, natural language processing.

143 An Unstructured Finite-volume Technique for Shallow-water Flows with Wetting and Drying Fronts

Authors: Rajendra K. Ray, Kim Dan Nguyen

Abstract:

An unstructured finite volume numerical model is presented here for simulating shallow-water flows with wetting and drying fronts. The model is based on Green's theorem in combination with Chorin's projection method. A 2nd-order upwind scheme coupled with a least-squares technique is used to handle convection terms. A wetting and drying treatment is used in the present model to ensure total mass conservation. To test its capacity and reliability, the present model is used to solve the Parabolic Bowl problem. We compare our numerical solutions with the corresponding analytical and existing standard numerical results. Excellent agreement is found in all cases.

Keywords: Finite volume method, Projection method, Shallow water, Unstructured grid, wetting/drying fronts.

142 Visual Text Analytics Technologies for Real-Time Big Data: Chronological Evolution and Issues

Authors: Siti Azrina B. A. Aziz, Siti Hafizah A. Hamid

Abstract:

New approaches to analyzing and visualizing data streams in real time are important for prompt decision-making. Financial market trading and surveillance, large-scale emergency response, and crowd control are example scenarios that require real-time analytics and data visualization. This situation has led to the development of techniques and tools that support humans in analyzing the source data. With the emergence of Big Data and social media, new techniques and tools are required in order to process the streaming data. Today, a range of tools that implement some of these functionalities is available. In this paper, we present a chronological evaluation of the evolution of technologies supporting real-time analytics and visualization of data streams. Based on research papers published from 2002 to 2014, we gathered the general information, main techniques, challenges and open issues. The techniques for streaming text visualization are identified based on the Text Visualization Browser in chronological order. This paper aims to review the evolution of streaming text visualization techniques and tools, as well as to discuss the problems and challenges for each of the identified tools.

Keywords: Information visualization, visual analytics, text mining, visual text analytics tools, big data visualization.

141 Text Summarization for Oil and Gas News Article

Authors: L. H. Chong, Y. Y. Chen

Abstract:

Information is increasing in volume; companies are so overloaded with information that they may lose track of the information they need. Scanning through each lengthy document is a time-consuming task. A shorter version of the document that contains only the gist is more favourable for most information seekers. Therefore, in this paper, we implement a text summarization system to produce a summary that contains the gist of oil and gas news articles. The summarization is intended to provide important information for oil and gas companies to monitor their competitors' behaviour and so help them formulate business strategies. The system integrates a statistical approach with three underlying concepts: keyword occurrences, the title of the news article, and the location of the sentence. The generated summaries were compared with human-generated summaries from an oil and gas company. Precision and recall are used to evaluate the accuracy of the generated summary. Based on the experimental results, the system is able to produce an effective summary with an average recall of 83% at a compression rate of 25%.

Keywords: Information retrieval, text summarization, statistical approach.
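
The abstract above does not state how generated and human summaries are matched. As a minimal sketch, assuming both summaries are represented as sets of selected sentence identifiers and that the compression rate is the ratio of summary length to document length, the evaluation measures could be computed as follows. The function names are hypothetical.

```python
def precision_recall(system_sentences, reference_sentences):
    """Precision and recall of a system summary against a reference summary.

    Both arguments are collections of sentence identifiers (assumed matching unit).
    """
    system = set(system_sentences)
    reference = set(reference_sentences)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def compression_rate(summary_length, document_length):
    """Fraction of the original document kept in the summary."""
    return summary_length / document_length
```

Under these assumptions, a summary that keeps 5 of 20 sentences has a compression rate of 25%, and recall measures how many of the human-selected sentences it recovers.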

140 The Influence of Preprocessing Parameters on Text Categorization

Authors: Jan Pomikalek, Radim Rehurek

Abstract:

Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.

Keywords: Text categorization, machine learning, electronic documents, classification.

139 A System to Adapt Techniques of Text Summarizing to Polish

Authors: Marcin Ciura, Damian Grund, S

Abstract:

This paper describes a system, in which various methods of text summarizing can be adapted to Polish. A structure of the system is presented. A modular construction of the system and access to the system via the Internet are signaled.

Keywords: Automatic summary generation, linguistic analysis, text generation.

138 Towards a Deconstructive Text: Beyond Language and the Politics of Absences in Samuel Beckett’s Waiting for Godot

Authors: Afia Shahid

Abstract:

The writing of Samuel Beckett is associated with meaning in meaninglessness and the production of what he calls 'literature of unword'. The casual escape from the world of words in the form of silences and pauses in his play Waiting for Godot urges us to question their existence and ultimately leads us to investigate the theory behind their use in the play. This paper proposes that these absences (silence and pause) in Beckett's play force us to think 'beyond' language. It asks how silence and pause in Beckett's text speak for the emergence of the poststructuralist text. It aims to identify the significant features of the philosophy of deconstruction in Beckett's play in order to demystify the hostile complicity between literature and philosophy. Working within the interpretive paradigm of poststructuralism, this research treats the text as its research data. It attempts to delineate the relationship between poststructuralist theoretical concerns and Beckett's text. Keeping in view the theoretical concerns of the poststructuralist theorist Jacques Derrida, the discussion is directed towards the notion of going 'beyond' language into the absences that are aimed at silencing the existing discourse with the 'radical irony' of this anti-formal art, which contains its own denial and thus represents the idea of ceaseless questioning and radical contradiction in art and any text. This article asks how Beckett's text vibrates with loud silence and has disrupted language to demonstrate the emptiness of words, thus exploring the limitless void of absences. Beckett's text resonates with silence and pause that are neither negation nor affirmation but rather a poststructuralist suspension of reality that is ever changing with the undecidability of all meanings. Within the theoretical notion of Derrida's Différance, this study interprets silence and pause in Beckett's art. The silence and pause behave like Derrida's Différance and question their own existence in the text in order to deconstruct any definiteness and finality of reality, extending an undecidable poststructuralist threshold that aims to evade the 'labyrinth of language'.

Keywords: Différance, language, pause, poststructuralism, silence, text.

137 Evaluating 8D Reports Using Text-Mining

Authors: Benjamin Kuester, Bjoern Eilert, Malte Stonis, Ludger Overmeyer

Abstract:

Increasing quality requirements make reliable and effective quality management indispensable. This includes the complaint handling in which the 8D method is widely used. The 8D report as a written documentation of the 8D method is one of the key quality documents as it internally secures the quality standards and acts as a communication medium to the customer. In practice, however, the 8D report is mostly faulty and of poor quality. There is no quality control of 8D reports today. This paper describes the use of natural language processing for the automated evaluation of 8D reports. Based on semantic analysis and text-mining algorithms the presented system is able to uncover content and formal quality deficiencies and thus increases the quality of the complaint processing in the long term.

Keywords: 8D report, complaint management, evaluation system, text-mining.

136 Narrative and Expository Text Reading Comprehension by Fourth Grade Spanish-Speaking Children

Authors: Mariela V. De Mier, Veronica S. Sanchez Abchi, Ana M. Borzone

Abstract:

This work explores the factors that affect the reading comprehension process across different types of text. In a recent study with 2nd, 3rd and 4th grade children, comprehension of narrative texts was observed to be better than comprehension of expository texts. Nevertheless, it seems that not only the type of text but also other textual factors account for comprehension, depending on the cognitive processing demands posed by the text. To explore this assumption, three narrative and three expository texts with different degrees of complexity were prepared. A group of 40 fourth grade Spanish-speaking children took part in the study. The children were asked to read the texts and to answer orally three literal and three inferential questions for each text. The quantitative and qualitative analysis of the children’s responses showed that they had difficulties with both narrative and expository texts. The difficulty lay in answering questions that involved establishing complex relationships among information units that were present in the text or that had to be activated from the children’s previous knowledge in order to make an inference. From the data analysis, it could be concluded that there is some interaction between the type of text and the cognitive processing load of a specific text.

Keywords: comprehension, textual factors, type of text, processing demands.

135 Automatic Text Summarization

Authors: Mohamed Abdel Fattah, Fuji Ren

Abstract:

This work proposes a trainable summarizer for automatic text summarization. To generate summaries, it takes into account several features of each sentence: sentence position, positive keywords, negative keywords, sentence centrality, sentence resemblance to the title, inclusion of named entities, inclusion of numerical data, relative sentence length, the bushy path of the sentence and aggregated similarity. First we investigate the effect of each sentence feature on the summarization task. Then we use the score function over all features to train genetic algorithm (GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights. The performance of the proposed approach is measured at several compression rates on a corpus of 100 English religious articles. The results of the proposed approach are promising.

Keywords: Automatic Summarization, Genetic Algorithm, Mathematical Regression, Text Features.
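
The feature-weighted scoring described in this abstract can be illustrated with a minimal Python sketch. It covers only four of the listed features (position, resemblance to the title, relative length, numerical data), and the weights are hypothetical fixed values chosen for illustration; in the paper the weights are obtained by training GA and MR models, which is not reproduced here. All function names are assumptions.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def sentence_features(sentences, title):
    """Return one feature dict per sentence, each value in [0, 1]."""
    title_words = set(tokenize(title))
    longest = max(len(tokenize(s)) for s in sentences) or 1
    features = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        overlap = len(title_words & set(words))
        features.append({
            # Earlier sentences score higher.
            "position": 1.0 - i / len(sentences),
            # Share of title words that also occur in the sentence.
            "title_resemblance": overlap / len(title_words) if title_words else 0.0,
            # Length relative to the longest sentence in the document.
            "relative_length": len(words) / longest,
            # 1.0 if the sentence contains any digit.
            "has_number": 1.0 if re.search(r"\d", sentence) else 0.0,
        })
    return features

# Hypothetical weights (assumption): the paper learns these with GA / MR.
WEIGHTS = {
    "position": 0.4,
    "title_resemblance": 0.3,
    "relative_length": 0.2,
    "has_number": 0.1,
}

def summarize(sentences, title, compression_rate=0.3):
    """Keep the top-scoring fraction of sentences, in document order."""
    feats = sentence_features(sentences, title)
    scores = [sum(WEIGHTS[k] * f[k] for k in WEIGHTS) for f in feats]
    keep = max(1, round(len(sentences) * compression_rate))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, title, compression_rate=0.5)` on a list of sentence strings returns roughly half of them, ranked by the weighted feature sum and restored to their original order.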

134 A Framework for Urdu Language Translation using LESSA

Authors: Imran Sarwar Bajwa

Abstract:

The Internet is one of the major sources of information for people in almost all fields of life. The major language used to publish information on the Internet is English. This becomes a problem in a country like Pakistan, where Urdu is the national language. Only 10% of Pakistan’s population can understand English. As a result, millions of people are deprived of the valuable information available on the Internet. This paper presents a system for translation from English to Urdu. A module, LESSA, uses a rule-based algorithm to read the input text in English, understand it and translate it into Urdu. The designed approach was further incorporated to translate a complete website from English to Urdu. An option appears in the browser to translate the webpage in a new window. The designed system will help millions of Internet users to benefit from the Internet and to access the latest information and knowledge posted daily on it.

Keywords: Natural Language Translation, Text Understanding, Knowledge extraction, Text Processing

133 TOSOM: A Topic-Oriented Self-Organizing Map for Text Organization

Authors: Hsin-Chang Yang, Chung-Hong Lee, Kuo-Lung Ke

Abstract:

The self-organizing map (SOM) model is a well-known neural network model with a wide range of applications. The main characteristics of the SOM are two-fold, namely dimension reduction and topology preservation. Using a SOM, a high-dimensional data space is mapped to a low-dimensional space while the topological relations among the data are preserved. With such characteristics, the SOM has usually been applied to data clustering and visualization tasks. However, the SOM has the main disadvantage that the number and structure of neurons must be known prior to training, and these are difficult to determine. Several schemes have been proposed to tackle this deficiency. Examples are the growing/expandable SOM, the hierarchical SOM, and the growing hierarchical SOM. These schemes can dynamically expand the map, and even generate hierarchical maps, during training. Encouraging results were reported. Basically, these schemes adapt the size and structure of the map according to the distribution of the training data. That is, they are data-driven or data-oriented SOM schemes. In this work, a topic-oriented SOM scheme suitable for document clustering and organization is developed. The proposed SOM automatically adapts the number as well as the structure of the map according to identified topics. Unlike other data-oriented SOMs, our approach expands the map and generates the hierarchies according to both the topics and the characteristics of the neurons. The preliminary experiments give promising results and demonstrate the plausibility of the method.

Keywords: Self-organizing map, topic identification, learning algorithm, text clustering.
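
For readers unfamiliar with the baseline model, the following is a minimal Python sketch of a standard fixed-size SOM, the variant whose map size must be chosen before training. It is not the topic-oriented TOSOM proposed in the abstract. The function names, the linear decay schedules and the Gaussian neighbourhood are assumptions made for illustration.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((wk - xk) ** 2 for wk, xk in zip(weights[node], x)),
    )

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # The map size is fixed up front: one weight vector per grid node.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks, with a floor
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Squared distance on the grid, not in the data space.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))
                # Pull the node towards x, more strongly near the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights
```

Updating grid neighbours of the best matching unit together is what preserves topology: nodes that are close on the grid end up with similar weight vectors. The `rows` and `cols` arguments are exactly the prior knowledge that growing and topic-oriented schemes try to avoid requiring.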

132 Growing Self Organising Map Based Exploratory Analysis of Text Data

Authors: Sumith Matharage, Damminda Alahakoon

Abstract:

Textual data plays an important role in the modern world. The possibilities of applying data mining techniques to uncover hidden information present in large text collections are immense. The Growing Self Organizing Map (GSOM) is a highly successful member of the Self Organising Map family and has been used as a clustering and visualisation tool across a wide range of disciplines to discover hidden patterns present in the data. A comprehensive analysis of the GSOM’s capabilities as a text clustering and visualisation tool has so far not been published. These capabilities, namely map visualisation, automatic cluster identification and hierarchical clustering, are presented in this paper and are further demonstrated with experiments on a benchmark text corpus.

Keywords: Text Clustering, Growing Self Organizing Map, Automatic Cluster Identification, Hierarchical Clustering.

131 Optimal Classifying and Extracting Fuzzy Relationship from Query Using Text Mining Techniques

Authors: Faisal Alshuwaier, Ali Areshey

Abstract:

Text mining techniques are generally applied for classifying text and for finding fuzzy relations and structures in data sets. This research provides plenty of text mining capabilities. One common application is text classification and event extraction, which encompasses deducing specific knowledge concerning incidents referred to in texts. The main contribution of this paper is the clarification of a concept graph generation mechanism, which is based on text classification and optimal fuzzy relationship extraction. Furthermore, the work presented in this paper explains the application of fuzzy relationship extraction and the branch and bound (BB) method to simplify the texts.

Keywords: Extraction, Max-Prod, Fuzzy Relations, Text Mining, Memberships, Classification.
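
The keywords mention Max-Prod. Assuming this refers to the standard max-product composition of fuzzy relations (the abstract itself does not say how it is used), a minimal Python sketch of that operation is given below. The function name and the nested-list representation of membership values are assumptions made for illustration.

```python
def max_prod_compose(r, s):
    """Max-product composition of fuzzy relations R (X x Y) and S (Y x Z).

    Each relation is a nested list of membership degrees in [0, 1].
    The result T (X x Z) has T[i][k] = max over j of R[i][j] * S[j][k].
    """
    if not r or not s or any(len(row) != len(s) for row in r):
        raise ValueError("inner dimensions of R and S must match")
    n_y = len(s)
    n_z = len(s[0])
    return [
        [max(row[j] * s[j][k] for j in range(n_y)) for k in range(n_z)]
        for row in r
    ]
```

For example, if `r` holds the degree to which each query term relates to each concept and `s` the degree to which each concept relates to each document class, `max_prod_compose(r, s)` gives, for each term and class, the strength of the strongest path through any single concept.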

130 Application of Smooth Ergodic Hidden Markov Model in Text to Speech Systems

Authors: Armin Ghayoori, Faramarz Hendessi, Asrar Sheikh

Abstract:

In developing a text-to-speech system, it is well known that the accuracy of the information extracted from a text is crucial to producing high-quality synthesized speech. In this paper, a new scheme for converting text into its equivalent phonetic spelling is introduced and developed. This method is applicable to many text-to-speech conversion systems and has many advantages over other methods. The proposed method can also complement other methods in order to improve their performance. The proposed method is a probabilistic model based on the Smooth Ergodic Hidden Markov Model, which can be considered an extension of the HMM. The proposed method is applied to the Persian language, and its accuracy in converting text to speech phonetics is evaluated using simulations.

Keywords: Hidden Markov Models, text, synthesis.
