Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 24

Information Retrieval Related Abstracts

24 Identification of Coauthors in Scientific Database

Authors: Gray F. Moita, Thiago M. R Dias

Abstract:

The analysis of scientific collaboration networks has contributed significantly to improving the understanding of how does the process of collaboration between researchers and also to understand how the evolution of scientific production of researchers or research groups occurs. However, the identification of collaborations in large scientific databases is not a trivial task given the high computational cost of the methods commonly used. This paper proposes a method for identifying collaboration in large data base of curriculum researchers. The proposed method has low computational cost with satisfactory results, proving to be an interesting alternative for the modeling and characterization of large scientific collaboration networks.

Keywords: Information Retrieval, Data Integration, Extraction, scientific collaboration

Procedia PDF Downloads 220
23 Empirical Study on Factors Influencing SEO

Authors: Pakinee Aimmanee, Phoom Chokratsamesiri

Abstract:

Search engine has become an essential tool nowadays for people to search for their needed information on the internet. In this work, we evaluate the performance of the search engine from three factors: the keyword frequency, the number of inbound links, and the difficulty of the keyword. The evaluations are based on the ranking position and the number of days that Google has seen or detect the webpage. We find that the keyword frequency and the difficulty of the keyword do not affect the Google ranking where the number of inbound links gives remarkable improvement of the ranking position. The optimal number of inbound links found in the experiment is 10.

Keywords: Information Retrieval, Web Search, Knowledge Technologies, SEO

Procedia PDF Downloads 153
22 Evaluating Alternative Structures for Prefix Trees

Authors: Izzat Alsmadi, Feras Hanandeh, Muhammad M. Kwafha

Abstract:

Prefix trees or tries are data structures that are used to store data or index of data. The goal is to be able to store and retrieve data by executing queries in quick and reliable manners. In principle, the structure of the trie depends on having letters in nodes at the different levels to point to the actual words in the leafs. However, the exact structure of the trie may vary based on several aspects. In this paper, we evaluated different structures for building tries. Using datasets of words of different sizes, we evaluated the different forms of trie structures. Results showed that some characteristics may impact significantly, positively or negatively, the size and the performance of the trie. We investigated different forms and structures for the trie. Results showed that using an array of pointers in each level to represent the different alphabet letters is the best choice.

Keywords: Information Retrieval, Indexing, data structures, tree structure, trie

Procedia PDF Downloads 300
21 Investigating the Encouraging Factors for Scholarly Works Contribution towards Institutional Repository: A Case Study at a Malaysian University

Authors: Mohd Rashid bin Ab Hamid, Noor Azura binti Omar, Zainol Bin Mustafa

Abstract:

Purpose: The aim of this paper is to study the encouraging factors for scholarly works contribution towards among academicians at Malaysian university. Methods: This paper uses questionnaire for data collection on the respondents’ perceptional level on the institutional repository efforts in one of the university under study. Several encouraging factors have been identified and to be measured using descriptive statistics. The factors are related to content contribution, i.e. personal factor, professional factor, organizational factor and technological factor. Findings: The study found that all these four encouraging factors did have a relation to the contribution of scholarly works in the university by the academician. Research Limitations: This study used a case study and generalization to all Malaysian universities should be well taken care of. Practical implications: The library at the university should look into these four encouraging factors in order to enhance the contribution from academician towards the repository. Originality/value: This research paper provides basic information for the knowledge management officers in the university by endeavouring more efforts in order to attract more contributions.

Keywords: Information Retrieval, institutional repository, information storage and retrieval

Procedia PDF Downloads 431
20 Smartphones: Tools for Enhancing Teaching in Nigeria’s Higher Institutions

Authors: Ma'amun Muhammed

Abstract:

The ability of smartphones in enhancing communication, providing access to business and serving as a pool for information retrieval has a far reaching and potentially beneficial impacts on enhancing teaching in higher institutions in the developing countries like Nigeria. Nigeria as one of the fast growing economies in Africa, whose citizens patronize smartphones can utilize this opportunity by inculcating the culture of using smartphones not only for communication, business transaction, banking etc. but also for enhancing teaching in the higher institutions. Smartphones have become part and parcel of our lives, particularly among young people. The primary objective of this paper is to ascertain the use of smartphones in enhancing teaching in Nigeria’s higher institutions, to achieve this, content analysis was used thoroughly. This paper examines the opportunities offered by smartphones to the students of higher institutions of learning, the challenges being faced by lecturers of these institutions in classrooms. Lastly, it offers solution on how some of these critical challenges will be overcame, so as to utilize the technology of these devices.

Keywords: Communication, Information Retrieval, mobile phone, smartphones teaching

Procedia PDF Downloads 281
19 Role of Natural Language Processing in Information Retrieval; Challenges and Opportunities

Authors: Khaled M. Alhawiti

Abstract:

This paper aims to analyze the role of natural language processing (NLP). The paper will discuss the role in the context of automated data retrieval, automated question answer, and text structuring. NLP techniques are gaining wider acceptance in real life applications and industrial concerns. There are various complexities involved in processing the text of natural language that could satisfy the need of decision makers. This paper begins with the description of the qualities of NLP practices. The paper then focuses on the challenges in natural language processing. The paper also discusses major techniques of NLP. The last section describes opportunities and challenges for future research.

Keywords: Information Retrieval, natural language processing, data retrieval, text structuring

Procedia PDF Downloads 177
18 Personalization of Context Information Retrieval Model via User Search Behaviours for Ranking Document Relevance

Authors: Kehinde Agbele, Longe Olumide, Daniel Ekong, Dele Seluwa, Akintoye Onamade

Abstract:

One major problem of most existing information retrieval systems (IRS) is that they provide even access and retrieval results to individual users specially based on the query terms user issued to the system. When using IRS, users often present search queries made of ad-hoc keywords. It is then up to IRS to obtain a precise representation of user’s information need, and the context of the information. In effect, the volume and range of the Internet documents is growing exponentially and consequently causes difficulties for a user to obtain information that precisely matches the user interest. Diverse combination techniques are used to achieve the specific goal. This is due, firstly, to the fact that users often do not present queries to IRS that optimally represent the information they want, and secondly, the measure of a document's relevance is highly subjective between diverse users. In this paper, we address the problem by investigating the optimization of IRS to individual information needs in order of relevance. The paper addressed the development of algorithms that optimize the ranking of documents retrieved from IRS. This paper addresses this problem with a two-fold approach in order to retrieve domain-specific documents. Firstly, the design of context of information. The context of a query determines retrieved information relevance using personalization and context-awareness. Thus, executing the same query in diverse contexts often leads to diverse result rankings based on the user preferences. Secondly, the relevant context aspects should be incorporated in a way that supports the knowledge domain representing users’ interests. In this paper, the use of evolutionary algorithms is incorporated to improve the effectiveness of IRS. A context-based information retrieval system that learns individual needs from user-provided relevance feedback is developed whose retrieval effectiveness is evaluated using precision and recall metrics. The results demonstrate how to use attributes from user interaction behavior to improve the IR effectiveness.

Keywords: Information Retrieval, context, document relevance, personalization, user search behaviors

Procedia PDF Downloads 339
17 Natural Language Processing; the Future of Clinical Record Management

Authors: Khaled M. Alhawiti

Abstract:

This paper investigates the future of medicine and the use of Natural language processing. The importance of having correct clinical information available online is remarkable; improving patient care at affordable costs could be achieved using automated applications to use the online clinical information. The major challenge towards the retrieval of such vital information is to have it appropriately coded. Majority of the online patient reports are not found to be coded and not accessible as its recorded in natural language text. The use of Natural Language processing provides a feasible solution by retrieving and organizing clinical information, available in text and transforming clinical data that is available for use. Systems used in NLP are rather complex to construct, as they entail considerable knowledge, however significant development has been made. Newly formed NLP systems have been tested and have established performance that is promising and considered as practical clinical applications.

Keywords: Information Retrieval, natural language processing, clinical information, automated applications

Procedia PDF Downloads 232
16 Business Domain Modelling Using an Integrated Framework

Authors: Mohammed Hasan Salahat, Stave Wade

Abstract:

This paper presents an application of a “Systematic Soft Domain Driven Design Framework” as a soft systems approach to domain-driven design of information systems development. The framework combining techniques from Soft Systems Methodology (SSM), the Unified Modeling Language (UML), and an implementation pattern knows as ‘Naked Objects’. This framework have been used in action research projects that have involved the investigation and modeling of business processes using object-oriented domain models and the implementation of software systems based on those domain models. Within this framework, Soft Systems Methodology (SSM) is used as a guiding methodology to explore the problem situation and to develop the domain model using UML for the given business domain. The framework is proposed and evaluated in our previous works, and a real case study ‘Information Retrieval System for Academic Research’ is used, in this paper, to show further practice and evaluation of the framework in different business domain. We argue that there are advantages from combining and using techniques from different methodologies in this way for business domain modeling. The framework is overviewed and justified as multi-methodology using Mingers Multi-Methodology ideas.

Keywords: Information Retrieval, UML, SSM, domain-driven design, soft domain-driven design, naked objects, soft language, multimethodology

Procedia PDF Downloads 335
15 Nearest Neighbor Investigate Using R+ Tree

Authors: Rutuja Desai

Abstract:

Search engine is fundamentally a framework used to search the data which is pertinent to the client via WWW. Looking close-by spot identified with the keywords is an imperative concept in developing web advances. For such kind of searching, extent pursuit or closest neighbor is utilized. In range search the forecast is made whether the objects meet to query object. Nearest neighbor is the forecast of the focuses close to the query set by the client. Here, the nearest neighbor methodology is utilized where Data recovery R+ tree is utilized rather than IR2 tree. The disadvantages of IR2 tree is: The false hit number can surpass the limit and the mark in Information Retrieval R-tree must have Voice over IP bit for each one of a kind word in W set is recouped by Data recovery R+ tree. The inquiry is fundamentally subordinate upon the key words and the geometric directions.

Keywords: Information Retrieval, nearest neighbor search, keyword search, R+ tree

Procedia PDF Downloads 147
14 Enhanced Arabic Semantic Information Retrieval System Based on Arabic Text Classification

Authors: T. Nazmy, A. Elsehemy, M. Abdeen

Abstract:

Since the appearance of the Semantic web, many semantic search techniques and models were proposed to exploit the information in ontology to enhance the traditional keyword-based search. Many advances were made in languages such as English, German, French and Spanish. However, other languages such as Arabic are not fully supported yet. In this paper we present a framework for ontology based information retrieval for Arabic language. Our system consists of four main modules, namely query parser, indexer, search and a ranking module. Our approach includes building a semantic index by linking ontology concepts to documents, including an annotation weight for each link, to be used in ranking the results. We also augmented the framework with an automatic document categorizer, which enhances the overall document ranking. We have built three Arabic domain ontologies: Sports, Economic and Politics as example for the Arabic language. We built a knowledge base that consists of 79 classes and more than 1456 instances. The system is evaluated using the precision and recall metrics. We have done many retrieval operations on a sample of 40,316 documents with a size 320 MB of pure text. The results show that the semantic search enhanced with text classification gives better performance results than the system without classification.

Keywords: Information Retrieval, Arabic text classification, ontology based retrieval, Arabic semantic web, Arabic ontology

Procedia PDF Downloads 373
13 Merging of Results in Distributed Information Retrieval Systems

Authors: Larbi Guezouli, Imane Azzouz

Abstract:

This work is located in the domain of distributed information retrieval ‘DIR’. A simplified view of the DIR requires a multi-search in a set of collections, which forces the system to analyze results found in these collections, and merge results back before sending them to the user in a single list. Our work is to find a fusion method based on the relevance score of each result received from collections and the relevance of the local search engine of each collection.

Keywords: Information Retrieval, Datamining, distributed IR systems, merging results

Procedia PDF Downloads 115
12 AINA: Disney Animation Information as Educational Resources

Authors: Piedad Garrido, Julio A. Sanguesa, Fernando Repulles, Andy Bloor, Jesus Gallardo, Vicente Torres, Jesus Tramullas

Abstract:

With the emergence and development of Information and Communications Technologies (ICTs), Higher Education is experiencing rapid changes, not only in its teaching strategies but also in student’s learning skills. However, we have noticed that students often have difficulty when seeking innovative, useful, and interesting learning resources for their work. This is due to the lack of supervision in the selection of good query tools. This paper presents AINA, an Information Retrieval (IR) computer system aimed at providing motivating and stimulating content to both students and teachers working on different areas and at different educational levels. In particular, our proposal consists of an open virtual resource environment oriented to the vast universe of Disney comics and cartoons. Our test suite includes Disney’s long and shorts films, and we have performed some activities based on the Just In Time Teaching (JiTT) methodology. More specifically, it has been tested by groups of university and secondary school students.

Keywords: Animation, Information Retrieval, Educational Resources, JiTT

Procedia PDF Downloads 99
11 Biomedical Definition Extraction Using Machine Learning with Synonymous Feature

Authors: Jian Qu, Akira Shimazu

Abstract:

OOV (Out Of Vocabulary) terms are terms that cannot be found in many dictionaries. Although it is possible to translate such OOV terms, the translations do not provide any real information for a user. We present an OOV term definition extraction method by using information available from the Internet. We use features such as occurrence of the synonyms and location distances. We apply machine learning method to find the correct definitions for OOV terms. We tested our method on both biomedical type and name type OOV terms, our work outperforms existing work with an accuracy of 86.5%.

Keywords: Information Retrieval, definition retrieval, OOV (out of vocabulary), biomedical information retrieval

Procedia PDF Downloads 361
10 Algorithm for Information Retrieval Optimization

Authors: Kehinde K. Agbele, Kehinde Daniel Aruleba, Eniafe F. Ayetiran

Abstract:

When using Information Retrieval Systems (IRS), users often present search queries made of ad-hoc keywords. It is then up to the IRS to obtain a precise representation of the user’s information need and the context of the information. This paper investigates optimization of IRS to individual information needs in order of relevance. The study addressed development of algorithms that optimize the ranking of documents retrieved from IRS. This study discusses and describes a Document Ranking Optimization (DROPT) algorithm for information retrieval (IR) in an Internet-based or designated databases environment. Conversely, as the volume of information available online and in designated databases is growing continuously, ranking algorithms can play a major role in the context of search results. In this paper, a DROPT technique for documents retrieved from a corpus is developed with respect to document index keywords and the query vectors. This is based on calculating the weight (

Keywords: Information Retrieval, performance measures, document relevance, personalization

Procedia PDF Downloads 100
9 Morphological Analysis of Manipuri Language: Wahei-Neinarol

Authors: Y. Bablu Singh, B. S. Purkayashtha, Chungkham Yashawanta Singh

Abstract:

Morphological analysis forms the basic foundation in NLP applications including syntax parsing Machine Translation (MT), Information Retrieval (IR) and automatic indexing in all languages. It is the field of the linguistics; it can provide valuable information for computer based linguistics task such as lemmatization and studies of internal structure of the words. Computational Morphology is the application of morphological rules in the field of computational linguistics, and it is the emerging area in AI, which studies the structure of words, which are formed by combining smaller units of linguistics information, called morphemes: the building blocks of words. Morphological analysis provides about semantic and syntactic role in a sentence. It analyzes the Manipuri word forms and produces several grammatical information associated with the words. The Morphological Analyzer for Manipuri has been tested on 3500 Manipuri words in Shakti Standard format (SSF) using Meitei Mayek as source; thereby an accuracy of 80% has been obtained on a manual check.

Keywords: Information Retrieval, Machine Translation, Morphological Analysis, SSF, computational morphology

Procedia PDF Downloads 165
8 Searching Linguistic Synonyms through Parts of Speech Tagging

Authors: Usman Qamar, Faiza Hussain

Abstract:

Synonym-based searching is recognized to be a complicated problem as text mining from unstructured data of web is challenging. Finding useful information which matches user need from bulk of web pages is a cumbersome task. In this paper, a novel and practical synonym retrieval technique is proposed for addressing this problem. For replacement of semantics, user intent is taken into consideration to realize the technique. Parts-of-Speech tagging is applied for pattern generation of the query and a thesaurus for this experiment was formed and used. Comparison with Non-Context Based Searching, Context Based searching proved to be a more efficient approach while dealing with linguistic semantics. This approach is very beneficial in doing intent based searching. Finally, results and future dimensions are presented.

Keywords: Semantics, Information Retrieval, Text Mining, natural language processing, Grammar, parts-of-speech tagging

Procedia PDF Downloads 155
7 Improving Research by the Integration of a Collaborative Dimension in an Information Retrieval (IR) System

Authors: Amel Hannech, Mehdi Adda, Hamid Mcheick

Abstract:

In computer science, the purpose of finding useful information is still one of the most active and important research topics. The most popular application of information retrieval (IR) are Search Engines, they meet users' specific needs and aim to locate the effective information in the web. However, these search engines have some limitations related to the relevancy of the results and the ease to explore those results. In this context, we proposed in previous works a Multi-Space Search Engine model that is based on a multidimensional interpretation universe. In the present paper, we integrate an additional dimension that allows to offer users new research experiences. The added component is based on creating user profiles and calculating the similarity between them that then allow the use of collaborative filtering in retrieving search results. To evaluate the effectiveness of the proposed model, a prototype is developed. The experiments showed that the additional dimension has improved the relevancy of results by predicting the interesting items of users based on their experiences and the experiences of other similar users. The offered personalization service allows users to approve the pertinent items, which allows to enrich their profiles and further improve research.

Keywords: Information Retrieval, User Behavior Analysis, Association Rules, user profiles, topical ontology, v-facets, data personalization

Procedia PDF Downloads 150
6 Comparison of Crossover Types to Obtain Optimal Queries Using Adaptive Genetic Algorithm

Authors: Khaled Almakadmeh, Wafa’ Alma'Aitah

Abstract:

this study presents an information retrieval system of using genetic algorithm to increase information retrieval efficiency. Using vector space model, information retrieval is based on the similarity measurement between query and documents. Documents with high similarity to query are judge more relevant to the query and should be retrieved first. Using genetic algorithms, each query is represented by a chromosome; these chromosomes are fed into genetic operator process: selection, crossover, and mutation until an optimized query chromosome is obtained for document retrieval. Results show that information retrieval with adaptive crossover probability and single point type crossover and roulette wheel as selection type give the highest recall. The proposed approach is verified using (242) proceedings abstracts collected from the Saudi Arabian national conference.

Keywords: Information Retrieval, Genetic Algorithm, crossover, optimal queries

Procedia PDF Downloads 135
5 Q-Map: Clinical Concept Mining from Clinical Documents

Authors: Sheikh Shams Azam, Manoj Raju, Venkatesh Pagidimarri, Vamsi Kasivajjala

Abstract:

Over the past decade, there has been a steep rise in the data-driven analysis in major areas of medicine, such as clinical decision support system, survival analysis, patient similarity analysis, image analytics etc. Most of the data in the field are well-structured and available in numerical or categorical formats which can be used for experiments directly. But on the opposite end of the spectrum, there exists a wide expanse of data that is intractable for direct analysis owing to its unstructured nature which can be found in the form of discharge summaries, clinical notes, procedural notes which are in human written narrative format and neither have any relational model nor any standard grammatical structure. An important step in the utilization of these texts for such studies is to transform and process the data to retrieve structured information from the haystack of irrelevant data using information retrieval and data mining techniques. To address this problem, the authors present Q-Map in this paper, which is a simple yet robust system that can sift through massive datasets with unregulated formats to retrieve structured information aggressively and efficiently. It is backed by an effective mining technique which is based on a string matching algorithm that is indexed on curated knowledge sources, that is both fast and configurable. The authors also briefly examine its comparative performance with MetaMap, one of the most reputed tools for medical concepts retrieval and present the advantages the former displays over the latter.

Keywords: Medical informatics, Information Retrieval, natural language processing, unified medical language system, syntax based analysis

Procedia PDF Downloads 22
4 Selecting Answers for Questions with Multiple Answer Choices in Arabic Question Answering Based on Textual Entailment Recognition

Authors: Anes Enakoa, Yawei Liang

Abstract:

Question Answering (QA) system is one of the most important and demanding tasks in the field of Natural Language Processing (NLP). In QA systems, the answer generation task generates a list of candidate answers to the user's question, in which only one answer is correct. Answer selection is one of the main components of the QA, which is concerned with selecting the best answer choice from the candidate answers suggested by the system. However, the selection process can be very challenging especially in Arabic due to its particularities. To address this challenge, an approach is proposed to answer questions with multiple answer choices for Arabic QA systems based on Textual Entailment (TE) recognition. The developed approach employs a Support Vector Machine that considers lexical, semantic and syntactic features in order to recognize the entailment between the generated hypotheses (H) and the text (T). A set of experiments has been conducted for performance evaluation and the overall performance of the proposed method reached an accuracy of 67.5% with [email protected] score of 80.46%. The obtained results are promising and demonstrate that the proposed method is effective for TE recognition task.

Keywords: Machine Learning, Information Retrieval, natural language processing, Question answering, textual entailment

Procedia PDF Downloads 11
3 Machine Learning Strategies for Data Extraction from Unstructured Documents in Financial Services

Authors: Delphine Vendryes, Dushyanth Sekhar, Baojia Tong, Matthew Theisen, Chester Curme

Abstract:

Much of the data that inform the decisions of governments, corporations and individuals are harvested from unstructured documents. Data extraction is defined here as a process that turns non-machine-readable information into a machine-readable format that can be stored, for instance, in a database. In financial services, introducing more automation in data extraction pipelines is a major challenge. Information sought by financial data consumers is often buried within vast bodies of unstructured documents, which have historically required thorough manual extraction. Automated solutions provide faster access to non-machine-readable datasets, in a context where untimely information quickly becomes irrelevant. Data quality standards cannot be compromised, so automation requires high data integrity. This multifaceted task is broken down into smaller steps: ingestion, table parsing (detection and structure recognition), text analysis (entity detection and disambiguation), schema-based record extraction, user feedback incorporation. Selected intermediary steps are phrased as machine learning problems. Solutions leveraging cutting-edge approaches from the fields of computer vision (e.g. table detection) and natural language processing (e.g. entity detection and disambiguation) are proposed.

Keywords: Computer Vision, Finance, Machine Learning, Information Retrieval, natural language processing, Entity Recognition

Procedia PDF Downloads 1
2 Mondoc: Informal Lightweight Ontology for Faceted Semantic Classification of Hypernymy

Authors: M. Regina Carreira-Lopez

Abstract:

Lightweight ontologies seek to concrete union relationships between a parent node, and a secondary node, also called "child node". This logic relation (L) can be formally defined as a triple ontological relation (LO) equivalent to LO in ⟨LN, LE, LC⟩, and where LN represents a finite set of nodes (N); LE is a set of entities (E), each of which represents a relationship between nodes to form a rooted tree of ⟨LN, LE⟩; and LC is a finite set of concepts (C), encoded in a formal language (FL). Mondoc enables more refined searches on semantic and classified facets for retrieving specialized knowledge about Atlantic migrations, from the Declaration of Independence of the United States of America (1776) and to the end of the Spanish Civil War (1939). The model looks forward to increasing documentary relevance by applying an inverse frequency of co-ocurrent hypernymy phenomena for a concrete dataset of textual corpora, with RMySQL package. Mondoc profiles archival utilities implementing SQL programming code, and allows data export to XML schemas, for achieving semantic and faceted analysis of speech by analyzing keywords in context (KWIC). The methodology applies random and unrestricted sampling techniques with RMySQL to verify the resonance phenomena of inverse documentary relevance between the number of co-occurrences of the same term (t) in more than two documents of a set of texts (D). Secondly, the research also evidences co-associations between (t) and their corresponding synonyms and antonyms (synsets) are also inverse. The results from grouping facets or polysemic words with synsets in more than two textual corpora within their syntagmatic context (nouns, verbs, adjectives, etc.) state how to proceed with semantic indexing of hypernymy phenomena for subject-heading lists and for authority lists for documentary and archival purposes. Mondoc contributes to the development of web directories and seems to achieve a proper and more selective search of e-documents (classification ontology). It can also foster on-line catalogs production for semantic authorities, or concepts, through XML schemas, because its applications could be used for implementing data models, by a prior adaptation of the based-ontology to structured meta-languages, such as OWL, RDF (descriptive ontology). Mondoc serves to the classification of concepts and applies a semantic indexing approach of facets. It enables information retrieval, as well as quantitative and qualitative data interpretation. The model reproduces a triple tuple ⟨LN, LE, LT, LCF L, BKF⟩ where LN is a set of entities that connect with other nodes to concrete a rooted tree in ⟨LN, LE⟩. LT specifies a set of terms, and LCF acts as a finite set of concepts, encoded in a formal language, L. Mondoc only resolves partial problems of linguistic ambiguity (in case of synonymy and antonymy), but neither the pragmatic dimension of natural language nor the cognitive perspective is addressed. To achieve this goal, forthcoming programming developments should target at oriented meta-languages with structured documents in XML.

Keywords: Information Retrieval, Resonance, hypernymy, lightweight ontology

Procedia PDF Downloads 1
1 Visual Template Detection and Compositional Automatic Regular Expression Generation for Business Invoice Extraction

Authors: Deepak Mishra, Anthony Proschka, Merlyn Ramanan, Zurab Baratashvili

Abstract:

Small and medium-sized businesses receive over 160 billion invoices every year. Since these documents exhibit many subtle differences in layout and text, extracting structured fields such as sender name, amount, and VAT rate from them automatically is an open research question. In this paper, existing work in template-based document extraction is extended, and a system is devised that is able to reliably extract all required fields for up to 70% of all documents in the data set, more than any other previously reported method. The approaches are described for 1) detecting through visual features which template a given document belongs to, 2) automatically generating extraction rules for a given new template by composing regular expressions from multiple components, and 3) computing confidence scores that indicate the accuracy of the automatic extractions. The system can generate templates with as little as one training sample and only requires the ground truth field values instead of detailed annotations such as bounding boxes that are hard to obtain. The system is deployed and used inside a commercial accounting software.

Keywords: Business, Data Mining, Information Retrieval, Feature Extraction, Business Data Processing, layout, document handling, end-user trained information extraction, document archiving, scanned business documents, automated document processing, F1-measure, commercial accounting software

Procedia PDF Downloads 1