Search results for: semi-structured documents
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 900

Search results for: semi-structured documents

900 Words Spotting in the Images Handwritten Historical Documents

Authors: Issam Ben Jami

Abstract:

Information retrieval in digital libraries is very important because most famous historical documents occupy a significant value. The word spotting in historical documents is a very difficult notion, because automatic recognition of such documents is naturally cursive, it represents a wide variability in the level scale and translation words in the same documents. We first present a system for the automatic recognition, based on the extraction of interest points words from the image model. The extraction phase of the key points is chosen from the representation of the image as a synthetic description of the shape recognition in a multidimensional space. As a result, we use advanced methods that can find and describe interesting points invariant to scale, rotation and lighting which are linked to local configurations of pixels. We test this approach on documents of the 15th century. Our experiments give important results.

Keywords: feature matching, historical documents, pattern recognition, word spotting

Procedia PDF Downloads 236
899 Data Gathering and Analysis for Arabic Historical Documents

Authors: Ali Dulla

Abstract:

This paper introduces a new dataset (and the methodology used to generate it) based on a wide range of historical Arabic documents containing clean data simple and homogeneous-page layouts. The experiments are implemented on printed and handwritten documents obtained respectively from some important libraries such as Qatar Digital Library, the British Library and the Library of Congress. We have gathered and commented on 150 archival document images from different locations and time periods. It is based on different documents from the 17th-19th century. The dataset comprises differing page layouts and degradations that challenge text line segmentation methods. Ground truth is produced using the Aletheia tool by PRImA and stored in an XML representation, in the PAGE (Page Analysis and Ground truth Elements) format. The dataset presented will be easily available to researchers world-wide for research into the obstacles facing various historical Arabic documents such as geometric correction of historical Arabic documents.

Keywords: dataset production, ground truth production, historical documents, arbitrary warping, geometric correction

Procedia PDF Downloads 141
898 Finding Related Scientific Documents Using Formal Concept Analysis

Authors: Nadeem Akhtar, Hira Javed

Abstract:

An important aspect of research is literature survey. Availability of a large amount of literature across different domains triggers the need for optimized systems which provide relevant literature to researchers. We propose a search system based on keywords for text documents. This experimental approach provides a hierarchical structure to the document corpus. The documents are labelled with keywords using KEA (Keyword Extraction Algorithm) and are automatically organized in a lattice structure using Formal Concept Analysis (FCA). This groups the semantically related documents together. The hierarchical structure, based on keywords gives out only those documents which precisely contain them. This approach open doors for multi-domain research. The documents across multiple domains which are indexed by similar keywords are grouped together. A hierarchical relationship between keywords is obtained. To signify the effectiveness of the approach, we have carried out the experiment and evaluation on Semeval-2010 Dataset. Results depict that the presented method is considerably successful in indexing of scientific papers.

Keywords: formal concept analysis, keyword extraction algorithm, scientific documents, lattice

Procedia PDF Downloads 300
897 Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition

Authors: L. Hamsaveni, Navya Prakash, Suresha

Abstract:

Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document images to obtain an original document with complete information. In case, degraded document image captured is skewed, it has to be straightened (deskew) to perform further process. A special format of image storing known as YCbCr is used as a tool to convert the Grayscale image to RGB image format. The presented algorithm is tested on various types of degraded documents such as printed documents, handwritten documents, old script documents and handwritten image sketches in documents. The purpose of this research is to obtain an original document for a given set of degraded documents of the same source.

Keywords: grayscale image format, image fusing, RGB image format, SURF detection, YCbCr image format

Procedia PDF Downloads 341
896 The Platform for Digitization of Georgian Documents

Authors: Erekle Magradze, Davit Soselia, Levan Shughliashvili, Irakli Koberidze, Shota Tsiskaridze, Victor Kakhniashvili, Tamar Chaghiashvili

Abstract:

Since the beginning of active publishing activity in Georgia, voluminous printed material has been accumulated, the digitization of which is an important task. Digitized materials will be available to the audience, and it will be possible to find text in them and conduct various factual research. Digitizing scanned documents means scanning documents, extracting text from the scanned documents, and processing the text into a corresponding language model to detect inaccuracies and grammatical errors. Implementing these stages requires a unified, scalable, and automated platform, where the digital service developed for each stage will perform the task assigned to it; at the same time, it will be possible to develop these services dynamically so that there is no interruption in the work of the platform.

Keywords: NLP, OCR, BERT, Kubernetes, transformers

Procedia PDF Downloads 113
895 A Similarity Measure for Classification and Clustering in Image Based Medical and Text Based Banking Applications

Authors: K. P. Sandesh, M. H. Suman

Abstract:

Text processing plays an important role in information retrieval, data-mining, and web search. Measuring the similarity between the documents is an important operation in the text processing field. In this project, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature the proposed measure takes the following three cases into account: (1) The feature appears in both documents; (2) The feature appears in only one document and; (3) The feature appears in none of the documents. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, especially in banking and health sectors. The results show that the performance obtained by the proposed measure is better than that achieved by the other measures.

Keywords: document classification, document clustering, entropy, accuracy, classifiers, clustering algorithms

Procedia PDF Downloads 478
894 System of Quality Automation for Documents (SQAD)

Authors: R. Babi Saraswathi, K. Divya, A. Habeebur Rahman, D. B. Hari Prakash, S. Jayanth, T. Kumar, N. Vijayarangan

Abstract:

Document automation is the design of systems and workflows, assembling repetitive documents to meet the specific business needs. In any organization or institution, documenting employee’s information is very important for both employees as well as management. It shows an individual’s progress to the management. Many documents of the employee are in the form of papers, so it is very difficult to arrange and for future reference we need to spend more time in getting the exact document. Also, it is very tedious to generate reports according to our needs. The process gets even more difficult on getting approvals and hence lacks its security aspects. This project overcomes the above-stated issues. By storing the details in the database and maintaining the e-documents, the automation system reduces the manual work to a large extent. Then the approval process of some important documents can be done in a much-secured manner by using Digital Signature and encryption techniques. Details are maintained in the database and e-documents are stored in specific folders and generation of various kinds of reports is possible. Moreover, an efficient search method is implemented is used in the database. Automation supporting document maintenance in many aspects is useful for minimize data entry, reduce the time spent on proof-reading, avoids duplication, and reduce the risks associated with the manual error, etc.

Keywords: e-documents, automation, digital signature, encryption

Procedia PDF Downloads 358
893 Enhancement of Indexing Model for Heterogeneous Multimedia Documents: User Profile Based Approach

Authors: Aicha Aggoune, Abdelkrim Bouramoul, Mohamed Khiereddine Kholladi

Abstract:

Recent research shows that user profile as important element can improve heterogeneous information retrieval with its content. In this context, we present our indexing model for heterogeneous multimedia documents. This model is based on the combination of user profile to the indexing process. The general idea of our proposal is to operate the common concepts between the representation of a document and the definition of a user through his profile. These two elements will be added as additional indexing entities to enrich the heterogeneous corpus documents indexes. We have developed IRONTO domain ontology allowing annotation of documents. We will present also the developed tool validating the proposed model.

Keywords: indexing model, user profile, multimedia document, heterogeneous of sources, ontology

Procedia PDF Downloads 318
892 Procedure for Recommendation of Archival Documents

Authors: Marlon J. Remedios, Maria T. Morell, Jesse D. Cano

Abstract:

Diffusion and accessibility of historical collections is one of the main objectives of the institutions that aim to safeguard archival documents (General Archives). Several countries have Web applications that try to make accessible and public the large number of documents that they guard. Each of these sites has a set of features in order to facilitate access, navigability, and search for information. Different sources of information include Recommender Systems as a way of customizing content. This paper aims at describing a process for the production of archival documents relevant to the user. To comply with this, the characteristics ruling archival description, elements and main techniques that establishes the design of Recommender Systems, a set of rules to follow, and how these rules operate and the way in which take advantage of the domain knowledge are discussed. Finally, relevant issues are discussed in the design of the proposed tests and the results obtained are shown.

Keywords: archival document, recommender system, procedure, information management

Procedia PDF Downloads 483
891 On the Interactive Search with Web Documents

Authors: Mario Kubek, Herwig Unger

Abstract:

Due to the large amount of information in the World Wide Web (WWW, web) and the lengthy and usually linearly ordered result lists of web search engines that do not indicate semantic relationships between their entries, the search for topically similar and related documents can become a tedious task. Especially, the process of formulating queries with proper terms representing specific information needs requires much effort from the user. This problem gets even bigger when the user's knowledge on a subject and its technical terms is not sufficient enough to do so. This article presents the new and interactive search application DocAnalyser that addresses this problem by enabling users to find similar and related web documents based on automatic query formulation and state-of-the-art search word extraction. Additionally, this tool can be used to track topics across semantically connected web documents

Keywords: DocAnalyser, interactive web search, search word extraction, query formulation, source topic detection, topic tracking

Procedia PDF Downloads 365
890 Binarization and Recognition of Characters from Historical Degraded Documents

Authors: Bency Jacob, S.B. Waykar

Abstract:

Degradations in historical document images appear due to aging of the documents. It is very difficult to understand and retrieve text from badly degraded documents as there is variation between the document foreground and background. Thresholding of such document images either result in broken characters or detection of false texts. Numerous algorithms exist that can separate text and background efficiently in the textual regions of the document; but portions of background are mistaken as text in areas that hardly contain any text. This paper presents a way to overcome these problems by a robust binarization technique that recovers the text from a severely degraded document images and thereby increases the accuracy of optical character recognition systems. The proposed document recovery algorithm efficiently removes degradations from document images. Here we are using the ostus method ,local thresholding and global thresholding and after the binarization training and recognizing the characters in the degraded documents.

Keywords: binarization, denoising, global thresholding, local thresholding, thresholding

Procedia PDF Downloads 308
889 One-Class Support Vector Machine for Sentiment Analysis of Movie Review Documents

Authors: Chothmal, Basant Agarwal

Abstract:

Sentiment analysis means to classify a given review document into positive or negative polar document. Sentiment analysis research has been increased tremendously in recent times due to its large number of applications in the industry and academia. Sentiment analysis models can be used to determine the opinion of the user towards any entity or product. E-commerce companies can use sentiment analysis model to improve their products on the basis of users’ opinion. In this paper, we propose a new One-class Support Vector Machine (One-class SVM) based sentiment analysis model for movie review documents. In the proposed approach, we initially extract features from one class of documents, and further test the given documents with the one-class SVM model if a given new test document lies in the model or it is an outlier. Experimental results show the effectiveness of the proposed sentiment analysis model.

Keywords: feature selection methods, machine learning, NB, one-class SVM, sentiment analysis, support vector machine

Procedia PDF Downloads 476
888 Model-Based Field Extraction from Different Class of Administrative Documents

Authors: Jinen Daghrir, Anis Kricha, Karim Kalti

Abstract:

The amount of incoming administrative documents is massive and manually processing these documents is a costly task especially on the timescale. In fact, this problem has led an important amount of research and development in the context of automatically extracting fields from administrative documents, in order to reduce the charges and to increase the citizen satisfaction in administrations. In this matter, we introduce an administrative document understanding system. Given a document in which a user has to select fields that have to be retrieved from a document class, a document model is automatically built. A document model is represented by an attributed relational graph (ARG) where nodes represent fields to extract, and edges represent the relation between them. Both of vertices and edges are attached with some feature vectors. When another document arrives to the system, the layout objects are extracted and an ARG is generated. The fields extraction is translated into a problem of matching two ARGs which relies mainly on the comparison of the spatial relationships between layout objects. Experimental results yield accuracy rates from 75% to 100% tested on eight document classes. Our proposed method has a good performance knowing that the document model is constructed using only one single document.

Keywords: administrative document understanding, logical labelling, logical layout analysis, fields extraction from administrative documents

Procedia PDF Downloads 180
887 Providing a Secure, Reliable and Decentralized Document Management Solution Using Blockchain by a Virtual Identity Card

Authors: Meet Shah, Ankita Aditya, Dhruv Bindra, V. S. Omkar, Aashruti Seervi

Abstract:

In today's world, we need documents everywhere for a smooth workflow in the identification process or any other security aspects. The current system and techniques which are used for identification need one thing, that is ‘proof of existence’, which involves valid documents, for example, educational, financial, etc. The main issue with the current identity access management system and digital identification process is that the system is centralized in their network, which makes it inefficient. The paper presents the system which resolves all these cited issues. It is based on ‘blockchain’ technology, which is a 'decentralized system'. It allows transactions in a decentralized and immutable manner. The primary notion of the model is to ‘have everything with nothing’. It involves inter-linking required documents of a person with a single identity card so that a person can go anywhere without having the required documents with him/her. The person just needs to be physically present at a place wherein documents are necessary, and using a fingerprint impression and an iris scan print, the rest of the verification will progress. Furthermore, some technical overheads and advancements are listed. This paper also aims to layout its far-vision scenario of blockchain and its impact on future trends.

Keywords: blockchain, decentralized system, fingerprint impression, identity management, iris scan

Procedia PDF Downloads 96
886 Societal Acceptability Conditions of Genome Editing for Upland Rice in Madagascar

Authors: Anny Lucrece Nlend Nkott, Ludovic Temple

Abstract:

The appearance in 2012 of the CRISPR-CaS9 genome editing technique marks a turning point in the field of genetics. This technique would make it possible to create new varieties quickly and cheaply. Although some consider CRISPR-CaS9 to be revolutionary, others consider it a potential societal threat. To document the controversy, we explain the socioeconomic conditions under which this technique could be accepted for the creation of a rainfed rice variety in Madagascar. The methodological framework is based on 38 individual and semistructured interviews, a multistakeholder forum with 27 participants, and a survey of 148 rice producers. Results reveal that the acceptability of genome editing requires (i) strengthening the seed system through the operationalization of regulatory structures and the upgrading of stakeholders' knowledge of genetically modified organisms, (ii) assessing the effects of the edited variety on biodiversity and soil nitrogen dynamics, and (iii) strengthening the technical and human capacities of the biosafety body. Structural mechanisms for regulating the seed system are necessary to ensure safe experimentation of genome editing techniques. Organizational innovation also appears to be necessary. The study documents how collective learning between communities of scientists and nonscientists is a component of systemic processes of varietal innovation. This study was carried out with the financial support of the GENERICE project (Generation and Deployment of Genome-Edited, Nitrogen-use-Efficient Rice Varieties), funded by the Agropolis Foundation.

Keywords: CRISPR-CaS9, varietal innovation, seed system, innovation system

Procedia PDF Downloads 113
885 DocPro: A Framework for Processing Semantic and Layout Information in Business Documents

Authors: Ming-Jen Huang, Chun-Fang Huang, Chiching Wei

Abstract:

With the recent advance of the deep neural network, we observe new applications of NLP (natural language processing) and CV (computer vision) powered by deep neural networks for processing business documents. However, creating a real-world document processing system needs to integrate several NLP and CV tasks, rather than treating them separately. There is a need to have a unified approach for processing documents containing textual and graphical elements with rich formats, diverse layout arrangement, and distinct semantics. In this paper, a framework that fulfills this unified approach is presented. The framework includes a representation model definition for holding the information generated by various tasks and specifications defining the coordination between these tasks. The framework is a blueprint for building a system that can process documents with rich formats, styles, and multiple types of elements. The flexible and lightweight design of the framework can help build a system for diverse business scenarios, such as contract monitoring and reviewing.

Keywords: document processing, framework, formal definition, machine learning

Procedia PDF Downloads 183
884 Contribution of a Higher Education Institute towards Built Environment Sustainability

Authors: Tayyab Ahmad, Gerard Healey

Abstract:

The potential role of higher education institutes in sustainable development cannot be undermined. In this regard, it is important to investigate the established concept of sustainability in such institutes to explore the room for further improvement. In this paper, a case study of the University of Melbourne is conducted, and the institute’s commitments towards sustainability are examined by a detailed qualitative review of its policy and design standard documents. These documents are reviewed as through these; the institute portrays its vision of building environment facilities, which it aspires to procure and use. From detailed review, it is realized that these documents are updated at different times, creating the potential for mismatch between them. The occurrence of different goals and objectives in different documents is highlighted, and the interrelationships between different goals and operational objectives are explored. The role of the university aspired goals/objectives in terms of built environment sustainability is discussed, and the gaps in the articulation of goals and operational objectives are highlighted. Recommendations are provided for enhancing the built environment sustainability at the University of Melbourne.

Keywords: university, design standards, policy, sustainability, built environment

Procedia PDF Downloads 141
883 Arabic Text Classification: Review Study

Authors: M. Hijazi, A. Zeki, A. Ismail

Abstract:

An enormous amount of valuable human knowledge is preserved in documents. The rapid growth in the number of machine-readable documents for public or private access requires the use of automatic text classification. Text classification can be defined as assigning or structuring documents into a defined set of classes known in advance. Arabic text classification methods have emerged as a natural result of the existence of a massive amount of varied textual information written in the Arabic language on the web. This paper presents a review on the published researches of Arabic Text Classification using classical data representation, Bag of words (BoW), and using conceptual data representation based on semantic resources such as Arabic WordNet and Wikipedia.

Keywords: Arabic text classification, Arabic WordNet, bag of words, conceptual representation, semantic relations

Procedia PDF Downloads 395
882 Application of Ontologies to Contract for Difference Documents

Authors: Renato Figueira Franco

Abstract:

This paper aims to create a representational information system applied to the securities market, particularly the development of an ontology applied to the analysis of the Key Information Documents of Contracts for Difference. The process of obtaining knowledge and its proper formal representation has raised the attention both from the scientific literature and the capital markets supervisory authorities. The formal knowledge representation is embodied in the construction of ontologies, which are responsible for defining a knowledge base structure of a given scientific domain, facilitating its understanding, and allowing its sharing among the scientific community. The scope of this study is restricted to the analysis of capital markets ontologies in order to capture its structure, semantics and knowledge sharing between people and systems.

Keywords: ontology, financial markets, CFD, PRIIPs, key information documents

Procedia PDF Downloads 32
881 A Proposed Approach for Emotion Lexicon Enrichment

Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees

Abstract:

Document Analysis is an important research field that aims to gather the information by analyzing the data in documents. As one of the important targets for many fields is to understand what people actually want, sentimental analysis field has been one of the vital fields that are tightly related to the document analysis. This research focuses on analyzing text documents to classify each document according to its opinion. The aim of this research is to detect the emotions from text documents based on enriching the lexicon with adapting their content based on semantic patterns extraction. The proposed approach has been presented, and different experiments are applied by different perspectives to reveal the positive impact of the proposed approach on the classification results.

Keywords: document analysis, sentimental analysis, emotion detection, WEKA tool, NRC lexicon

Procedia PDF Downloads 391
880 Framework for Detecting External Plagiarism from Monolingual Documents: Use of Shallow NLP and N-Gram Frequency Comparison

Authors: Saugata Bose, Ritambhra Korpal

Abstract:

The internet has increased the copy-paste scenarios amongst students as well as amongst researchers leading to different levels of plagiarized documents. For this reason, much of research is focused on for detecting plagiarism automatically. In this paper, an initiative is discussed where Natural Language Processing (NLP) techniques as well as supervised machine learning algorithms have been combined to detect plagiarized texts. Here, the major emphasis is on to construct a framework which detects external plagiarism from monolingual texts successfully. For successfully detecting the plagiarism, n-gram frequency comparison approach has been implemented to construct the model framework. The framework is based on 120 characteristics which have been extracted during pre-processing the documents using NLP approach. Afterwards, filter metrics has been applied to select most relevant characteristics and then supervised classification learning algorithm has been used to classify the documents in four levels of plagiarism. Confusion matrix was built to estimate the false positives and false negatives. Our plagiarism framework achieved a very high the accuracy score.

Keywords: lexical matching, shallow NLP, supervised machine learning algorithm, word n-gram

Procedia PDF Downloads 325
879 Wasting Human and Computer Resources

Authors: Mária Csernoch, Piroska Biró

Abstract:

The legends about “user-friendly” and “easy-to-use” birotical tools (computer-related office tools) have been spreading and misleading end-users. This approach has led us to the extremely high number of incorrect documents, causing serious financial losses in the creating, modifying, and retrieving processes. Our research proved that there are at least two sources of this underachievement: (1) The lack of the definition of the correctly edited, formatted documents. Consequently, end-users do not know whether their methods and results are correct or not. They are not aware of their ignorance. They are so ignorant that their ignorance does not allow them to realize their lack of knowledge. (2) The end-users’ problem-solving methods. We have found that in non-traditional programming environments end-users apply, almost exclusively, surface approach metacognitive methods to carry out their computer related activities, which are proved less effective than deep approach methods. Based on these findings we have developed deep approach methods which are based on and adapted from traditional programming languages. In this study, we focus on the most popular type of birotical documents, the text-based documents. We have provided the definition of the correctly edited text, and based on this definition, adapted the debugging method known in programming. According to the method, before the realization of text editing, a thorough debugging of already existing texts and the categorization of errors are carried out. With this method in advance to real text editing users learn the requirements of text-based documents and also of the correctly formatted text. The method has been proved much more effective than the previously applied surface approach methods. The advantages of the method are that the real text handling requires much less human and computer sources than clicking aimlessly in the GUI (Graphical User Interface), and the data retrieval is much more effective than from error-prone documents.

Keywords: deep approach metacognitive methods, error-prone birotical documents, financial losses, human and computer resources

Procedia PDF Downloads 356
878 Lecture Video Indexing and Retrieval Using Topic Keywords

Authors: B. J. Sandesh, Saurabha Jirgi, S. Vidya, Prakash Eljer, Gowri Srinivasa

Abstract:

In this paper, we propose a framework to help users to search and retrieve the portions in the lecture video of their interest. This is achieved by temporally segmenting and indexing the lecture video using the topic keywords. We use transcribed text from the video and documents relevant to the video topic extracted from the web for this purpose. The keywords for indexing are found by applying the non-negative matrix factorization (NMF) topic modeling techniques on the web documents. Our proposed technique first creates indices on the transcribed documents using the topic keywords, and these are mapped to the video to find the start and end time of the portions of the video for a particular topic. This time information is stored in the index table along with the topic keyword which is used to retrieve the specific portions of the video for the query provided by the users.

Keywords: video indexing and retrieval, lecture videos, content based video search, multimodal indexing

Procedia PDF Downloads 204
877 Nonmedical Determinants of Congenital Heart Diseases in Children from the Perspective of Mothers: A Qualitative Study in Iran

Authors: Maryam Borjali

Abstract:

Introduction. Mortality due to noncommunicable diseases has increased in the world today with the advent of demographic shifts, growing age, and lifestyle patterns in the world, which have been affected by economic and social crises. Congenital heart defects are one of the forms of diseases that have raised infant mortality worldwide. e objective of present study was to identify nonmedical determinants related to this abnormality from the mother’s perspectives. Methods. is research was a qualitative study and the data collection method was a semistructured interview with mothers who had children with congenital heart diseases referring to the Shahid Rajaei Heart Hospital in Tehran, Iran. A thematic analysis approach was employed to analyze transcribed documents assisted by MAXQDA Plus version 12. Results. Four general themes and ten subthemes including social contexts (social harms, social interactions, and social necessities), psychological contexts (mood disorders and mental well-being), cultural contexts (unhealthy lifestyle, family culture, and poor parental health behaviors), and environmental contexts (living area and polluted air) were extracted from interviews with mothers of children with congenital heart diseases. Conclusions. Results suggest that factors such as childhood poverty, lack of parental awareness of congenital diseases, lack of proper nutrition and health facilities, education, and lack of medical supervision during pregnancy were most related with the birth of children with congenital heart disease from mothers’ prospective. In this regard, targeted and intersectorial collaborations are proposed to address nonmedical determinants related to the incidence of congenital heart diseases.

Keywords: congenital_cou, cultural, social, platform

Procedia PDF Downloads 65
876 Precarious ID Cards - Studying Documentary Practices in India through the Lens of Internal Migration

Authors: Ambuja Raj

Abstract:

This research will attempt to understand how documents are materially indispensable civic artifacts for migrants in their encounters with the state. Documents such as ID cards are sites of mediation and bureaucratic manifestation which reveal the inherent dynamics of power between the state and a delocalized people. While ID cards allow the holder to retain a different identity and articulate their demands as a citizen, they at the same time transform subjects into ‘objects’ in the exercise of governmental power. The research is based on the study of internal migrants in India, who are ‘visible’ to the state through its host of ID documents such as the ‘Aadhaar card’, electoral IDs, Ration cards, and a variety of region-specific documents, without the possession of which, not only are they unable to access jobs, public goods and services, and accommodation, but are liable to exploitation from state forces and mediators. Through semi-structured interviews with social actors in the processes of documentation and welfare of migrants, as well as with settlements of migrants themselves located in the state of Kerala in India, the thesis will attempt to understand the salience of documentary practices in the lives of inter-state migrants who move within Indian states in the hope of bettering their economic conditions. The research will trace the material and evolving significance of ID cards in the tenacity of states dealing with these ‘illegible’ populations. It will try to bring theories of governmentality, biopolitics and Weberian bureaucracy into the migrant issue while critically grounding itself on secondary literature by scholars who have worked on South Asian ‘governments of paper’.

Keywords: migration, historiography of documents, anthropology of state, documentary practices

Procedia PDF Downloads 161
875 Documents Emotions Classification Model Based on TF-IDF Weighting Measure

Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees

Abstract:

Emotions classification of text documents is applied to reveal if the document expresses a determined emotion from its writer. As different supervised methods are previously used for emotion documents’ classification, in this research we present a novel model that supports the classification algorithms for more accurate results by the support of TF-IDF measure. Different experiments have been applied to reveal the applicability of the proposed model, the model succeeds in raising the accuracy percentage according to the determined metrics (precision, recall, and f-measure) based on applying the refinement of the lexicon, integration of lexicons using different perspectives, and applying the TF-IDF weighting measure over the classifying features. The proposed model has also been compared with other research to prove its competence in raising the results’ accuracy.

Keywords: emotion detection, TF-IDF, WEKA tool, classification algorithms

Procedia PDF Downloads 441
874 Text Localization in Fixed-Layout Documents Using Convolutional Networks in a Coarse-to-Fine Manner

Authors: Beier Zhu, Rui Zhang, Qi Song

Abstract:

Text contained within fixed-layout documents can be of great semantic value and so requires a high localization accuracy, such as ID cards, invoices, cheques, and passports. Recently, algorithms based on deep convolutional networks achieve high performance on text detection tasks. However, for text localization in fixed-layout documents, such algorithms detect word bounding boxes individually, which ignores the layout information. This paper presents a novel architecture built on convolutional neural networks (CNNs). A global text localization network and a regional bounding-box regression network are introduced to tackle the problem in a coarse-to-fine manner. The text localization network simultaneously locates word bounding points, which takes the layout information into account. The bounding-box regression network inputs the features pooled from arbitrarily sized RoIs and refine the localizations. These two networks share their convolutional features and are trained jointly. A typical type of fixed-layout documents: ID cards, is selected to evaluate the effectiveness of the proposed system. These networks are trained on data cropped from nature scene images, and synthetic data produced by a synthetic text generation engine. Experiments show that our approach locates high accuracy word bounding boxes and achieves state-of-the-art performance.

Keywords: bounding box regression, convolutional networks, fixed-layout documents, text localization

Procedia PDF Downloads 163
873 Investigation of Topic Modeling-Based Semi-Supervised Interpretable Document Classifier

Authors: Dasom Kim, William Xiu Shun Wong, Yoonjin Hyun, Donghoon Lee, Minji Paek, Sungho Byun, Namgyu Kim

Abstract:

There have been many researches on document classification for classifying voluminous documents automatically. Through document classification, we can assign a specific category to each unlabeled document on the basis of various machine learning algorithms. However, providing labeled documents manually requires considerable time and effort. To overcome the limitations, the semi-supervised learning which uses unlabeled document as well as labeled documents has been invented. However, traditional document classifiers, regardless of supervised or semi-supervised ones, cannot sufficiently explain the reason or the process of the classification. Thus, in this paper, we proposed a methodology to visualize major topics and class components of each document. We believe that our methodology for visualizing topics and classes of each document can enhance the reliability and explanatory power of document classifiers.

Keywords: data mining, document classifier, text mining, topic modeling

Procedia PDF Downloads 359
872 The Passive Recipient – How the Pupil Comes across in Local Swedish Health Policy Documents

Authors: Zofia Hammerin, Goran Basic, Disa Bergnehr

Abstract:

Ever since the Ottawa charter in 1986, health promotion through schools has been stressed across the globe. Both in the global and national discourse, schools are made responsible not only for providing education but also for working with pupil health and well-being. In Sweden, where the study is set, it is emphasized in national directives that promoting pupil health should be part of the school practice. Since the Swedish school system is decentralized, these directives need to be interpreted and recontextualized locally. This study aims to explore how the student comes across in Swedish local health policy documents. The data consists of 37 such documents called student health plans collected from different high schools throughout Sweden. The analysis was inspired by critical discourse analysis, and tentative results are divided into two main themes; the invisible actor and the passive recipient. The pupil is largely invisible in the documents, and the discourse instead focuses on school health service staff and, to some extent, the teachers. When the pupils are visible, they mainly come across as passive recipients of health promoting actions. Since participation, taking action, and feeling empowered are key aspects of health promotion, the findings could impact the pupils’ possibilities for health and well-being.

Keywords: health promotion, high school, student, sweden

Procedia PDF Downloads 73
871 Authentication of Physical Objects with Dot-Based 2D Code

Authors: Michał Glet, Kamil Kaczyński

Abstract:

Counterfeit goods and documents are a global problem, which needs more and more sophisticated methods of resolving it. Existing techniques using watermarking or embedding symbols on objects are not suitable for all use cases. To address those special needs, we created complete system allowing authentication of paper documents and physical objects with flat surface. Objects are marked using orientation independent and resistant to camera noise 2D graphic codes, named DotAuth. Based on the identifier stored in 2D code, the system is able to perform basic authentication and allows to conduct more sophisticated analysis methods, e.g., relying on augmented reality and physical properties of the object. In this paper, we present the complete architecture, algorithms and applications of the proposed system. Results of the features comparison of the proposed solution and other products are presented as well, pointing to the existence of many advantages that increase usability and efficiency in the means of protecting physical objects.

Keywords: anti-forgery, authentication, paper documents, security

Procedia PDF Downloads 97