Search results for: archival document
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 866

866 Procedure for Recommendation of Archival Documents

Authors: Marlon J. Remedios, Maria T. Morell, Jesse D. Cano

Abstract:

Diffusion and accessibility of historical collections are among the main objectives of the institutions that safeguard archival documents (General Archives). Several countries have Web applications that try to make the large number of documents they guard publicly accessible. Each of these sites has a set of features intended to facilitate access, navigability, and information search. Different sources of information include Recommender Systems as a way of customizing content. This paper describes a procedure for the recommendation of archival documents relevant to the user. To this end, we discuss the characteristics governing archival description, the elements and main techniques that shape the design of Recommender Systems, a set of rules to follow, how these rules operate, and the way in which they take advantage of domain knowledge. Finally, relevant issues in the design of the proposed tests are discussed, and the results obtained are shown.
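
For illustration only, a minimal content-based recommendation sketch in Python, assuming TF-IDF similarity over archival descriptions; the paper's actual procedure is rule-based and exploits domain knowledge, and the descriptions and query below are hypothetical:

```python
# Illustrative content-based recommendation over archival descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical archival descriptions (ISAD(G)-style scope notes).
descriptions = [
    "Colonial land grant records, 18th century, province of Havana",
    "Parish baptism registers, 1720-1790, rural districts",
    "Customs house correspondence on sugar exports, 19th century",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(descriptions)

def recommend(user_query, k=2):
    """Rank archival documents by similarity to the user's interests."""
    query_vec = vectorizer.transform([user_query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return sorted(zip(scores, descriptions), reverse=True)[:k]

for score, doc in recommend("land records 18th century"):
    print(f"{score:.3f}  {doc}")
```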

Keywords: archival document, recommender system, procedure, information management

Procedia PDF Downloads 483
865 Investigation of Topic Modeling-Based Semi-Supervised Interpretable Document Classifier

Authors: Dasom Kim, William Xiu Shun Wong, Yoonjin Hyun, Donghoon Lee, Minji Paek, Sungho Byun, Namgyu Kim

Abstract:

There has been much research on document classification aimed at classifying voluminous documents automatically. Through document classification, we can assign a specific category to each unlabeled document on the basis of various machine learning algorithms. However, providing labeled documents manually requires considerable time and effort. To overcome this limitation, semi-supervised learning, which uses unlabeled documents as well as labeled documents, was devised. However, traditional document classifiers, whether supervised or semi-supervised, cannot sufficiently explain the reason for or the process of the classification. Thus, in this paper, we propose a methodology to visualize the major topics and class components of each document. We believe that our methodology for visualizing the topics and classes of each document can enhance the reliability and explanatory power of document classifiers.

Keywords: data mining, document classifier, text mining, topic modeling

Procedia PDF Downloads 359
864 Incremental Learning of Independent Topic Analysis

Authors: Takahiro Nishigaki, Katsumi Nitta, Takashi Onoda

Abstract:

In this paper, we present a method for applying Independent Topic Analysis (ITA) to a growing collection of document data. The amount of document data has been increasing since the spread of the Internet, and ITA was presented as one method to analyze it. ITA extracts mutually independent topics from document data by using Independent Component Analysis (ICA), a technique from signal processing. However, it is difficult to apply ITA to a growing collection, because ITA must process all the document data at once, so its temporal and spatial costs are very high. Therefore, we present Incremental ITA, which extracts independent topics from a growing collection of document data by updating the topics extracted from the previous data whenever new documents are added. We show the results of applying Incremental ITA to benchmark datasets.
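
A minimal sketch of the underlying ICA-over-TF-IDF step, assuming scikit-learn's FastICA; the incremental update that is the paper's contribution is not reproduced, and the toy corpus is invented:

```python
# Batch ITA-style topic extraction: ICA on a TF-IDF document-term matrix.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the team won the game in the final quarter",
    "the election results gave the party a majority",
    "the striker scored twice and the team won again",
    "voters went to the polls for the early election",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()            # documents x terms
ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
ica.fit(X)

terms = vec.get_feature_names_out()
for i, comp in enumerate(ica.components_):       # one row per independent topic
    top = [terms[j] for j in np.argsort(np.abs(comp))[::-1][:3]]
    print(f"Topic {i}: {top}")
```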

Keywords: text mining, topic extraction, independent, incremental, independent component analysis

Procedia PDF Downloads 275
863 Methodological Proposal, Archival Thesaurus in Colombian Sign Language

Authors: Pedro A. Medina-Rios, Marly Yolie Quintana-Daza

Abstract:

Having the opportunity to communicate in social, academic, and work contexts is very relevant for any individual, and more so for a deaf person, for whom oral language is not their natural language and written language is a second language. Currently, to the best of our knowledge, there is no specialized Colombian sign language dictionary for archiving. Archival work is one of the areas in which the deaf community has a greater chance of employment. Adding new signs to dictionaries for deaf people extends the possibility that they have the appropriate signs to communicate and improves their performance. The aim of this work was to illustrate the importance of designing pedagogical and technological strategies of knowledge management, for the academic inclusion of deaf people, through proposals of lexicon in Colombian Sign Language (LSC) in the area of archival science. As a method, an analytical study was used to identify relevant words in the technical area of archival science and their counterparts in LSC; 30 deaf people, apprentices and students of the Servicio Nacional de Aprendizaje (SENA) in Documentary or Archival Management programs, were evaluated through direct interviews in LSC. For the analysis, tools were used to evaluate correlation patterns together with visual-gestural and corpus linguistic methods; in addition, linear regression methods were used. Among the results, significant associations were found among the variables socioeconomic stratum, academic level, and labor location, as well as the need to generate new signs on archival topics to improve communication between the deaf person, the hearing person, and the sign language interpreter. It is concluded that the generation of new signs to enrich the LSC dictionary in archival subjects is necessary to improve the labor inclusion of deaf people in Colombia.

Keywords: archival, inclusion, deaf, thesaurus

Procedia PDF Downloads 243
862 Distorted Document Images Dataset for Text Detection and Recognition

Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan

Abstract:

With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images have become challenging tasks. However, to the best of our knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce the Distorted Document Images dataset (DDI-100) and provide a detailed analysis of DDI-100 in its current state. To create the dataset, we collected 7,000 unique document pages and extended them by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks and text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 for improving the accuracy and stability of the considered TD and OCR models.

Keywords: document analysis, open dataset, optical character recognition, text detection

Procedia PDF Downloads 132
861 Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition

Authors: L. Hamsaveni, Navya Prakash, Suresha

Abstract:

Document image analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis is adopted in this paper. The technique involves document imaging methods such as image fusing and Speeded Up Robust Features (SURF) detection to identify and extract the degraded regions from a set of document images, so as to obtain an original document with complete information. In case the captured degraded document image is skewed, it has to be deskewed before further processing. The YCbCr image format is used as a tool to convert the grayscale image to RGB format. The presented algorithm is tested on various types of degraded documents such as printed documents, handwritten documents, old script documents, and handwritten image sketches in documents. The purpose of this research is to obtain an original document for a given set of degraded documents of the same source.

Keywords: grayscale image format, image fusing, RGB image format, SURF detection, YCbCr image format

Procedia PDF Downloads 341
860 A Case Study of the Digital Translation of the Lucy Lloyd and Wilhelm Bleek |Xam and !Kun Notebooks into The Digital Bleek and Lloyd

Authors: F. Saptouw

Abstract:

This paper will examine the digitization process of the |Xam and !Kun notebooks, authored by Lucy Lloyd, Dorothea Bleek and Wilhelm Bleek, and their collaborators |a!kunta, ||kabbo, ≠kasin, Dia!kwain, !kweiten ta ||ken, |han≠kass'o, !nanni, Tamme, |uma, and Da during the 19th century. Detail will be provided about the status of the archive, the creation of the digital archive, and selected research projects linked to the archive. The Digital Bleek and Lloyd project is an example of institutional collaboration by the University of Cape Town, University of South Africa, Iziko South African Museum, the National Library of South Africa and the Western Cape Provincial Archives and Records Service. The contemporary value of the archive will be discussed in relation to its current manifestation as a collection of archival and digital objects, each with its own set of properties and archival risk factors. This tension between the two ways to access the archive will be interrogated to shed light on the slippages between the digital object and the archival object. The primary argument is that the process of digitization generates an ontological shift in the status of the archival object. The secondary argument engages with practices for curating encounters with these ontologically shifted objects and with how a contemporary viewer might relate to each. In conclusion, this paper will argue for regarding these archival objects according to the interpretive framework utilized to engage secular relics.

Keywords: archive, curatorship, digitization, museum practice

Procedia PDF Downloads 117
859 Model-Based Field Extraction from Different Class of Administrative Documents

Authors: Jinen Daghrir, Anis Kricha, Karim Kalti

Abstract:

The amount of incoming administrative documents is massive, and manually processing these documents is a costly task, especially in terms of time. This problem has led to a substantial amount of research and development on automatically extracting fields from administrative documents, in order to reduce costs and increase citizen satisfaction with administrations. In this context, we introduce an administrative document understanding system. Given a document in which a user has selected the fields to be retrieved from a document class, a document model is automatically built. A document model is represented by an attributed relational graph (ARG), where nodes represent the fields to extract and edges represent the relations between them; both vertices and edges carry feature vectors. When another document arrives in the system, its layout objects are extracted and an ARG is generated. Field extraction is then translated into a problem of matching two ARGs, which relies mainly on comparing the spatial relationships between layout objects. Experimental results yield accuracy rates from 75% to 100% on eight document classes. Our proposed method performs well considering that the document model is constructed from only a single document.
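
A toy sketch of how an ARG might be built with networkx, using spatial offsets as edge attributes; the field names, feature choices, and the one-edge matching cost are illustrative assumptions, not the paper's model:

```python
# Attributed relational graph (ARG) sketch: nodes are layout fields,
# edges carry the spatial relation between them.
import networkx as nx

def build_arg(fields):
    """fields: {name: (x, y, w, h)} layout boxes for the fields to extract."""
    g = nx.Graph()
    for name, box in fields.items():
        g.add_node(name, box=box)
    names = list(fields)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dx = fields[b][0] - fields[a][0]      # spatial relation between
            dy = fields[b][1] - fields[a][1]      # the two layout objects
            g.add_edge(a, b, dx=dx, dy=dy)
    return g

model = build_arg({"date": (400, 40, 80, 20), "amount": (420, 300, 60, 20)})
incoming = build_arg({"f1": (398, 42, 78, 18), "f2": (417, 305, 62, 22)})

# Crude matching cost: compare the spatial relation stored on each graph's edge.
e_m = model.edges["date", "amount"]
e_i = incoming.edges["f1", "f2"]
cost = abs(e_m["dx"] - e_i["dx"]) + abs(e_m["dy"] - e_i["dy"])
print("spatial relation cost:", cost)   # low cost -> f1/f2 match date/amount
```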

Keywords: administrative document understanding, logical labelling, logical layout analysis, fields extraction from administrative documents

Procedia PDF Downloads 180
858 Augmenting Cultural Heritage Through 4.0 Technologies: A Research on the Archival Jewelry of the Gianfranco Ferré Research Center

Authors: Greta Rizzi, Ashley Gallitto, Federica Vacca

Abstract:

Looking at design artifacts as bearers and disseminators of material knowledge and intangible socio-cultural meanings, the significance of archival jewelry was investigated following digital cultural heritage research streams. The concept of reverse engineering guided the research path: starting with the study of Gianfranco Ferré's archival jewelry and an analysis of its technical heritage and symbolic value, the digitalization, dematerialization, and rematerialization of the artifact were carried out. Accordingly, the proposed paper results from research conducted within the residency program between the Gianfranco Ferré Research Center (GFRC) and the Massachusetts Institute of Technology (MIT), involving both the Design and Mechanical Engineering Departments of Politecnico di Milano. The paper will discuss the analysis of traditional design manufacturing techniques, re-imagined through 3D scanning, 3D modeling, and 3D printing technical knowledge, while emphasizing the significance of the designer's role as an explorer of socio-cultural meanings and a technological mediator in the analog-digital-analog transition.

Keywords: archival jewelry, cultural heritage, rematerialization, reverse engineering

Procedia PDF Downloads 14
857 Binarization and Recognition of Characters from Historical Degraded Documents

Authors: Bency Jacob, S.B. Waykar

Abstract:

Degradations in historical document images appear due to the aging of the documents. It is very difficult to understand and retrieve text from badly degraded documents, as the separation between the document foreground and background is poor. Thresholding of such document images results either in broken characters or in the detection of false text. Numerous algorithms exist that can separate text and background efficiently in the textual regions of a document, but portions of the background are mistaken for text in areas that hardly contain any text. This paper presents a way to overcome these problems through a robust binarization technique that recovers the text from severely degraded document images and thereby increases the accuracy of optical character recognition systems. The proposed document recovery algorithm efficiently removes degradations from document images. Here we use Otsu's method together with local and global thresholding, and after binarization we train on and recognize the characters in the degraded documents.
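
A minimal sketch of the binarization ingredients named in the abstract (Otsu's global method plus local adaptive thresholding), using OpenCV; the combination rule and file names are assumptions:

```python
# Global (Otsu) plus local (adaptive) binarization of a degraded page.
import cv2

gray = cv2.imread("degraded_page.png", cv2.IMREAD_GRAYSCALE)

# Global threshold chosen automatically by Otsu's method (text -> 0, background -> 255).
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local threshold: each pixel is compared with its neighbourhood mean.
local = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                              cv2.THRESH_BINARY, 35, 10)

# A pixel stays dark (text) only where both methods mark it dark, which
# suppresses background regions falsely detected as text.
combined = cv2.bitwise_or(otsu, local)
cv2.imwrite("binarized.png", combined)
```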

Keywords: binarization, denoising, global thresholding, local thresholding, thresholding

Procedia PDF Downloads 308
856 Using Closed Frequent Itemsets for Hierarchical Document Clustering

Authors: Cheng-Jhe Lee, Chiun-Chieh Hsu

Abstract:

Due to the rapid development of the Internet and the increased availability of digital documents, the excessive amount of information on the Internet has led to the information-overflow problem. In order to enable effective information retrieval, document clustering has become a popular research topic in text mining. Clustering is the unsupervised classification of data items into groups without the need for training data. Many conventional document clustering methods perform inefficiently on large document collections because they were originally designed for relational databases; they are therefore impractical for real-world document clustering and require special handling for high dimensionality and high volume. We build on the FIHC (Frequent Itemset-based Hierarchical Clustering) method, a hierarchical clustering method developed for document clustering, whose intuition is that each cluster shares some common words. FIHC uses such words to cluster documents and builds a hierarchical topic tree. In this paper, we combine the FIHC algorithm with an ontology to address the semantic problem and mine the meaning behind the words in documents. Furthermore, we use closed frequent itemsets instead of all frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than well-known document clustering algorithms.
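
The closedness test itself is simple: a frequent itemset is closed iff no proper superset has the same support. A small sketch with invented supports:

```python
# Filter frequent itemsets down to closed ones.
# frequent itemsets -> support (fraction of documents containing all words)
frequent = {
    frozenset({"cluster"}): 0.40,
    frozenset({"document"}): 0.55,
    frozenset({"cluster", "document"}): 0.40,   # same support as {"cluster"}
}

def closed_itemsets(freq):
    closed = {}
    for itemset, support in freq.items():
        # Closed iff no proper superset has exactly the same support.
        if not any(itemset < other and abs(sup - support) < 1e-12
                   for other, sup in freq.items()):
            closed[itemset] = support
    return closed

print(closed_itemsets(frequent))
# {"cluster"} is absorbed by {"cluster", "document"}; only closed sets remain.
```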

Keywords: FIHC, document clustering, ontology, closed frequent itemset

Procedia PDF Downloads 366
855 A Proposed Approach for Emotion Lexicon Enrichment

Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees

Abstract:

Document analysis is an important research field that aims to gather information by analyzing the data in documents. As understanding what people actually want is an important target for many fields, sentiment analysis has been one of the vital fields tightly related to document analysis. This research focuses on analyzing text documents in order to classify each document according to its opinion. The aim of this research is to detect emotions in text documents by enriching the lexicon, adapting its content based on semantic pattern extraction. The proposed approach is presented, and different experiments are applied from different perspectives to reveal the positive impact of the proposed approach on the classification results.
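
A minimal NRC-style lexicon-scoring sketch; the tiny lexicon is hypothetical, and the paper's semantic-pattern enrichment step is omitted:

```python
# Lexicon-based emotion detection: count lexicon hits per emotion.
emotion_lexicon = {
    "happy": {"joy"}, "delighted": {"joy"},
    "afraid": {"fear"}, "terrified": {"fear"},
    "furious": {"anger"},
}

def detect_emotion(text):
    counts = {}
    for token in text.lower().split():
        for emotion in emotion_lexicon.get(token, ()):
            counts[emotion] = counts.get(emotion, 0) + 1
    return max(counts, key=counts.get) if counts else "neutral"

print(detect_emotion("I was delighted and happy with the result"))  # joy
```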

Keywords: document analysis, sentiment analysis, emotion detection, WEKA tool, NRC lexicon

Procedia PDF Downloads 391
854 Internal Audit Function Contributions to the External Audit

Authors: Douglas F. Prawitt, Nathan Y. Sharp, David A. Wood

Abstract:

Consistent with prior experimental and survey studies, we find that IAFs that spend more time directly assisting the external auditor are associated with lower external audit fees. Interestingly, we do not find evidence that external auditors reduce fees based on work previously performed by the IAF. We also find that the time spent assisting the external auditor has a greater negative effect on external audit fees than the time spent performing tasks upon which the auditor may rely but that are not performed as direct assistance to the external audit. Our results also show that the proxies previously used to measure this relation are either not associated or negatively associated with our direct measures of how the IAF can contribute to the external audit, and are highly positively associated with the size and complexity of the organization. Thus, we conclude that the disparate experimental and archival results may be attributable to issues surrounding the construct validity of the measures used in previous archival studies, and that when measures similar to those used in experimental studies are employed in archival tests, the archival results are consistent with the experimental findings. Our research makes four primary contributions to the literature. First, we provide evidence that internal auditing contributes to a reduction in external audit fees. Second, we replicate and provide an explanation for why previous archival studies find that internal auditing has either no association with external audit fees or is associated with an increase in those fees: prior studies generally use proxies of internal audit contribution that do not adequately capture the intended construct. Third, our research expands on survey-based research (e.g., Oil Libya sh.co.) by separately examining the impact on the audit fee of internal auditors' work indirectly assisting external auditors and of internal auditors' prior work upon which external auditors can rely. Finally, we extend prior research by using a new, independent data source to validate and extend prior studies. This data set also allows for a new sample for examining the impact of internal auditing on the external audit fee, and for the use of a more comprehensive external audit fee model that better controls for the determinants of the external audit fee.

Keywords: internal audit, contribution, external audit, function

Procedia PDF Downloads 88
853 Improving the Performance of Requisition Document Online System for Royal Thai Army by Using Time Series Model

Authors: D. Prangchumpol

Abstract:

This research presents a method for forecasting requisition document demands for military units by using exponential smoothing methods to analyze the data. The data used in the forecast are actual requisition document data of The Adjutant General Department. The results of the forecasting model show that the Holt–Winters trend and seasonality method with α=0.1, β=0, γ=0 is appropriate and matches the requisition of documents. In addition, the researcher has developed a requisition online system to improve the performance of requisition document processing in The Adjutant General Department, while also ensuring that operations can be checked.
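
A sketch of the reported Holt–Winters fit with α=0.1, β=0, γ=0, assuming statsmodels' ExponentialSmoothing and a synthetic monthly demand series:

```python
# Holt-Winters fit with the smoothing parameters reported in the abstract;
# the monthly demand series below is synthetic.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

demand = np.array([120, 130, 128, 140, 150, 145, 160, 158, 170, 165, 180, 175,
                   182, 190, 188, 200, 210, 205, 220, 218, 230, 225, 240, 235])

model = ExponentialSmoothing(demand, trend="add", seasonal="add",
                             seasonal_periods=12)
fit = model.fit(smoothing_level=0.1, smoothing_trend=0.0,
                smoothing_seasonal=0.0, optimized=False)
print(fit.forecast(6))   # next six months of requisition demand
```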

Keywords: requisition, holt–winters, time series, royal thai army

Procedia PDF Downloads 278
852 Off-Topic Text Detection System Using a Hybrid Model

Authors: Usama Shahid

Abstract:

Be it written documents, news columns, or students' essays, verifying content can be a time-consuming task. Apart from spelling and grammar mistakes, the proofreader is also supposed to verify whether the content included in the essay or document is relevant. Irrelevant content in a document or essay is referred to as off-topic text, and in this paper, we address the problem of detecting off-topic text in a document using machine learning techniques. Our study aims to identify off-topic content in a document using an Echo State Network model, and we also compare it with other models. A previous study used Convolutional Neural Networks and TF-IDF to detect off-topic text. We rearrange the existing datasets and take new classifiers along with new word embeddings, implementing them on existing and new datasets in order to compare the results with the previously existing CNN model.
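
A bare-bones echo state network sketch in numpy, as one plausible reading of the model: a fixed random reservoir encodes a token-vector sequence, and only a linear readout would be trained; the dimensions and data are invented:

```python
# Echo state network sketch: fixed random reservoir, trainable readout.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 50, 200                      # embedding dim, reservoir size
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

def reservoir_state(sequence, leak=0.3):
    """Run a sequence of word vectors through the reservoir; return the
    final state, usable as a document feature for a linear readout."""
    x = np.zeros(n_res)
    for u in sequence:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
    return x

doc = rng.normal(size=(30, n_in))          # 30 tokens of a (random) document
features = reservoir_state(doc)
print(features.shape)                      # (200,) -> feed to a classifier
```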

Keywords: off-topic, text detection, echo state network, machine learning

Procedia PDF Downloads 49
851 A Methodology for Automatic Diversification of Document Categories

Authors: Dasom Kim, Chen Liu, Myungsu Lim, Su-Hyeon Jeon, ByeoungKug Jeon, Kee-Young Kwahk, Namgyu Kim

Abstract:

Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually assigned a specific category for the convenience of users. In the past, categorization was performed manually. However, manual categorization not only fails to guarantee accuracy but also requires a large amount of time and incurs huge costs. Many studies have been conducted on the automatic creation of categories to overcome the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics, because they assume that one document can be categorized into only one category. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training on a multi-categorized document set; these methods therefore cannot be applied to the multi-categorization of most documents unless multi-categorized training sets are provided. To overcome the requirement of a multi-categorized training set imposed by traditional multi-categorization algorithms, we previously proposed a methodology that can extend the category of a single-categorized document to multiple categories by analyzing the relationships among categories, topics, and documents. In this paper, we design a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.

Keywords: big data analysis, document classification, multi-category, text mining, topic analysis

Procedia PDF Downloads 238
850 Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison

Authors: Po-Fang Hsu, Chiching Wei

Abstract:

In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, this is a challenging task when the whole editing process has not been recorded or traced. Under many temporal uncertainties, we explore the potential of our approach to approximate an accurate comparison and determine which element blocks are related to others by editing. First, we apply a document layout analysis that combines traditional and modern techniques to appropriately segment layouts into blocks of various types. Then we transform the issue into a problem of layout graph matching with textual awareness. Graph matching is a long-studied problem with a broad range of applications; however, unlike previous works focusing on visual images or structural layout, we also bring textual features into our model to adapt it to this domain. Specifically, for electronic documents, we introduce an encoder to deal with the visual presentation decoded from PDF. Additionally, because modifications can make document layout analysis inconsistent between modified documents, and blocks can be merged and split, Sinkhorn divergence is adopted in our neural graph approach, which tries to overcome both issues through many-to-many block matching. We demonstrate this on two categories of layouts, legal agreements and scientific articles, collected from our real-case datasets.
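
A minimal sketch of the Sinkhorn normalization step on a block-affinity matrix, which is the standard construction; the affinities are invented and the neural encoders are omitted:

```python
# Sinkhorn normalization: alternately normalize rows and columns of a
# block-affinity matrix to approximate a (soft) many-to-many matching.
import numpy as np

def sinkhorn(scores, n_iters=50, eps=1e-9):
    m = np.exp(scores)                  # positive affinities between blocks
    for _ in range(n_iters):
        m = m / (m.sum(axis=1, keepdims=True) + eps)   # row normalize
        m = m / (m.sum(axis=0, keepdims=True) + eps)   # column normalize
    return m

affinity = np.array([[4.0, 0.1, 0.2],   # e.g. similarity of layout blocks
                     [0.3, 3.5, 0.2],   # between two document versions
                     [0.1, 0.4, 2.9]])
print(sinkhorn(affinity).round(2))      # near-permutation soft matching
```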

Keywords: document comparison, graph matching, graph neural network, modification similarity, multi-modal

Procedia PDF Downloads 146
849 Use of Interpretable Evolved Search Query Classifiers for Sinhala Documents

Authors: Prasanna Haddela

Abstract:

Document analysis is a well-matured yet still active research field, partly as a result of the intricate nature of building computational tools, but also due to the inherent problems arising from the variety and complexity of human languages. Breaking down language barriers is vital in enabling access to a number of recent technologies. This paper investigates the application of document classification methods to new Sinhala datasets. This language is geographically isolated and rich with many unique features of its own. We examine the interpretability of the classification models, with a particular focus on the use of evolved Lucene search queries, generated using a Genetic Algorithm (GA), as a method of document classification. We compare the accuracy and interpretability of these search queries with those of other popular classifiers. The results are promising and are roughly in line with previous work on English-language datasets.

Keywords: evolved search queries, Sinhala document classification, Lucene Sinhala analyzer, interpretable text classification, genetic algorithm

Procedia PDF Downloads 87
848 Direct Blind Separation Methods for Convolutive Images Mixtures

Authors: Ahmed Hammed, Wady Naanaa

Abstract:

In this paper, we propose a general approach to the problem of a convolutive mixture of images. We use a direct blind source separation method, adding only one non-statistically-justified constraint describing the relationships between the different mixing matrices, with the aim of making the resolution easy. This method can be applied, provided that this constraint is known, to degraded documents affected by the overlapping of text patterns and images. Such degradation is due to chemical and physical reactions of the materials (paper, inks, ...) occurring during the documents' aging, and to other unpredictable causes such as humidity, microorganism infestation, human handling, etc. We demonstrate that this problem corresponds to a convolutive mixture of images, and we then validate our method through numerical examples. We can thus obtain clear images from unreadable ones caused by page superposition, a phenomenon similar to what we find very often in archival documents.

Keywords: blind source separation, convoluted mixture, degraded documents, text-patterns overlapping

Procedia PDF Downloads 294
847 DCDNet: Lightweight Document Corner Detection Network Based on Attention Mechanism

Authors: Kun Xu, Yuan Xu, Jia Qiao

Abstract:

Document detection plays an important role in optical character recognition and text analysis. Because traditional detection methods have weak generalization ability, and deep neural networks have complex structures and large numbers of parameters that cannot be readily deployed on mobile devices, this paper proposes a lightweight Document Corner Detection Network (DCDNet). DCDNet is a two-stage architecture. The first stage, with an Encoder-Decoder structure, adopts depthwise separable convolutions to greatly reduce the network parameters. By introducing the Feature Attention Union (FAU) module, the second stage enhances the feature information along the spatial and channel dimensions and adaptively adjusts the size of the receptive field to enhance the feature expression ability of the model. Aiming at the large difference between the numbers of corner and non-corner pixels, a Weighted Binary Cross-Entropy Loss (WBCE Loss) is proposed, defining corner detection as a classification problem to make the training process more efficient. To make up for the lack of document corner detection datasets, a dataset containing 6620 images, named the Document Corner Detection Dataset (DCDD), was created. Experimental results show that the proposed method obtains fast, stable, and accurate detection results on DCDD.
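
A sketch of a weighted binary cross-entropy for the corner/non-corner pixel imbalance, in PyTorch; the weighting scheme shown is a standard one and not necessarily the paper's exact WBCE formulation:

```python
# Weighted BCE: upweight the rare positive (corner) pixels in the target map.
import torch
import torch.nn.functional as F

def wbce_loss(logits, target):
    """logits, target: (N, 1, H, W); target is 1 at corner pixels."""
    n_pos = target.sum().clamp(min=1.0)
    n_neg = target.numel() - n_pos
    pos_weight = n_neg / n_pos          # upweight the rare corner pixels
    return F.binary_cross_entropy_with_logits(logits, target,
                                              pos_weight=pos_weight)

logits = torch.randn(2, 1, 64, 64)
target = torch.zeros(2, 1, 64, 64)
target[:, :, 10, 12] = 1.0              # a handful of corner pixels
print(wbce_loss(logits, target).item())
```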

Keywords: document detection, corner detection, attention mechanism, lightweight

Procedia PDF Downloads 322
846 An Experiential Learning of Ontology-Based Multi-document Summarization by Removal Summarization Techniques

Authors: Pranjali Avinash Yadav-Deshmukh

Abstract:

The remarkable development of the Internet, along with new technological innovations such as high-speed systems and affordable large storage, has led to a tremendous increase in the amount of, and accessibility to, digital records. For any person, studying all these data is tremendously time-intensive, so there is a great need for effective multi-document summarization (MDS) systems, which can successfully reduce the details found in several records to a short, understandable summary or conclusion. As a theoretical design, our system provides a significant structure for the semantic representation of textual details in an ontology. The suitability of using the ontology for solving multi-document summarization problems in the sector of disaster management is established in the proposed design. A saliency rank is assigned to each sentence, sentences are ranked accordingly, and the top-ranked sentences are chosen for the summary. With regard to summary quality, extensive tests on a selection of media announcements about the 2014 Jammu and Kashmir floods are reported. Ontology-centered multi-document summarization methods using NLP-based extraction outperform other baselines. Our contribution in the proposed component is to implement NLP-based extraction methods to enhance the results.
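
A minimal saliency-ranked extraction sketch, assuming TF-IDF weight sums as the saliency score; the ontology weighting is omitted, and the example sentences are placeholders:

```python
# Extractive summarization: score sentences by saliency, keep the top-k.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Flood waters entered low-lying areas of Srinagar on Sunday.",
    "Rescue teams evacuated residents to relief camps.",
    "The weather was pleasant earlier that month.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
saliency = X.sum(axis=1).A.ravel()            # one score per sentence
top_k = sorted(range(len(sentences)), key=lambda i: -saliency[i])[:2]
print([sentences[i] for i in sorted(top_k)])  # 2-sentence extractive summary
```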

Keywords: disaster management, extraction technique, k-means, multi-document summarization, NLP, ontology, sentence extraction

Procedia PDF Downloads 346
845 Data Gathering and Analysis for Arabic Historical Documents

Authors: Ali Dulla

Abstract:

This paper introduces a new dataset (and the methodology used to generate it) based on a wide range of historical Arabic documents containing clean data and simple, homogeneous page layouts. The experiments are implemented on printed and handwritten documents obtained from important libraries such as the Qatar Digital Library, the British Library, and the Library of Congress. We have gathered and annotated 150 archival document images from different locations and time periods, drawn from documents of the 17th to 19th centuries. The dataset comprises differing page layouts and degradations that challenge text-line segmentation methods. Ground truth is produced using the Aletheia tool by PRImA and stored in an XML representation, in the PAGE (Page Analysis and Ground truth Elements) format. The dataset will be easily available to researchers worldwide for research into the obstacles facing historical Arabic documents, such as the geometric correction of historical Arabic documents.

Keywords: dataset production, ground truth production, historical documents, arbitrary warping, geometric correction

Procedia PDF Downloads 141
844 Using Artificial Intelligence Technology to Build the User-Oriented Platform for Integrated Archival Service

Authors: Lai Wenfang

Abstract:

This study describes how to use artificial intelligence (AI) technology to build a user-oriented platform for integrated archival service. The platform will be launched in 2020 by the National Archives Administration (NAA) in Taiwan. With the progress of information and communication technology (ICT), the NAA has built many systems to provide archival services. In order to cope with new challenges such as new ICT, artificial intelligence, and blockchain, the NAA will use natural language processing (NLP) and machine learning (ML) to build a training model and propose suggestions based on the data sent to the platform. The NAA expects that the platform will not only automatically inform the sending agencies' staff which records catalogues violate the transfer or destruction rules, but also use the model to find details hidden in the catalogues and suggest to NAA staff whether the records should be kept, shortening the auditing time. The platform keeps all users' browsing trails, so that it can predict what kinds of archives a user could be interested in, recommend search terms through visualization, and inform users of newly arrived archives. In addition, according to the Archives Act, NAA staff must spend a lot of time marking or removing personal data, classified data, etc., before archives are provided. To upgrade the archival access service process, the platform will use text recognition patterns to black out such data automatically; the staff only need to correct errors and upload the corrected version, and as the platform learns, the accuracy will get higher. In short, the purpose of the platform is to advance the government's digital transformation and implement the vision of a service-oriented smart government.

Keywords: artificial intelligence, natural language processing, machine learning, visualization

Procedia PDF Downloads 139
843 Visual Template Detection and Compositional Automatic Regular Expression Generation for Business Invoice Extraction

Authors: Anthony Proschka, Deepak Mishra, Merlyn Ramanan, Zurab Baratashvili

Abstract:

Small and medium-sized businesses receive over 160 billion invoices every year. Since these documents exhibit many subtle differences in layout and text, automatically extracting structured fields such as sender name, amount, and VAT rate from them is an open research question. In this paper, existing work in template-based document extraction is extended, and a system is devised that is able to reliably extract all required fields for up to 70% of all documents in the data set, more than any other previously reported method. Approaches are described for 1) detecting through visual features which template a given document belongs to, 2) automatically generating extraction rules for a given new template by composing regular expressions from multiple components, and 3) computing confidence scores that indicate the accuracy of the automatic extractions. The system can generate templates with as little as one training sample and only requires the ground-truth field values instead of detailed annotations, such as bounding boxes, that are hard to obtain. The system is deployed and used inside a commercial accounting software product.
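
A sketch of composing an extraction rule from regular-expression components (approach 2 above); the component names and patterns are illustrative assumptions:

```python
# Compose a field-extraction regex from reusable components.
import re

COMPONENTS = {
    "label":     r"(?:Total|Amount due)",
    "separator": r"\s*:?\s*",
    "currency":  r"(?:EUR|€)?\s*",
    "value":     r"(\d{1,3}(?:[.,]\d{3})*[.,]\d{2})",
}

def compose(*parts):
    return re.compile("".join(COMPONENTS[p] for p in parts))

amount_rule = compose("label", "separator", "currency", "value")
text = "Amount due: EUR 1.234,56"
match = amount_rule.search(text)
if match:
    # A confidence score could be derived from match coverage; here we extract.
    print("amount =", match.group(1))          # -> 1.234,56
```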

Keywords: data mining, information retrieval, business, feature extraction, layout, business data processing, document handling, end-user trained information extraction, document archiving, scanned business documents, automated document processing, F1-measure, commercial accounting software

Procedia PDF Downloads 95
842 Adaptation of Hough Transform Algorithm for Text Document Skew Angle Detection

Authors: Kayode A. Olaniyi, Olabanji F. Omotoye, Adeola A. Ogunleye

Abstract:

Skew detection and correction form an important part of digital document analysis, because uncompensated skew can deteriorate document features and complicate further document image processing steps. Efficient text document analysis and digitization can rarely be achieved when a document is skewed, even at a small angle. Once documents have been digitized through the scanning system and binarization has been achieved, document skew correction is required before further image analysis. Research efforts have been put into this area, with algorithms developed to eliminate document skew. Skew angle correction algorithms can be compared based on performance criteria, the most important being the accuracy of skew angle detection, the range of detectable skew angles, the speed of processing the image, the computational complexity, and consequently the memory space used. The standard Hough transform has successfully been implemented for text document skew angle estimation. However, its accuracy depends largely on how fine the angle step size is, which consumes more time and memory space as accuracy increases, especially where the number of pixels is considerably large. Whenever the Hough transform is used, there is always a trade-off between accuracy and speed, so a more efficient solution is needed that optimizes space as well as time. In this paper, an improved Hough transform (HT) technique that optimizes space as well as time to robustly detect document skew is presented. The modified Hough transform algorithm resolves the contradiction between memory space, running time, and accuracy. Our algorithm starts with a first estimate of the angle, accurate to zero decimal places, using the standard Hough transform, achieving minimal running time and space but limited accuracy. Then, to increase accuracy, supposing the angle estimated by the basic Hough algorithm is x degrees, we run the basic algorithm again over a narrow range around x degrees with an accuracy of one decimal place. The same process is iterated until the desired level of accuracy is achieved. The skew estimation and correction algorithm for text images is implemented using MATLAB. The memory space estimates and processing times are also tabulated, with skew angles assumed to lie between 0° and 45°. The simulation results, demonstrated in MATLAB, show the high performance of our algorithm, with less computational time and memory space used in detecting document skew for a variety of documents with different levels of complexity.
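
A Python/OpenCV sketch of the two-pass, coarse-to-fine idea (the paper's implementation is in MATLAB); the thresholds, window width, and file name are assumptions:

```python
# Coarse-to-fine Hough skew estimation: 1-degree pass, then a refined pass.
import cv2
import numpy as np

def dominant_angle(edges, theta_step, lo, hi):
    lines = cv2.HoughLines(edges, 1, theta_step, threshold=100,
                           min_theta=lo, max_theta=hi)
    return None if lines is None else float(np.median(lines[:, 0, 1]))

img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Pass 1: whole range at 1-degree resolution (fast, low memory).
coarse = dominant_angle(edges, np.deg2rad(1.0), 0.0, np.pi)
# Pass 2: a narrow window around the coarse estimate at 0.1-degree resolution.
fine = dominant_angle(edges, np.deg2rad(0.1),
                      max(coarse - np.deg2rad(1.0), 0.0),
                      min(coarse + np.deg2rad(1.0), np.pi))
skew = np.rad2deg(fine) - 90.0        # theta of 90 degrees = horizontal lines
print(f"estimated skew: {skew:.1f} degrees")
```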

Keywords: hough-transform, skew-detection, skew-angle, skew-correction, text-document

Procedia PDF Downloads 123
841 Web Search Engine Based Naming Procedure for Independent Topic

Authors: Takahiro Nishigaki, Takashi Onoda

Abstract:

In recent years, the amount of document data has been increasing since the spread of the Internet, and many methods have been studied for extracting topics from large document collections. We previously proposed Independent Topic Analysis (ITA) to extract mutually independent topics from large document data such as newspaper data. ITA extracts independent topics from document data by using Independent Component Analysis. A topic produced by ITA is represented by a set of words; however, this set of words can be quite different from the topics the user imagines. For example, the top five words with high independence for one topic are: Topic1 = {"scor", "game", "lead", "quarter", "rebound"}. This topic can be considered to represent "SPORTS", but the topic name "SPORTS" has to be attached by the user; ITA cannot name topics. Therefore, in this research, we propose a method that uses a web search engine to obtain, for the word sets given by Independent Topic Analysis, topic names that are easy for people to understand. In particular, we search for the set of topical words, and the title of the homepage in the search results is taken as the topic name. We apply the proposed method to several datasets and verify its effectiveness.

Keywords: independent topic analysis, topic extraction, topic naming, web search engine

Procedia PDF Downloads 93
840 A Lost Tradition: Reflections towards Select Tribal Songs of Odisha

Authors: Akshaya K. Rath, Manjit Mahanta

Abstract:

The paper aims at examining the oral tradition of the Kondh and Oroan people of Odisha. Highlighting translated versions of Kondh and Oroan songs, chiefly those concerning agriculture, we argue that the relevance of these songs has fallen apart in recent decades with the advancement of modern knowledge and thinking. What remains instead is a faint voice in the oral tradition that sings of past indigenous knowledge in the form of oral literature. Though there have been a few attempts by individuals to document this rich cultural tradition, Sitakant Mahapatra's can be cited as an example, the need to document the tradition remains ever-present. In short, the thesis examines Kondh and Oroan "songs" and argues for the need to document the tradition. It also presents a comparative study of both tribes' agricultural songs, which shows their cultural identity and diversity and how these tribal groups are associated with nature and its cycle.

Keywords: oral tradition, Meriah, folklore, karma, Oroan

Procedia PDF Downloads 434
839 Celebrating Community Heritage through the People’s Collection Wales: A Case Study in the Development of Collecting Traditions and Engagement

Authors: Gruffydd E. Jones

Abstract:

The world's largest collection of historical, cultural, and heritage material sits unarchived and undocumented in the hands of the public. Not only does this material represent the missing collections in heritage-sector archives today, but it is also the key to providing a diverse range of communities with the means to express their history in their own words and to celebrate their unique, personal heritage. The People's Collection Wales (PCW) acts as a platform on which the heritage of Wales and her people can be collated and shared, at the heart of which is a thriving community engagement programme across a network of museums, archives, and libraries. By providing communities with the archival skillset commonly employed throughout the heritage sector, PCW enables local projects, societies, and individuals to express their understanding of local heritage in their own voices, empowering communities to embrace their diverse and complex identities around Wales. Drawing on key examples from the project's history, this paper will demonstrate the successful way in which museums have been developed as hubs for community engagement, where the public has been at the heart of collection and documentation activities, informing collection and curatorial policies to benefit both the institute and its local community. This paper will also highlight how collections from marginalised, under-represented, and minority communities have been published and celebrated extensively around Wales, including their adoption by the education system in classrooms today. Any activity within the heritage sector, whether of collection, preservation, digitisation, or accessibility, should take account of community engagement opportunities, not only to remain relevant but in order to develop as community hubs: pivots around which local heritage is supported and preserved. Attention will be drawn to our digitisation workflow, which, through training and support from museums and libraries, has allowed the public not only to become involved but to actively lead the contemporary evolution of documentation strategies in Wales. This paper will demonstrate how the PCW online access archive is promoting museum collections, encouraging user interaction, and providing an invaluable platform on which a broader community can inform, preserve, and celebrate their cultural heritage through their own archival material. The continuing evolution of heritage engagement depends wholly on placing communities at the heart of the sector, recognising their wealth of cultural knowledge, and developing the archival skillset necessary for them to become archival practitioners in their own right.

Keywords: social history, cultural heritage, community heritage, museums, archives, libraries, community engagement, oral history, community archives

Procedia PDF Downloads 51
838 A Similarity Measure for Classification and Clustering in Image Based Medical and Text Based Banking Applications

Authors: K. P. Sandesh, M. H. Suman

Abstract:

Text processing plays an important role in information retrieval, data mining, and web search. Measuring the similarity between documents is an important operation in the text processing field. In this project, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: (1) the feature appears in both documents; (2) the feature appears in only one document; and (3) the feature appears in neither of the documents. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, especially in the banking and health sectors. The results show that the performance obtained by the proposed measure is better than that achieved by the other measures.
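
A sketch of a per-feature similarity with the three cases from the abstract; the exact functional form and penalty are assumptions, not the paper's formula:

```python
# Per-feature similarity with the three presence/absence cases.
def pair_similarity(a, b):
    """a, b: term-frequency dicts for two documents."""
    total, features = 0.0, set(a) | set(b)
    for f in features:
        x, y = a.get(f, 0), b.get(f, 0)
        if x > 0 and y > 0:                     # case 1: present in both
            total += 1.0 - abs(x - y) / (x + y)
        elif x > 0 or y > 0:                    # case 2: present in one only
            total -= 0.5                        # dissimilarity penalty
        # case 3: absent from both contributes nothing here
    return total / len(features)

d1 = {"loan": 3, "credit": 1, "account": 2}
d2 = {"loan": 2, "account": 2, "deposit": 1}
print(round(pair_similarity(d1, d2), 3))        # -> 0.2
```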

Keywords: document classification, document clustering, entropy, accuracy, classifiers, clustering algorithms

Procedia PDF Downloads 478
837 One-Class Support Vector Machine for Sentiment Analysis of Movie Review Documents

Authors: Chothmal, Basant Agarwal

Abstract:

Sentiment analysis means classifying a given review document as a positive or negative polar document. Sentiment analysis research has increased tremendously in recent times due to its large number of applications in industry and academia. Sentiment analysis models can be used to determine the opinion of a user towards any entity or product; e-commerce companies, for example, can use such models to improve their products on the basis of users' opinions. In this paper, we propose a new One-Class Support Vector Machine (One-Class SVM) based sentiment analysis model for movie review documents. In the proposed approach, we initially extract features from one class of documents, and then test whether a given new document lies within the one-class SVM model or is an outlier. Experimental results show the effectiveness of the proposed sentiment analysis model.
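
A minimal one-class SVM sentiment sketch with scikit-learn: train on positive reviews only, then flag new reviews as inliers or outliers; the data is made up:

```python
# One-class SVM trained on a single (positive) class of review documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

positive_reviews = [
    "a wonderful film with great acting",
    "brilliant plot and a moving performance",
    "great direction, wonderful script",
]
vec = TfidfVectorizer()
X_pos = vec.fit_transform(positive_reviews)

clf = OneClassSVM(kernel="linear", nu=0.1).fit(X_pos)

tests = ["wonderful acting and a great script",
         "dull, boring and a waste of time"]
print(clf.predict(vec.transform(tests)))   # +1 = inlier (positive), -1 = outlier
```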

Keywords: feature selection methods, machine learning, NB, one-class SVM, sentiment analysis, support vector machine

Procedia PDF Downloads 476