Search results for: document detection
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 4030

Search results for: document detection

4030 Distorted Document Images Dataset for Text Detection and Recognition

Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan

Abstract:

With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images become challenging tasks. However, according to our best knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of the DDI-100 in its current state. To create the dataset we collected 7000 unique document pages, and extend it by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks, text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 to improve accuracy and stability of the considered TD and OCR models.

Keywords: document analysis, open dataset, optical character recognition, text detection

Procedia PDF Downloads 134
4029 DCDNet: Lightweight Document Corner Detection Network Based on Attention Mechanism

Authors: Kun Xu, Yuan Xu, Jia Qiao

Abstract:

The document detection plays an important role in optical character recognition and text analysis. Because the traditional detection methods have weak generalization ability, and deep neural network has complex structure and large number of parameters, which cannot be well applied in mobile devices, this paper proposes a lightweight Document Corner Detection Network (DCDNet). DCDNet is a two-stage architecture. The first stage with Encoder-Decoder structure adopts depthwise separable convolution to greatly reduce the network parameters. After introducing the Feature Attention Union (FAU) module, the second stage enhances the feature information of spatial and channel dim and adaptively adjusts the size of receptive field to enhance the feature expression ability of the model. Aiming at solving the problem of the large difference in the number of pixel distribution between corner and non-corner, Weighted Binary Cross Entropy Loss (WBCE Loss) is proposed to define corner detection problem as a classification problem to make the training process more efficient. In order to make up for the lack of Dataset of document corner detection, a Dataset containing 6620 images named Document Corner Detection Dataset (DCDD) is made. Experimental results show that the proposed method can obtain fast, stable and accurate detection results on DCDD.

Keywords: document detection, corner detection, attention mechanism, lightweight

Procedia PDF Downloads 323
4028 Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition

Authors: L. Hamsaveni, Navya Prakash, Suresha

Abstract:

Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document images to obtain an original document with complete information. In case, degraded document image captured is skewed, it has to be straightened (deskew) to perform further process. A special format of image storing known as YCbCr is used as a tool to convert the Grayscale image to RGB image format. The presented algorithm is tested on various types of degraded documents such as printed documents, handwritten documents, old script documents and handwritten image sketches in documents. The purpose of this research is to obtain an original document for a given set of degraded documents of the same source.

Keywords: grayscale image format, image fusing, RGB image format, SURF detection, YCbCr image format

Procedia PDF Downloads 341
4027 Off-Topic Text Detection System Using a Hybrid Model

Authors: Usama Shahid

Abstract:

Be it written documents, news columns, or students' essays, verifying the content can be a time-consuming task. Apart from the spelling and grammar mistakes, the proofreader is also supposed to verify whether the content included in the essay or document is relevant or not. The irrelevant content in any document or essay is referred to as off-topic text and in this paper, we will address the problem of off-topic text detection from a document using machine learning techniques. Our study aims to identify the off-topic content from a document using Echo state network model and we will also compare data with other models. The previous study uses Convolutional Neural Networks and TFIDF to detect off-topic text. We will rearrange the existing datasets and take new classifiers along with new word embeddings and implement them on existing and new datasets in order to compare the results with the previously existing CNN model.

Keywords: off topic, text detection, eco state network, machine learning

Procedia PDF Downloads 51
4026 Binarization and Recognition of Characters from Historical Degraded Documents

Authors: Bency Jacob, S.B. Waykar

Abstract:

Degradations in historical document images appear due to aging of the documents. It is very difficult to understand and retrieve text from badly degraded documents as there is variation between the document foreground and background. Thresholding of such document images either result in broken characters or detection of false texts. Numerous algorithms exist that can separate text and background efficiently in the textual regions of the document; but portions of background are mistaken as text in areas that hardly contain any text. This paper presents a way to overcome these problems by a robust binarization technique that recovers the text from a severely degraded document images and thereby increases the accuracy of optical character recognition systems. The proposed document recovery algorithm efficiently removes degradations from document images. Here we are using the ostus method ,local thresholding and global thresholding and after the binarization training and recognizing the characters in the degraded documents.

Keywords: binarization, denoising, global thresholding, local thresholding, thresholding

Procedia PDF Downloads 310
4025 Adaptation of Hough Transform Algorithm for Text Document Skew Angle Detection

Authors: Kayode A. Olaniyi, Olabanji F. Omotoye, Adeola A. Ogunleye

Abstract:

The skew detection and correction form an important part of digital document analysis. This is because uncompensated skew can deteriorate document features and can complicate further document image processing steps. Efficient text document analysis and digitization can rarely be achieved when a document is skewed even at a small angle. Once the documents have been digitized through the scanning system and binarization also achieved, document skew correction is required before further image analysis. Research efforts have been put in this area with algorithms developed to eliminate document skew. Skew angle correction algorithms can be compared based on performance criteria. Most important performance criteria are accuracy of skew angle detection, range of skew angle for detection, speed of processing the image, computational complexity and consequently memory space used. The standard Hough Transform has successfully been implemented for text documentation skew angle estimation application. However, the standard Hough Transform algorithm level of accuracy depends largely on how much fine the step size for the angle used. This consequently consumes more time and memory space for increase accuracy and, especially where number of pixels is considerable large. Whenever the Hough transform is used, there is always a tradeoff between accuracy and speed. So a more efficient solution is needed that optimizes space as well as time. In this paper, an improved Hough transform (HT) technique that optimizes space as well as time to robustly detect document skew is presented. The modified algorithm of Hough Transform presents solution to the contradiction between the memory space, running time and accuracy. Our algorithm starts with the first step of angle estimation accurate up to zero decimal place using the standard Hough Transform algorithm achieving minimal running time and space but lacks relative accuracy. Then to increase accuracy, suppose estimated angle found using the basic Hough algorithm is x degree, we then run again basic algorithm from range between ±x degrees with accuracy of one decimal place. Same process is iterated till level of desired accuracy is achieved. The procedure of our skew estimation and correction algorithm of text images is implemented using MATLAB. The memory space estimation and process time are also tabulated with skew angle assumption of within 00 and 450. The simulation results which is demonstrated in Matlab show the high performance of our algorithms with less computational time and memory space used in detecting document skew for a variety of documents with different levels of complexity.

Keywords: hough-transform, skew-detection, skew-angle, skew-correction, text-document

Procedia PDF Downloads 124
4024 Investigation of Topic Modeling-Based Semi-Supervised Interpretable Document Classifier

Authors: Dasom Kim, William Xiu Shun Wong, Yoonjin Hyun, Donghoon Lee, Minji Paek, Sungho Byun, Namgyu Kim

Abstract:

There have been many researches on document classification for classifying voluminous documents automatically. Through document classification, we can assign a specific category to each unlabeled document on the basis of various machine learning algorithms. However, providing labeled documents manually requires considerable time and effort. To overcome the limitations, the semi-supervised learning which uses unlabeled document as well as labeled documents has been invented. However, traditional document classifiers, regardless of supervised or semi-supervised ones, cannot sufficiently explain the reason or the process of the classification. Thus, in this paper, we proposed a methodology to visualize major topics and class components of each document. We believe that our methodology for visualizing topics and classes of each document can enhance the reliability and explanatory power of document classifiers.

Keywords: data mining, document classifier, text mining, topic modeling

Procedia PDF Downloads 359
4023 A Proposed Approach for Emotion Lexicon Enrichment

Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees

Abstract:

Document Analysis is an important research field that aims to gather the information by analyzing the data in documents. As one of the important targets for many fields is to understand what people actually want, sentimental analysis field has been one of the vital fields that are tightly related to the document analysis. This research focuses on analyzing text documents to classify each document according to its opinion. The aim of this research is to detect the emotions from text documents based on enriching the lexicon with adapting their content based on semantic patterns extraction. The proposed approach has been presented, and different experiments are applied by different perspectives to reveal the positive impact of the proposed approach on the classification results.

Keywords: document analysis, sentimental analysis, emotion detection, WEKA tool, NRC lexicon

Procedia PDF Downloads 394
4022 Incremental Learning of Independent Topic Analysis

Authors: Takahiro Nishigaki, Katsumi Nitta, Takashi Onoda

Abstract:

In this paper, we present a method of applying Independent Topic Analysis (ITA) to increasing the number of document data. The number of document data has been increasing since the spread of the Internet. ITA was presented as one method to analyze the document data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis (ICA). ICA is a technique in the signal processing; however, it is difficult to apply the ITA to increasing number of document data. Because ITA must use the all document data so temporal and spatial cost is very high. Therefore, we present Incremental ITA which extracts the independent topics from increasing number of document data. Incremental ITA is a method of updating the independent topics when the document data is added after extracted the independent topics from a just previous the data. In addition, Incremental ITA updates the independent topics when the document data is added. And we show the result applied Incremental ITA to benchmark datasets.

Keywords: text mining, topic extraction, independent, incremental, independent component analysis

Procedia PDF Downloads 275
4021 Lexical Based Method for Opinion Detection on Tripadvisor Collection

Authors: Faiza Belbachir, Thibault Schienhinski

Abstract:

The massive development of online social networks allows users to post and share their opinions on various topics. With this huge volume of opinion, it is interesting to extract and interpret these information for different domains, e.g., product and service benchmarking, politic, system of recommendation. This is why opinion detection is one of the most important research tasks. It consists on differentiating between opinion data and factual data. The difficulty of this task is to determine an approach which returns opinionated document. Generally, there are two approaches used for opinion detection i.e. Lexical based approaches and Machine Learning based approaches. In Lexical based approaches, a dictionary of sentimental words is used, words are associated with weights. The opinion score of document is derived by the occurrence of words from this dictionary. In Machine learning approaches, usually a classifier is trained using a set of annotated document containing sentiment, and features such as n-grams of words, part-of-speech tags, and logical forms. Majority of these works are based on documents text to determine opinion score but dont take into account if these texts are really correct. Thus, it is interesting to exploit other information to improve opinion detection. In our work, we will develop a new way to consider the opinion score. We introduce the notion of trust score. We determine opinionated documents but also if these opinions are really trustable information in relation with topics. For that we use lexical SentiWordNet to calculate opinion and trust scores, we compute different features about users like (numbers of their comments, numbers of their useful comments, Average useful review). After that, we combine opinion score and trust score to obtain a final score. We applied our method to detect trust opinions in TRIPADVISOR collection. Our experimental results report that the combination between opinion score and trust score improves opinion detection.

Keywords: Tripadvisor, opinion detection, SentiWordNet, trust score

Procedia PDF Downloads 164
4020 Electrical Dault Detection of Photovoltaic System: A Short-Circuit Fault Case

Authors: Moustapha H. Ibrahim, Dahir Abdourahman

Abstract:

This document presents a short-circuit fault detection process in a photovoltaic (PV) system. The proposed method is developed in MATLAB/Simulink. It determines whatever the size of the installation number of the short circuit module. The proposed algorithm indicates the presence or absence of an abnormality on the power of the PV system through measures of hourly global irradiation, power output, and ambient temperature. In case a fault is detected, it displays the number of modules in a short circuit. This fault detection method has been successfully tested on two different PV installations.

Keywords: PV system, short-circuit, fault detection, modelling, MATLAB-Simulink

Procedia PDF Downloads 201
4019 Model-Based Field Extraction from Different Class of Administrative Documents

Authors: Jinen Daghrir, Anis Kricha, Karim Kalti

Abstract:

The amount of incoming administrative documents is massive and manually processing these documents is a costly task especially on the timescale. In fact, this problem has led an important amount of research and development in the context of automatically extracting fields from administrative documents, in order to reduce the charges and to increase the citizen satisfaction in administrations. In this matter, we introduce an administrative document understanding system. Given a document in which a user has to select fields that have to be retrieved from a document class, a document model is automatically built. A document model is represented by an attributed relational graph (ARG) where nodes represent fields to extract, and edges represent the relation between them. Both of vertices and edges are attached with some feature vectors. When another document arrives to the system, the layout objects are extracted and an ARG is generated. The fields extraction is translated into a problem of matching two ARGs which relies mainly on the comparison of the spatial relationships between layout objects. Experimental results yield accuracy rates from 75% to 100% tested on eight document classes. Our proposed method has a good performance knowing that the document model is constructed using only one single document.

Keywords: administrative document understanding, logical labelling, logical layout analysis, fields extraction from administrative documents

Procedia PDF Downloads 182
4018 Adaptation of Projection Profile Algorithm for Skewed Handwritten Text Line Detection

Authors: Kayode A. Olaniyi, Tola. M. Osifeko, Adeola A. Ogunleye

Abstract:

Text line segmentation is an important step in document image processing. It represents a labeling process that assigns the same label using distance metric probability to spatially aligned units. Text line detection techniques have successfully been implemented mainly in printed documents. However, processing of the handwritten texts especially unconstrained documents has remained a key problem. This is because the unconstrained hand-written text lines are often not uniformly skewed. The spaces between text lines may not be obvious, complicated by the nature of handwriting and, overlapping ascenders and/or descenders of some characters. Hence, text lines detection and segmentation represents a leading challenge in handwritten document image processing. Text line detection methods that rely on the traditional global projection profile of the text document cannot efficiently confront with the problem of variable skew angles between different text lines. Hence, the formulation of a horizontal line as a separator is often not efficient. This paper presents a technique to segment a handwritten document into distinct lines of text. The proposed algorithm starts, by partitioning the initial text image into columns, across its width into chunks of about 5% each. At each vertical strip of 5%, the histogram of horizontal runs is projected. We have worked with the assumption that text appearing in a single strip is almost parallel to each other. The algorithm developed provides a sliding window through the first vertical strip on the left side of the page. It runs through to identify the new minimum corresponding to a valley in the projection profile. Each valley would represent the starting point of the orientation line and the ending point is the minimum point on the projection profile of the next vertical strip. The derived text-lines traverse around any obstructing handwritten vertical strips of connected component by associating it to either the line above or below. A decision of associating such connected component is made by the probability obtained from a distance metric decision. The technique outperforms the global projection profile for text line segmentation and it is robust to handle skewed documents and those with lines running into each other.

Keywords: connected-component, projection-profile, segmentation, text-line

Procedia PDF Downloads 92
4017 Visual Template Detection and Compositional Automatic Regular Expression Generation for Business Invoice Extraction

Authors: Anthony Proschka, Deepak Mishra, Merlyn Ramanan, Zurab Baratashvili

Abstract:

Small and medium-sized businesses receive over 160 billion invoices every year. Since these documents exhibit many subtle differences in layout and text, extracting structured fields such as sender name, amount, and VAT rate from them automatically is an open research question. In this paper, existing work in template-based document extraction is extended, and a system is devised that is able to reliably extract all required fields for up to 70% of all documents in the data set, more than any other previously reported method. The approaches are described for 1) detecting through visual features which template a given document belongs to, 2) automatically generating extraction rules for a given new template by composing regular expressions from multiple components, and 3) computing confidence scores that indicate the accuracy of the automatic extractions. The system can generate templates with as little as one training sample and only requires the ground truth field values instead of detailed annotations such as bounding boxes that are hard to obtain. The system is deployed and used inside a commercial accounting software.

Keywords: data mining, information retrieval, business, feature extraction, layout, business data processing, document handling, end-user trained information extraction, document archiving, scanned business documents, automated document processing, F1-measure, commercial accounting software

Procedia PDF Downloads 96
4016 Comprehensive Review of Adversarial Machine Learning in PDF Malware

Authors: Preston Nabors, Nasseh Tabrizi

Abstract:

Portable Document Format (PDF) files have gained significant popularity for sharing and distributing documents due to their universal compatibility. However, the widespread use of PDF files has made them attractive targets for cybercriminals, who exploit vulnerabilities to deliver malware and compromise the security of end-user systems. This paper reviews notable contributions in PDF malware detection, including static, dynamic, signature-based, and hybrid analysis. It presents a comprehensive examination of PDF malware detection techniques, focusing on the emerging threat of adversarial sampling and the need for robust defense mechanisms. The paper highlights the vulnerability of machine learning classifiers to evasion attacks. It explores adversarial sampling techniques in PDF malware detection to produce mimicry and reverse mimicry evasion attacks, which aim to bypass detection systems. Improvements for future research are identified, including accessible methods, applying adversarial sampling techniques to malicious payloads, evaluating other models, evaluating the importance of features to malware, implementing adversarial defense techniques, and conducting comprehensive examination across various scenarios. By addressing these opportunities, researchers can enhance PDF malware detection and develop more resilient defense mechanisms against adversarial attacks.

Keywords: adversarial attacks, adversarial defense, adversarial machine learning, intrusion detection, PDF malware, malware detection, malware detection evasion

Procedia PDF Downloads 13
4015 Anatomical Survey for Text Pattern Detection

Authors: S. Tehsin, S. Kausar

Abstract:

The ultimate aim of machine intelligence is to explore and materialize the human capabilities, one of which is the ability to detect various text objects within one or more images displayed on any canvas including prints, videos or electronic displays. Multimedia data has increased rapidly in past years. Textual information present in multimedia contains important information about the image/video content. However, it needs to technologically testify the commonly used human intelligence of detecting and differentiating the text within an image, for computers. Hence in this paper feature set based on anatomical study of human text detection system is proposed. Subsequent examination bears testimony to the fact that the features extracted proved instrumental to text detection.

Keywords: biologically inspired vision, content based retrieval, document analysis, text extraction

Procedia PDF Downloads 417
4014 Continuous Land Cover Change Detection in Subtropical Thicket Ecosystems

Authors: Craig Mahlasi

Abstract:

The Subtropical Thicket Biome has been in peril of transformation. Estimates indicate that as much as 63% of the Subtropical Thicket Biome is severely degraded. Agricultural expansion is the main driver of transformation. While several studies have sought to document and map the long term transformations, there is a lack of information on disturbance events that allow for timely intervention by authorities. Furthermore, tools that seek to perform continuous land cover change detection are often developed for forests and thus tend to perform poorly in thicket ecosystems. This study investigates the utility of Earth Observation data for continuous land cover change detection in Subtropical Thicket ecosystems. Temporal Neural Networks are implemented on a time series of Sentinel-2 observations. The model obtained 0.93 accuracy, a recall score of 0.93, and a precision score of 0.91 in detecting Thicket disturbances. The study demonstrates the potential of continuous land cover change in Subtropical Thicket ecosystems.

Keywords: remote sensing, land cover change detection, subtropical thickets, near-real time

Procedia PDF Downloads 130
4013 Using Closed Frequent Itemsets for Hierarchical Document Clustering

Authors: Cheng-Jhe Lee, Chiun-Chieh Hsu

Abstract:

Due to the rapid development of the Internet and the increased availability of digital documents, the excessive information on the Internet has led to information overflow problem. In order to solve these problems for effective information retrieval, document clustering in text mining becomes a popular research topic. Clustering is the unsupervised classification of data items into groups without the need of training data. Many conventional document clustering methods perform inefficiently for large document collections because they were originally designed for relational database. Therefore they are impractical in real-world document clustering and require special handling for high dimensionality and high volume. We propose the FIHC (Frequent Itemset-based Hierarchical Clustering) method, which is a hierarchical clustering method developed for document clustering, where the intuition of FIHC is that there exist some common words for each cluster. FIHC uses such words to cluster documents and builds hierarchical topic tree. In this paper, we combine FIHC algorithm with ontology to solve the semantic problem and mine the meaning behind the words in documents. Furthermore, we use the closed frequent itemsets instead of only use frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than those of well-known document clustering algorithms.

Keywords: FIHC, documents clustering, ontology, closed frequent itemset

Procedia PDF Downloads 366
4012 Multimedia Data Fusion for Event Detection in Twitter by Using Dempster-Shafer Evidence Theory

Authors: Samar M. Alqhtani, Suhuai Luo, Brian Regan

Abstract:

Data fusion technology can be the best way to extract useful information from multiple sources of data. It has been widely applied in various applications. This paper presents a data fusion approach in multimedia data for event detection in twitter by using Dempster-Shafer evidence theory. The methodology applies a mining algorithm to detect the event. There are two types of data in the fusion. The first is features extracted from text by using the bag-ofwords method which is calculated using the term frequency-inverse document frequency (TF-IDF). The second is the visual features extracted by applying scale-invariant feature transform (SIFT). The Dempster - Shafer theory of evidence is applied in order to fuse the information from these two sources. Our experiments have indicated that comparing to the approaches using individual data source, the proposed data fusion approach can increase the prediction accuracy for event detection. The experimental result showed that the proposed method achieved a high accuracy of 0.97, comparing with 0.93 with texts only, and 0.86 with images only.

Keywords: data fusion, Dempster-Shafer theory, data mining, event detection

Procedia PDF Downloads 371
4011 Efficient Signal Detection Using QRD-M Based on Channel Condition in MIMO-OFDM System

Authors: Jae-Jeong Kim, Ki-Ro Kim, Hyoung-Kyu Song

Abstract:

In this paper, we propose an efficient signal detector that switches M parameter of QRD-M detection scheme is proposed for MIMO-OFDM system. The proposed detection scheme calculates the threshold by 1-norm condition number and then switches M parameter of QRD-M detection scheme according to channel information. If channel condition is bad, the parameter M is set to high value to increase the accuracy of detection. If channel condition is good, the parameter M is set to low value to reduce complexity of detection. Therefore, the proposed detection scheme has better trade off between BER performance and complexity than the conventional detection scheme. The simulation result shows that the complexity of proposed detection scheme is lower than QRD-M detection scheme with similar BER performance.

Keywords: MIMO-OFDM, QRD-M, channel condition, BER

Procedia PDF Downloads 331
4010 Reduced Complexity of ML Detection Combined with DFE

Authors: Jae-Hyun Ro, Yong-Jun Kim, Chang-Bin Ha, Hyoung-Kyu Song

Abstract:

In multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems, many detection schemes have been developed to improve the error performance and to reduce the complexity. Maximum likelihood (ML) detection has optimal error performance but it has very high complexity. Thus, this paper proposes reduced complexity of ML detection combined with decision feedback equalizer (DFE). The error performance of the proposed detection scheme is higher than the conventional DFE. But the complexity of the proposed scheme is lower than the conventional ML detection.

Keywords: detection, DFE, MIMO-OFDM, ML

Procedia PDF Downloads 574
4009 Improving the Performance of Requisition Document Online System for Royal Thai Army by Using Time Series Model

Authors: D. Prangchumpol

Abstract:

This research presents a forecasting method of requisition document demands for Military units by using Exponential Smoothing methods to analyze data. The data used in the forecast is an actual data requisition document of The Adjutant General Department. The results of the forecasting model to forecast the requisition of the document found that Holt–Winters’ trend and seasonality method of α=0.1, β=0, γ=0 is appropriate and matches for requisition of documents. In addition, the researcher has developed a requisition online system to improve the performance of requisition documents of The Adjutant General Department, and also ensuring that the operation can be checked.

Keywords: requisition, holt–winters, time series, royal thai army

Procedia PDF Downloads 279
4008 A Methodology for Automatic Diversification of Document Categories

Authors: Dasom Kim, Chen Liu, Myungsu Lim, Su-Hyeon Jeon, ByeoungKug Jeon, Kee-Young Kwahk, Namgyu Kim

Abstract:

Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, in the case of manual categorization, not only can the accuracy of the categorization be not guaranteed but the categorization also requires a large amount of time and huge costs. Many studies have been conducted towards the automatic creation of categories to solve the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because the methods work by assuming that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training using a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To overcome the limitation of the requirement of a multi-categorized training set by traditional multi-categorization algorithms, we previously proposed a new methodology that can extend a category of a single-categorized document to multiple categorizes by analyzing relationships among categories, topics, and documents. In this paper, we design a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.

Keywords: big data analysis, document classification, multi-category, text mining, topic analysis

Procedia PDF Downloads 239
4007 Cigarette Smoke Detection Based on YOLOV3

Authors: Wei Li, Tuo Yang

Abstract:

In order to satisfy the real-time and accurate requirements of cigarette smoke detection in complex scenes, a cigarette smoke detection technology based on the combination of deep learning and color features was proposed. Firstly, based on the color features of cigarette smoke, the suspicious cigarette smoke area in the image is extracted. Secondly, combined with the efficiency of cigarette smoke detection and the problem of network overfitting, a network model for cigarette smoke detection was designed according to YOLOV3 algorithm to reduce the false detection rate. The experimental results show that the method is feasible and effective, and the accuracy of cigarette smoke detection is up to 99.13%, which satisfies the requirements of real-time cigarette smoke detection in complex scenes.

Keywords: deep learning, computer vision, cigarette smoke detection, YOLOV3, color feature extraction

Procedia PDF Downloads 53
4006 Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison

Authors: Po-Fang Hsu, Chiching Wei

Abstract:

In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, it is a challenging task without recording or tracing the whole edited process. Under many temporal uncertainties, we explore the potentiality of our approach to proximate the accurate comparison to make sure which element blocks have a relation of edition with others. In the beginning, we apply a document layout analysis that combines traditional and modern technics to segment layouts in blocks of various types appropriately. Then we transform this issue into a problem of layout graph matching with textual awareness. Regarding graph matching, it is a long-studied problem with a broad range of applications. However, different from previous works focusing on visual images or structural layout, we also bring textual features into our model for adapting this domain. Specifically, based on the electronic document, we introduce an encoder to deal with the visual presentation decoding from PDF. Additionally, because the modifications can cause the inconsistency of document layout analysis between modified documents and the blocks can be merged and split, Sinkhorn divergence is adopted in our neural graph approach, which tries to overcome both these issues with many-to-many block matching. We demonstrate this on two categories of layouts, as follows., legal agreement and scientific articles, collected from our real-case datasets.

Keywords: document comparison, graph matching, graph neural network, modification similarity, multi-modal

Procedia PDF Downloads 149
4005 An Architecture for New Generation of Distributed Intrusion Detection System Based on Preventive Detection

Authors: H. Benmoussa, A. A. El Kalam, A. Ait Ouahman

Abstract:

The design and implementation of intrusion detection systems (IDS) remain an important area of research in the security of information systems. Despite the importance and reputation of the current intrusion detection systems, their efficiency and effectiveness remain limited as they should include active defense approach to allow anticipating and predicting intrusions before their occurrence. Consequently, they must be readapted. For this purpose we suggest a new generation of distributed intrusion detection system based on preventive detection approach and using intelligent and mobile agents. Our architecture benefits from mobile agent features and addresses some of the issues with centralized and hierarchical models. Also, it presents advantages in terms of increasing scalability and flexibility.

Keywords: Intrusion Detection System (IDS), preventive detection, mobile agents, distributed architecture

Procedia PDF Downloads 546
4004 Use of Interpretable Evolved Search Query Classifiers for Sinhala Documents

Authors: Prasanna Haddela

Abstract:

Document analysis is a well matured yet still active research field, partly as a result of the intricate nature of building computational tools but also due to the inherent problems arising from the variety and complexity of human languages. Breaking down language barriers is vital in enabling access to a number of recent technologies. This paper investigates the application of document classification methods to new Sinhalese datasets. This language is geographically isolated and rich with many of its own unique features. We will examine the interpretability of the classification models with a particular focus on the use of evolved Lucene search queries generated using a Genetic Algorithm (GA) as a method of document classification. We will compare the accuracy and interpretability of these search queries with other popular classifiers. The results are promising and are roughly in line with previous work on English language datasets.

Keywords: evolved search queries, Sinhala document classification, Lucene Sinhala analyzer, interpretable text classification, genetic algorithm

Procedia PDF Downloads 88
4003 Video Based Ambient Smoke Detection By Detecting Directional Contrast Decrease

Authors: Omair Ghori, Anton Stadler, Stefan Wilk, Wolfgang Effelsberg

Abstract:

Fire-related incidents account for extensive loss of life and material damage. Quick and reliable detection of occurring fires has high real world implications. Whereas a major research focus lies on the detection of outdoor fires, indoor camera-based fire detection is still an open issue. Cameras in combination with computer vision helps to detect flames and smoke more quickly than conventional fire detectors. In this work, we present a computer vision-based smoke detection algorithm based on contrast changes and a multi-step classification. This work accelerates computer vision-based fire detection considerably in comparison with classical indoor-fire detection.

Keywords: contrast analysis, early fire detection, video smoke detection, video surveillance

Procedia PDF Downloads 404
4002 Intrusion Detection Techniques in NaaS in the Cloud: A Review

Authors: Rashid Mahmood

Abstract:

The network as a service (NaaS) usage has been well-known from the last few years in the many applications, like mission critical applications. In the NaaS, prevention method is not adequate as the security concerned, so the detection method should be added to the security issues in NaaS. The authentication and encryption are considered the first solution of the NaaS problem whereas now these are not sufficient as NaaS use is increasing. In this paper, we are going to present the concept of intrusion detection and then survey some of major intrusion detection techniques in NaaS and aim to compare in some important fields.

Keywords: IDS, cloud, naas, detection

Procedia PDF Downloads 280
4001 Securing Web Servers by the Intrusion Detection System (IDS)

Authors: Yousef Farhaoui

Abstract:

An IDS is a tool which is used to improve the level of security. We present in this paper different architectures of IDS. We will also discuss measures that define the effectiveness of IDS and the very recent works of standardization and homogenization of IDS. At the end, we propose a new model of IDS called BiIDS (IDS Based on the two principles of detection) for securing web servers and applications by the Intrusion Detection System (IDS).

Keywords: intrusion detection, architectures, characteristic, tools, security, web server

Procedia PDF Downloads 384