Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 69

Search results for: Semi--structured Document

69 Organization Model of Semantic Document Repository and Search Techniques for Studying Information Technology

Authors: Nhon Do, Thuong Huynh, An Pham

Abstract:

Nowadays, organizing a repository of documents and resources for learning on a special field as Information Technology (IT), together with search techniques based on domain knowledge or document-s content is an urgent need in practice of teaching, learning and researching. There have been several works related to methods of organization and search by content. However, the results are still limited and insufficient to meet user-s demand for semantic document retrieval. This paper presents a solution for the organization of a repository that supports semantic representation and processing in search. The proposed solution is a model which integrates components such as an ontology describing domain knowledge, a database of document repository, semantic representation for documents and a file system; with problems, semantic processing techniques and advanced search techniques based on measuring semantic similarity. The solution is applied to build a IT learning materials management system of a university with semantic search function serving students, teachers, and manager as well. The application has been implemented, tested at the University of Information Technology, Ho Chi Minh City, Vietnam and has achieved good results.

Keywords: document retrieval system, knowledgerepresentation, document representation, semantic search, ontology.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
68 Combining Color and Layout Features for the Identification of Low-resolution Documents

Authors: Ardhendu Behera, Denis Lalanne, Rolf Ingold

Abstract:

This paper proposes a method, combining color and layout features, for identifying documents captured from lowresolution handheld devices. On one hand, the document image color density surface is estimated and represented with an equivalent ellipse and on the other hand, the document shallow layout structure is computed and hierarchically represented. The combined color and layout features are arranged in a symbolic file, which is unique for each document and is called the document-s visual signature. Our identification method first uses the color information in the signatures in order to focus the search space on documents having a similar color distribution, and finally selects the document having the most similar layout structure in the remaining search space. Finally, our experiment considers slide documents, which are often captured using handheld devices.

Keywords: Document color modeling, document visual signature, kernel density estimation, document identification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
67 Color and Layout-based Identification of Documents Captured from Handheld Devices

Authors: Ardhendu Behera, Denis Lalanne, Rolf Ingold

Abstract:

This paper proposes a method, combining color and layout features, for identifying documents captured from low-resolution handheld devices. On one hand, the document image color density surface is estimated and represented with an equivalent ellipse and on the other hand, the document shallow layout structure is computed and hierarchically represented. Our identification method first uses the color information in the documents in order to focus the search space on documents having a similar color distribution, and finally selects the document having the most similar layout structure in the remaining of the search space.

Keywords: Document color modeling, document visualsignature, kernel density estimation, document identification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
66 Ultra High Speed Approach for Document Skew Detection and Correction Based On Centre of Gravity

Authors: Seyyed Yasser Hashemi

Abstract:

Skew detection and correction (SDC) has a direct effect in efficiency and exactitude of documents’ segmentation and analysis and thus is considered as a very important step in documents’ analysis field. Skew is a major problem in documents’ analysis for every language. For Arabic/Persian document scripts this problem is more severe because of special features of these languages. In this paper an efficient and fast algorithm for Document Skew Detection (DSD) based on the concept of segmentation and Center of Gravity (COG) is proposed. This algorithm is examined for 150 Arabic/Persian and English documents and SDC process are done successfully for 93 percent of documents with error rate of less than 1°. This algorithm shows better results for English documents compared to Arabic/Persian documents. The proposed method is also represents favorable results for handwritten, printed and also complicated documents such as newspapers and journals even with very low quality and resolution.

Keywords: Arabic/Persian document, Baseline, Centre of gravity, Document segmentation, Skew detection and correction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
65 Highlighting Document's Structure

Authors: Sylvie Ratté, Wilfried Njomgue, Pierre-André Ménard

Abstract:

In this paper, we present symbolic recognition models to extract knowledge characterized by document structures. Focussing on the extraction and the meticulous exploitation of the semantic structure of documents, we obtain a meaningful contextual tagging corresponding to different unit types (title, chapter, section, enumeration, etc.).

Keywords: Information retrieval, document structures, symbolic grammars.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
64 Persian/Arabic Document Segmentation Based On Pyramidal Image Structure

Authors: Seyyed Yasser Hashemi, Khalil Monfaredi

Abstract:

Automatic transformation of paper documents into electronic documents requires document segmentation at the first stage. However, some parameters restrictions such as variations in character font sizes, different text line spacing, and also not uniform document layout structures altogether have made it difficult to design a general-purpose document layout analysis algorithm for many years. Thus in most previously reported methods it is inevitable to include these parameters. This problem becomes excessively acute and severe, especially in Persian/Arabic documents. Since the Persian/Arabic scripts differ considerably from the English scripts, most of the proposed methods for the English scripts do not render good results for the Persian scripts. In this paper, we present a novel parameter-free method for segmenting the Persian/Arabic document images which also works well for English scripts. This method segments the document image into maximal homogeneous regions and identifies them as texts and non-texts based on a pyramidal image structure. In other words the proposed method is capable of document segmentation without considering the character font sizes, text line spacing, and document layout structures. This algorithm is examined for 150 Arabic/Persian and English documents and document segmentation process are done successfully for 96 percent of documents.

Keywords: Persian/Arabic document, document segmentation, Pyramidal Image Structure, skew detection and correction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
63 Compression of Semistructured Documents

Authors: Leo Galambos, Jan Lansky, Katsiaryna Chernik

Abstract:

EGOTHOR is a search engine that indexes the Web and allows us to search the Web documents. Its hit list contains URL and title of the hits, and also some snippet which tries to shortly show a match. The snippet can be almost always assembled by an algorithm that has a full knowledge of the original document (mostly HTML page). It implies that the search engine is required to store the full text of the documents as a part of the index. Such a requirement leads us to pick up an appropriate compression algorithm which would reduce the space demand. One of the solutions could be to use common compression methods, for instance gzip or bzip2, but it might be preferable if we develop a new method which would take advantage of the document structure, or rather, the textual character of the documents. There already exist a special compression text algorithms and methods for a compression of XML documents. The aim of this paper is an integration of the two approaches to achieve an optimal level of the compression ratio

Keywords: Compression, search engine, HTML, XML.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
62 A Study on Finding Similar Document with Multiple Categories

Authors: R. Saraçoğlu, N. Allahverdi

Abstract:

Searching similar documents and document management subjects have important place in text mining. One of the most important parts of similar document research studies is the process of classifying or clustering the documents. In this study, a similar document search approach that includes discussion of out the case of belonging to multiple categories (multiple categories problem) has been carried. The proposed method that based on Fuzzy Similarity Classification (FSC) has been compared with Rocchio algorithm and naive Bayes method which are widely used in text mining. Empirical results show that the proposed method is quite successful and can be applied effectively. For the second stage, multiple categories vector method based on information of categories regarding to frequency of being seen together has been used. Empirical results show that achievement is increased almost two times, when proposed method is compared with classical approach.

Keywords: Document similarity, Fuzzy classification, Multiple categories, Text mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
61 WebGD: A CORBA-based Document Classification and Retrieval System on the Web

Authors: Fuyang Peng, Bo Deng, Chao Qi, Mou Zhan

Abstract:

This paper presents the design and implementation of the WebGD, a CORBA-based document classification and retrieval system on Internet. The WebGD makes use of such techniques as Web, CORBA, Java, NLP, fuzzy technique, knowledge-based processing and database technology. Unified classification and retrieval model, classifying and retrieving with one reasoning engine and flexible working mode configuration are some of its main features. The architecture of WebGD, the unified classification and retrieval model, the components of the WebGD server and the fuzzy inference engine are discussed in this paper in detail.

Keywords: Text Mining, document classification, knowledgeprocessing, fuzzy logic, Web, CORBA

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
60 A Methodology for Automatic Diversification of Document Categories

Authors: Dasom Kim, Chen Liu, Myungsu Lim, Soo-Hyeon Jeon, Byeoung Kug Jeon, Kee-Young Kwahk, Namgyu Kim

Abstract:

Recently, numerous documents including large volumes of unstructured data and text have been created because of the rapid increase in the use of social media and the Internet. Usually, these documents are categorized for the convenience of users. Because the accuracy of manual categorization is not guaranteed, and such categorization requires a large amount of time and incurs huge costs. Many studies on automatic categorization have been conducted to help mitigate the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorize complex documents with multiple topics because they work on the assumption that individual documents can be categorized into single categories only. Therefore, to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, the learning process employed in these studies involves training using a multi-categorized document set. These methods therefore cannot be applied to the multi-categorization of most documents unless multi-categorized training sets using traditional multi-categorization algorithms are provided. To overcome this limitation, in this study, we review our novel methodology for extending the category of a single-categorized document to multiple categorizes, and then introduce a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.

Keywords: Big Data Analysis, Document Classification, Text Mining, Topic Analysis.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
59 Restoration of Noisy Document Images with an Efficient Bi-Level Adaptive Thresholding

Authors: Abhijit Mitra

Abstract:

An effective approach for extracting document images from a noisy background is introduced. The entire scheme is divided into three sub- stechniques – the initial preprocessing operations for noise cluster tightening, introduction of a new thresholding method by maximizing the ratio of stan- dard deviations of the combined effect on the image to the sum of weighted classes and finally the image restoration phase by image binarization utiliz- ing the proposed optimum threshold level. The proposed method is found to be efficient compared to the existing schemes in terms of computational complexity as well as speed with better noise rejection.

Keywords: Document image extraction, Preprocessing, Ratio of stan-dard deviations, Bi-level adaptive thresholding.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
58 Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition

Authors: L. Hamsaveni, Navya Prakash, Suresha

Abstract:

Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document images to obtain an original document with complete information. In case, degraded document image captured is skewed, it has to be straightened (deskew) to perform further process. A special format of image storing known as YCbCr is used as a tool to convert the Grayscale image to RGB image format. The presented algorithm is tested on various types of degraded documents such as printed documents, handwritten documents, old script documents and handwritten image sketches in documents. The purpose of this research is to obtain an original document for a given set of degraded documents of the same source.

Keywords: Grayscale image format, image fusing, SURF detection, YCbCr image format.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
57 Skew Detection Technique for Binary Document Images based on Hough Transform

Authors: Manjunath Aradhya V N, Hemantha Kumar G, Shivakumara P

Abstract:

Document image processing has become an increasingly important technology in the automation of office documentation tasks. During document scanning, skew is inevitably introduced into the incoming document image. Since the algorithm for layout analysis and character recognition are generally very sensitive to the page skew. Hence, skew detection and correction in document images are the critical steps before layout analysis. In this paper, a novel skew detection method is presented for binary document images. The method considered the some selected characters of the text which may be subjected to thinning and Hough transform to estimate skew angle accurately. Several experiments have been conducted on various types of documents such as documents containing English Documents, Journals, Text-Book, Different Languages and Document with different fonts, Documents with different resolutions, to reveal the robustness of the proposed method. The experimental results revealed that the proposed method is accurate compared to the results of well-known existing methods.

Keywords: Optical Character Recognition, Skew angle, Thinning, Hough transform, Document processing

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
56 Bottom Up Text Mining through Hierarchical Document Representation

Authors: Y. Djouadi., F. Souam.

Abstract:

Most of the existing text mining approaches are proposed, keeping in mind, transaction databases model. Thus, the mined dataset is structured using just one concept: the “transaction", whereas the whole dataset is modeled using the “set" abstract type. In such cases, the structure of the whole dataset and the relationships among the transactions themselves are not modeled and consequently, not considered in the mining process. We believe that taking into account structure properties of hierarchically structured information (e.g. textual document, etc ...) in the mining process, can leads to best results. For this purpose, an hierarchical associations rule mining approach for textual documents is proposed in this paper and the classical set-oriented mining approach is reconsidered profits to a Direct Acyclic Graph (DAG) oriented approach. Natural languages processing techniques are used in order to obtain the DAG structure. Based on this graph model, an hierarchical bottom up algorithm is proposed. The main idea is that each node is mined with its parent node.

Keywords: Graph based association rules mining, Hierarchical document structure, Text mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
55 A Keyword-Based Filtering Technique of Document-Centric XML using NFA Representation

Authors: Changwoo Byun, Kyounghan Lee, Seog Park

Abstract:

XML is becoming a de facto standard for online data exchange. Existing XML filtering techniques based on a publish/subscribe model are focused on the highly structured data marked up with XML tags. These techniques are efficient in filtering the documents of data-centric XML but are not effective in filtering the element contents of the document-centric XML. In this paper, we propose an extended XPath specification which includes a special matching character '%' used in the LIKE operation of SQL in order to solve the difficulty of writing some queries to adequately filter element contents using the previous XPath specification. We also present a novel technique for filtering a collection of document-centric XMLs, called Pfilter, which is able to exploit the extended XPath specification. We show several performance studies, efficiency and scalability using the multi-query processing time (MQPT).

Keywords: XML Data Stream, Document-centric XML, Filtering Technique, Value-based Predicates.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
54 Estimation of Skew Angle in Binary Document Images Using Hough Transform

Authors: Nandini N., Srikanta Murthy K., G. Hemantha Kumar

Abstract:

This paper includes two novel techniques for skew estimation of binary document images. These algorithms are based on connected component analysis and Hough transform. Both these methods focus on reducing the amount of input data provided to Hough transform. In the first method, referred as word centroid approach, the centroids of selected words are used for skew detection. In the second method, referred as dilate & thin approach, the selected characters are blocked and dilated to get word blocks and later thinning is applied. The final image fed to Hough transform has the thinned coordinates of word blocks in the image. The methods have been successful in reducing the computational complexity of Hough transform based skew estimation algorithms. Promising experimental results are also provided to prove the effectiveness of the proposed methods.

Keywords: Dilation, Document processing, Hough transform, Optical Character Recognition, Skew estimation, and Thinning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
53 Research on Applying the Continuity Care Document to Generate a Medical Record with Entry Level

Authors: Hsing-Yi Kao, Der-Ming Liou

Abstract:

Transferring patient information between medical care sites is necessary to deliver better patient care and to reduce medical cost. So developing of electronic medical records is an important trend for the world.The Continuity of Care Document (CCD) is product of collaboration between CDA and CCR standards. In this study, we will develop a system to generate medical records with entry level based on CCD template module.

Keywords: Continuity Care Document, medical record, entrylevel

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
52 Collaborative Document Evaluation: An Alternative Approach to Classic Peer Review

Authors: J. Beel, B. Gipp

Abstract:

Research papers are usually evaluated via peer review. However, peer review has limitations in evaluating research papers. In this paper, Scienstein and the new idea of 'collaborative document evaluation' are presented. Scienstein is a project to evaluate scientific papers collaboratively based on ratings, links, annotations and classifications by the scientific community using the internet. In this paper, critical success factors of collaborative document evaluation are analyzed. That is the scientists- motivation to participate as reviewers, the reviewers- competence and the reviewers- trustworthiness. It is shown that if these factors are ensured, collaborative document evaluation may prove to be a more objective, faster and less resource intensive approach to scientific document evaluation in comparison to the classical peer review process. It is shown that additional advantages exist as collaborative document evaluation supports interdisciplinary work, allows continuous post-publishing quality assessments and enables the implementation of academic recommendation engines. In the long term, it seems possible that collaborative document evaluation will successively substitute peer review and decrease the need for journals.

Keywords: Peer Review, Alternative, Collaboration, Document Evaluation, Rating, Annotations.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
51 Data Migration between Document-Oriented and Relational Databases

Authors: Bogdan Walek, Cyril Klimes

Abstract:

Current tools for data migration between documentoriented and relational databases have several disadvantages. We propose a new approach for data migration between documentoriented and relational databases. During data migration the relational schema of the target (relational database) is automatically created from collection of XML documents. Proposed approach is verified on data migration between document-oriented database IBM Lotus/ Notes Domino and relational database implemented in relational database management system (RDBMS) MySQL.

Keywords: data migration, database, document-oriented database, XML, relational schema

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
50 The Usefulness of Logical Structure in Flexible Document Categorization

Authors: Jebari Chaker, Ounalli Habib

Abstract:

This paper presents a new approach for automatic document categorization. Exploiting the logical structure of the document, our approach assigns a HTML document to one or more categories (thesis, paper, call for papers, email, ...). Using a set of training documents, our approach generates a set of rules used to categorize new documents. The approach flexibility is carried out with rule weight association representing your importance in the discrimination between possible categories. This weight is dynamically modified at each new document categorization. The experimentation of the proposed approach provides satisfactory results.

Keywords: categorization rule, document categorization, flexible categorization, logical structure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
49 Query Algebra for Semistuctured Data

Authors: Ei Ei Myat, Ni Lar Thein

Abstract:

With the tremendous growth of World Wide Web (WWW) data, there is an emerging need for effective information retrieval at the document level. Several query languages such as XML-QL, XPath, XQL, Quilt and XQuery are proposed in recent years to provide faster way of querying XML data, but they still lack of generality and efficiency. Our approach towards evolving a framework for querying semistructured documents is based on formal query algebra. Two elements are introduced in the proposed framework: first, a generic and flexible data model for logical representation of semistructured data and second, a set of operators for the manipulation of objects defined in the data model. In additional to accommodating several peculiarities of semistructured data, our model offers novel features such as bidirectional paths for navigational querying and partitions for data transformation that are not available in other proposals.

Keywords: Algebra, Semistructured data, Query Algebra.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
48 Sustainable Development in Construction

Authors: Ali Hemmati, Ali Kheyroddin

Abstract:

Semnan is a city in semnan province, northern Iran with a population estimated at 119,778 inhabitants. It is the provincial capital of semnan province. Iran is a developing country and construction is a basic factor of developing too. Hence, Semnan city needs to a special programming for construction of buildings, structures and infrastructures. Semnan municipality tries to begin this program. In addition to, city has some historical monuments which can be interesting for tourists. Hence, Semnan inhabitants can benefit from tourist industry. Optimization of Energy in construction industry is another activity of this municipality and the inhabitants who execute these regulations receive some discounts. Many parts of Iran such as semnan are located in highly seismic zones and structures must be constructed safe e.g., according to recent seismic codes. In this paper opportunities of IT in construction industry of Iran are investigated in three categories. Pre-construction phase, construction phase and earthquake disaster mitigation are studied. Studies show that information technology can be used in these items for reducing the losses and increasing the benefits. Both government and private sectors must contribute to this strategic project for obtaining the best result.

Keywords: approval, building, construction, document, industry, IT, Semnan

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
47 Auto Classification for Search Intelligence

Authors: Lilac A. E. Al-Safadi

Abstract:

This paper proposes an auto-classification algorithm of Web pages using Data mining techniques. We consider the problem of discovering association rules between terms in a set of Web pages belonging to a category in a search engine database, and present an auto-classification algorithm for solving this problem that are fundamentally based on Apriori algorithm. The proposed technique has two phases. The first phase is a training phase where human experts determines the categories of different Web pages, and the supervised Data mining algorithm will combine these categories with appropriate weighted index terms according to the highest supported rules among the most frequent words. The second phase is the categorization phase where a web crawler will crawl through the World Wide Web to build a database categorized according to the result of the data mining approach. This database contains URLs and their categories.

Keywords: Information Processing on the Web, Data Mining, Document Classification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
46 Recognition of Grocery Products in Images Captured by Cellular Phones

Authors: Farshideh Einsele, Hassan Foroosh

Abstract:

In this paper, we present a robust algorithm to recognize extracted text from grocery product images captured by mobile phone cameras. Recognition of such text is challenging since text in grocery product images varies in its size, orientation, style, illumination, and can suffer from perspective distortion. Pre-processing is performed to make the characters scale and rotation invariant. Since text degradations can not be appropriately defined using well-known geometric transformations such as translation, rotation, affine transformation and shearing, we use the whole character black pixels as our feature vector. Classification is performed with minimum distance classifier using the maximum likelihood criterion, which delivers very promising Character Recognition Rate (CRR) of 89%. We achieve considerably higher Word Recognition Rate (WRR) of 99% when using lower level linguistic knowledge about product words during the recognition process.

Keywords: Camera-based OCR, Feature extraction, Document and image processing.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
45 Entropy Based Data Hiding for Document Images

Authors: Swetha Kurup, Sridhar G., Sridhar V.

Abstract:

In this paper we present a novel technique for data hiding in binary document images. We use the concept of entropy in order to identify document specific least distortive areas throughout the binary document image. The document image is treated as any other image and the proposed method utilizes the standard document characteristics for the embedding process. Proposed method minimizes perceptual distortion due to embedding and allows watermark extraction without the requirement of any side information at the decoder end.

Keywords: Entropy, Steganography, Watermarking.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
44 A New Approach for Flexible Document Categorization

Authors: Jebari Chaker, Ounelli Habib

Abstract:

In this paper we propose a new approach for flexible document categorization according to the document type or genre instead of topic. Our approach implements two homogenous classifiers: contextual classifier and logical classifier. The contextual classifier is based on the document URL, whereas, the logical classifier use the logical structure of the document to perform the categorization. The final categorization is obtained by combining contextual and logical categorizations. In our approach, each document is assigned to all predefined categories with different membership degrees. Our experiments demonstrate that our approach is best than other genre categorization approaches.

Keywords: Categorization, combination, flexible, logicalstructure, genre, category, URL.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
43 Prediction of Writer Using Tamil Handwritten Document Image Based on Pooled Features

Authors: T. Thendral, M. S. Vijaya, S. Karpagavalli

Abstract:

Tamil handwritten document is taken as a key source of data to identify the writer. Tamil is a classical language which has 247 characters include compound characters, consonants, vowels and special character. Most characters of Tamil are multifaceted in nature. Handwriting is a unique feature of an individual. Writer may change their handwritings according to their frame of mind and this place a risky challenge in identifying the writer. A new discriminative model with pooled features of handwriting is proposed and implemented using support vector machine. It has been reported on 100% of prediction accuracy by RBF and polynomial kernel based classification model.

Keywords: Classification, Feature extraction, Support vector machine, Training, Writer.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
42 Prediction of Writer Using Tamil Handwritten Document Image Based on Pooled Features

Authors: T. Thendral, M. S. Vijaya, S. Karpagavalli

Abstract:

Tamil handwritten document is taken as a key source of data to identify the writer. Tamil is a classical language which has 247 characters include compound characters, consonants, vowels and special character. Most characters of Tamil are multifaceted in nature. Handwriting is a unique feature of an individual. Writer may change their handwritings according to their frame of mind and this place a risky challenge in identifying the writer. A new discriminative model with pooled features of handwriting is proposed and implemented using support vector machine. It has been reported on 100% of prediction accuracy by RBF and polynomial kernel based classification model.

Keywords: Classification, Feature extraction, Support vector machine, Training, Writer.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
41 Improved Zero Text Watermarking Algorithm against Meaning Preserving Attacks

Authors: Jalil Z., Farooq M., Zafar H., Sabir M., Ashraf E.

Abstract:

Internet is largely composed of textual contents and a huge volume of digital contents gets floated over the Internet daily. The ease of information sharing and re-production has made it difficult to preserve author-s copyright. Digital watermarking came up as a solution for copyright protection of plain text problem after 1993. In this paper, we propose a zero text watermarking algorithm based on occurrence frequency of non-vowel ASCII characters and words for copyright protection of plain text. The embedding algorithm makes use of frequency non-vowel ASCII characters and words to generate a specialized author key. The extraction algorithm uses this key to extract watermark, hence identify the original copyright owner. Experimental results illustrate the effectiveness of the proposed algorithm on text encountering meaning preserving attacks performed by five independent attackers.

Keywords: Copyright protection, Digital watermarking, Document authentication, Information security, Watermark.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
40 A Study of the Variability of Very Low Resolution Characters and the Feasibility of Their Discrimination Using Geometrical Features

Authors: Farshideh Einsele, Rolf Ingold

Abstract:

Current OCR technology does not allow to accurately recognizing small text images, such as those found in web images. Our goal is to investigate new approaches to recognize very low resolution text images containing antialiased character shapes. This paper presents a preliminary study on the variability of such characters and the feasibility to discriminate them by using geometrical features. In a first stage we analyze the distribution of these features. In a second stage we present a study on the discriminative power for recognizing isolated characters, using various rendering methods and font properties. Finally we present interesting results of our evaluation tests leading to our conclusion and future focus.

Keywords: World Wide Web, document analysis, pattern recognition, Optical Character Recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF