Search results for: Document similarity
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 630

Search results for: Document similarity

600 Color and Layout-based Identification of Documents Captured from Handheld Devices

Authors: Ardhendu Behera, Denis Lalanne, Rolf Ingold

Abstract:

This paper proposes a method, combining color and layout features, for identifying documents captured from low-resolution handheld devices. On one hand, the document image color density surface is estimated and represented with an equivalent ellipse and on the other hand, the document shallow layout structure is computed and hierarchically represented. Our identification method first uses the color information in the documents in order to focus the search space on documents having a similar color distribution, and finally selects the document having the most similar layout structure in the remaining of the search space.

Keywords: Document color modeling, document visualsignature, kernel density estimation, document identification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1570
599 Highlighting Document's Structure

Authors: Sylvie Ratté, Wilfried Njomgue, Pierre-André Ménard

Abstract:

In this paper, we present symbolic recognition models to extract knowledge characterized by document structures. Focussing on the extraction and the meticulous exploitation of the semantic structure of documents, we obtain a meaningful contextual tagging corresponding to different unit types (title, chapter, section, enumeration, etc.).

Keywords: Information retrieval, document structures, symbolic grammars.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1227
598 Collaborative Document Evaluation: An Alternative Approach to Classic Peer Review

Authors: J. Beel, B. Gipp

Abstract:

Research papers are usually evaluated via peer review. However, peer review has limitations in evaluating research papers. In this paper, Scienstein and the new idea of 'collaborative document evaluation' are presented. Scienstein is a project to evaluate scientific papers collaboratively based on ratings, links, annotations and classifications by the scientific community using the internet. In this paper, critical success factors of collaborative document evaluation are analyzed. That is the scientists- motivation to participate as reviewers, the reviewers- competence and the reviewers- trustworthiness. It is shown that if these factors are ensured, collaborative document evaluation may prove to be a more objective, faster and less resource intensive approach to scientific document evaluation in comparison to the classical peer review process. It is shown that additional advantages exist as collaborative document evaluation supports interdisciplinary work, allows continuous post-publishing quality assessments and enables the implementation of academic recommendation engines. In the long term, it seems possible that collaborative document evaluation will successively substitute peer review and decrease the need for journals.

Keywords: Peer Review, Alternative, Collaboration, Document Evaluation, Rating, Annotations.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1489
597 Data Extraction of XML Files using Searching and Indexing Techniques

Authors: Sushma Satpute, Vaishali Katkar, Nilesh Sahare

Abstract:

XML files contain data which is in well formatted manner. By studying the format or semantics of the grammar it will be helpful for fast retrieval of the data. There are many algorithms which describes about searching the data from XML files. There are no. of approaches which uses data structure or are related to the contents of the document. In these cases user must know about the structure of the document and information retrieval techniques using NLPs is related to content of the document. Hence the result may be irrelevant or not so successful and may take more time to search.. This paper presents fast XML retrieval techniques by using new indexing technique and the concept of RXML. When indexing an XML document, the system takes into account both the document content and the document structure and assigns the value to each tag from file. To query the system, a user is not constrained about fixed format of query.

Keywords: XML Retrieval, Indexed Search, Information Retrieval.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1783
596 Skew Detection Technique for Binary Document Images based on Hough Transform

Authors: Manjunath Aradhya V N, Hemantha Kumar G, Shivakumara P

Abstract:

Document image processing has become an increasingly important technology in the automation of office documentation tasks. During document scanning, skew is inevitably introduced into the incoming document image. Since the algorithm for layout analysis and character recognition are generally very sensitive to the page skew. Hence, skew detection and correction in document images are the critical steps before layout analysis. In this paper, a novel skew detection method is presented for binary document images. The method considered the some selected characters of the text which may be subjected to thinning and Hough transform to estimate skew angle accurately. Several experiments have been conducted on various types of documents such as documents containing English Documents, Journals, Text-Book, Different Languages and Document with different fonts, Documents with different resolutions, to reveal the robustness of the proposed method. The experimental results revealed that the proposed method is accurate compared to the results of well-known existing methods.

Keywords: Optical Character Recognition, Skew angle, Thinning, Hough transform, Document processing

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2095
595 Arabic Word Semantic Similarity

Authors: Faaza A, Almarsoomi, James D, O'Shea, Zuhair A, Bandar, Keeley A, Crockett

Abstract:

This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset. It is the first of its kind for Arabic which was particularly developed to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component to numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and best possible available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS and hopefully it will be considered as a reference basis from which to evaluate and compare different methodologies in the field.

Keywords: Arabic categories, benchmark dataset, semantic similarity, word pair, stimulus Arabic words

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3106
594 Persian/Arabic Document Segmentation Based On Pyramidal Image Structure

Authors: Seyyed Yasser Hashemi, Khalil Monfaredi

Abstract:

Automatic transformation of paper documents into electronic documents requires document segmentation at the first stage. However, some parameters restrictions such as variations in character font sizes, different text line spacing, and also not uniform document layout structures altogether have made it difficult to design a general-purpose document layout analysis algorithm for many years. Thus in most previously reported methods it is inevitable to include these parameters. This problem becomes excessively acute and severe, especially in Persian/Arabic documents. Since the Persian/Arabic scripts differ considerably from the English scripts, most of the proposed methods for the English scripts do not render good results for the Persian scripts. In this paper, we present a novel parameter-free method for segmenting the Persian/Arabic document images which also works well for English scripts. This method segments the document image into maximal homogeneous regions and identifies them as texts and non-texts based on a pyramidal image structure. In other words the proposed method is capable of document segmentation without considering the character font sizes, text line spacing, and document layout structures. This algorithm is examined for 150 Arabic/Persian and English documents and document segmentation process are done successfully for 96 percent of documents.

Keywords: Persian/Arabic document, document segmentation, Pyramidal Image Structure, skew detection and correction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1765
593 Shape-Based Image Retrieval Using Shape Matrix

Authors: C. Sheng, Y. Xin

Abstract:

Retrieval image by shape similarity, given a template shape is particularly challenging, owning to the difficulty to derive a similarity measurement that closely conforms to the common perception of similarity by humans. In this paper, a new method for the representation and comparison of shapes is present which is based on the shape matrix and snake model. It is scaling, rotation, translation invariant. And it can retrieve the shape images with some missing or occluded parts. In the method, the deformation spent by the template to match the shape images and the matching degree is used to evaluate the similarity between them.

Keywords: shape representation, shape matching, shape matrix, deformation

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1511
592 Similarity Based Membership of Elements to Uncertain Concept in Information System

Authors: M. Kamel El-Sayed

Abstract:

The process of determining the degree of membership for an element to an uncertain concept has been found in many ways, using equivalence and symmetry relations in information systems. In the case of similarity, these methods did not take into account the degree of symmetry between elements. In this paper, we use a new definition for finding the membership based on the degree of symmetry. We provide an example to clarify the suggested methods and compare it with previous methods. This method opens the door to more accurate decisions in information systems.

Keywords: Information system, uncertain concept, membership function, similarity relation, degree of similarity.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 810
591 Agglomerative Hierarchical Clustering Using the Tθ Family of Similarity Measures

Authors: Salima Kouici, Abdelkader Khelladi

Abstract:

In this work, we begin with the presentation of the Tθ family of usual similarity measures concerning multidimensional binary data. Subsequently, some properties of these measures are proposed. Finally the impact of the use of different inter-elements measures on the results of the Agglomerative Hierarchical Clustering Methods is studied.

Keywords: Binary data, similarity measure, Tθ measures, Agglomerative Hierarchical Clustering.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3445
590 Improving Similarity Search Using Clustered Data

Authors: Deokho Kim, Wonwoo Lee, Jaewoong Lee, Teresa Ng, Gun-Ill Lee, Jiwon Jeong

Abstract:

This paper presents a method for improving object search accuracy using a deep learning model. A major limitation to provide accurate similarity with deep learning is the requirement of huge amount of data for training pairwise similarity scores (metrics), which is impractical to collect. Thus, similarity scores are usually trained with a relatively small dataset, which comes from a different domain, causing limited accuracy on measuring similarity. For this reason, this paper proposes a deep learning model that can be trained with a significantly small amount of data, a clustered data which of each cluster contains a set of visually similar images. In order to measure similarity distance with the proposed method, visual features of two images are extracted from intermediate layers of a convolutional neural network with various pooling methods, and the network is trained with pairwise similarity scores which is defined zero for images in identical cluster. The proposed method outperforms the state-of-the-art object similarity scoring techniques on evaluation for finding exact items. The proposed method achieves 86.5% of accuracy compared to the accuracy of the state-of-the-art technique, which is 59.9%. That is, an exact item can be found among four retrieved images with an accuracy of 86.5%, and the rest can possibly be similar products more than the accuracy. Therefore, the proposed method can greatly reduce the amount of training data with an order of magnitude as well as providing a reliable similarity metric.

Keywords: Visual search, deep learning, convolutional neural network, machine learning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 825
589 Image Similarity: A Genetic Algorithm Based Approach

Authors: R. C. Joshi, Shashikala Tapaswi

Abstract:

The paper proposes an approach using genetic algorithm for computing the region based image similarity. The image is denoted using a set of segmented regions reflecting color and texture properties of an image. An image is associated with a family of image features corresponding to the regions. The resemblance of two images is then defined as the overall similarity between two families of image features, and quantified by a similarity measure, which integrates properties of all the regions in the images. A genetic algorithm is applied to decide the most plausible matching. The performance of the proposed method is illustrated using examples from an image database of general-purpose images, and is shown to produce good results.

Keywords: Image Features, color descriptor, segmented classes, texture descriptors, genetic algorithm.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2326
588 A Combination of Similarity Ranking and Time for Social Research Paper Searching

Authors: P. Jomsri

Abstract:

Nowadays social media are important tools for web resource discovery. The performance and capabilities of web searches are vital, especially search results from social research paper bookmarking. This paper proposes a new algorithm for ranking method that is a combination of similarity ranking with paper posted time or CSTRank. The paper posted time is static ranking for improving search results. For this particular study, the paper posted time is combined with similarity ranking to produce a better ranking than other methods such as similarity ranking or SimRank. The retrieval performance of combination rankings is evaluated using mean values of NDCG. The evaluation in the experiments implies that the chosen CSTRank ranking by using weight score at ratio 90:10 can improve the efficiency of research paper searching on social bookmarking websites.

Keywords: combination ranking, information retrieval, time, similarity ranking, static ranking, weight score

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1666
587 Impact of Similarity Ratings on Human Judgement

Authors: Ian A. McCulloh, Madelaine Zinser, Jesse Patsolic, Michael Ramos

Abstract:

Recommender systems are a common artificial intelligence (AI) application. For any given input, a search system will return a rank-ordered list of similar items. As users review returned items, they must decide when to halt the search and either revise search terms or conclude their requirement is novel with no similar items in the database. We present a statistically designed experiment that investigates the impact of similarity ratings on human judgement to conclude a search item is novel and halt the search. In the study, 450 participants were recruited from Amazon Mechanical Turk to render judgement across 12 decision tasks. We find the inclusion of ratings increases the human perception that items are novel. Percent similarity increases novelty discernment when compared with star-rated similarity or the absence of a rating. Ratings reduce the time to decide and improve decision confidence. This suggests that the inclusion of similarity ratings can aid human decision-makers in knowledge search tasks.

Keywords: Ratings, rankings, crowdsourcing, empirical studies, user studies, similarity measures, human-centered computing, novelty in information retrieval.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 417
586 Discovering the Dimension of Abstractness: Structure-Based Model that Learns New Categories and Categorizes on Different Levels of Abstraction

Authors: Georgi I. Petkov, Ivan I. Vankov, Yolina A. Petrova

Abstract:

A structure-based model of category learning and categorization at different levels of abstraction is presented. The model compares different structures and expresses their similarity implicitly in the forms of mappings. Based on this similarity, the model can categorize different targets either as members of categories that it already has or creates new categories. The model is novel using two threshold parameters to evaluate the structural correspondence. If the similarity between two structures exceeds the higher threshold, a new sub-ordinate category is created. Vice versa, if the similarity does not exceed the higher threshold but does the lower one, the model creates a new category on higher level of abstraction.

Keywords: Analogy-making, categorization, learning of categories, abstraction, hierarchical structure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 780
585 Comparative Analysis of Diversity and Similarity Indices with Special Relevance to Vegetations around Sewage Drains

Authors: Ekta Singh

Abstract:

Indices summarizing community structure are used to evaluate fundamental community ecology, species interaction, biogeographical factors, and environmental stress. Some of these indices are insensitive to gross community changes induced by contaminants of pollution. Diversity indices and similarity indices are reviewed considering their ecological application, both theoretical and practical. For some useful indices, empirical equations are given to calculate the expected maximum value of the indices to which the observed values can be related at any combination of sample sizes at the experimental sites. This paper examines the effects of sample size and diversity on the expected values of diversity indices and similarity indices, using various formulae. It has been shown that all indices are strongly affected by sample size and diversity. In some indices, this influence is greater than the others and an attempt has been made to deal with these influences.

Keywords: Biogeographical factors, Diversity Indices, Ecology and Similarity Indices

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2994
584 Investigation of Self-Similarity Solution for Wake Flow of a Cylinder

Authors: A. B. Khoshnevis, F. Zeydabadi, F. Sokhanvar

Abstract:

The data measurement of mean velocity has been taken for the wake of single circular cylinder with three different diameters for two different velocities. The effects of change in diameter and in velocity are studied in self-similar coordinate system. The spatial variations of velocity defect and that of the half-width have been investigated. The results are compared with those published by H.Schlichting. In the normalized coordinates, it is also observed that all cases except for the first station are self-similar. By attention to self-similarity profiles of mean velocity, it is observed for all the cases at the each station curves tend to zero at a same point.

Keywords: Self-similarity, wake of single circular cylinder

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2397
583 Improving Topic Quality of Scripts by Using Scene Similarity Based Word Co-Occurrence

Authors: Yunseok Noh, Chang-Uk Kwak, Sun-Joong Kim, Seong-Bae Park

Abstract:

Scripts are one of the basic text resources to understand broadcasting contents. Topic modeling is the method to get the summary of the broadcasting contents from its scripts. Generally, scripts represent contents descriptively with directions and speeches, and provide scene segments that can be seen as semantic units. Therefore, a script can be topic modeled by treating a scene segment as a document. Because scene segments consist of speeches mainly, however, relatively small co-occurrences among words in the scene segments are observed. This causes inevitably the bad quality of topics by statistical learning method. To tackle this problem, we propose a method to improve topic quality with additional word co-occurrence information obtained using scene similarities. The main idea of improving topic quality is that the information that two or more texts are topically related can be useful to learn high quality of topics. In addition, more accurate topical representations lead to get information more accurate whether two texts are related or not. In this paper, we regard two scene segments are related if their topical similarity is high enough. We also consider that words are co-occurred if they are in topically related scene segments together. By iteratively inferring topics and determining semantically neighborhood scene segments, we draw a topic space represents broadcasting contents well. In the experiments, we showed the proposed method generates a higher quality of topics from Korean drama scripts than the baselines.

Keywords: Broadcasting contents, generalized P´olya urn model, scripts, text similarity, topic model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1817
582 Map Matching Performance under Various Similarity Metrics for Heterogeneous Robot Teams

Authors: M. C. Akay, A. Aybakan, H. Temeltas

Abstract:

Aerial and ground robots have various advantages of usage in different missions. Aerial robots can move quickly and get a different sight of view of the area, but those vehicles cannot carry heavy payloads. On the other hand, unmanned ground vehicles (UGVs) are slow moving vehicles, since those can carry heavier payloads than unmanned aerial vehicles (UAVs). In this context, we investigate the performances of various Similarity Metrics to provide a common map for Heterogeneous Robot Team (HRT) in complex environments. Within the usage of Lidar Odometry and Octree Mapping technique, the local 3D maps of the environment are gathered.  In order to obtain a common map for HRT, informative theoretic similarity metrics are exploited. All types of these similarity metrics gave adequate as allowable simulation time and accurate results that can be used in different types of applications. For the heterogeneous multi robot team, those methods can be used to match different types of maps.

Keywords: Common maps, heterogeneous robot team, map matching, informative theoretic similarity metrics.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 900
581 Research on Applying the Continuity Care Document to Generate a Medical Record with Entry Level

Authors: Hsing-Yi Kao, Der-Ming Liou

Abstract:

Transferring patient information between medical care sites is necessary to deliver better patient care and to reduce medical cost. So developing of electronic medical records is an important trend for the world.The Continuity of Care Document (CCD) is product of collaboration between CDA and CCR standards. In this study, we will develop a system to generate medical records with entry level based on CCD template module.

Keywords: Continuity Care Document, medical record, entrylevel

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1992
580 A Proposed Hybrid Approach for Feature Selection in Text Document Categorization

Authors: M. F. Zaiyadi, B. Baharudin

Abstract:

Text document categorization involves large amount of data or features. The high dimensionality of features is a troublesome and can affect the performance of the classification. Therefore, feature selection is strongly considered as one of the crucial part in text document categorization. Selecting the best features to represent documents can reduce the dimensionality of feature space hence increase the performance. There were many approaches has been implemented by various researchers to overcome this problem. This paper proposed a novel hybrid approach for feature selection in text document categorization based on Ant Colony Optimization (ACO) and Information Gain (IG). We also presented state-of-the-art algorithms by several other researchers.

Keywords: Ant colony optimization, feature selection, information gain, text categorization, text representation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2069
579 Information Filtering using Index Word Selection based on the Topics

Authors: Takeru YOKOI, Hidekazu YANAGIMOTO, Sigeru OMATU

Abstract:

We have proposed an information filtering system using index word selection from a document set based on the topics included in a set of documents. This method narrows down the particularly characteristic words in a document set and the topics are obtained by Sparse Non-negative Matrix Factorization. In information filtering, a document is often represented with the vector in which the elements correspond to the weight of the index words, and the dimension of the vector becomes larger as the number of documents is increased. Therefore, it is possible that useless words as index words for the information filtering are included. In order to address the problem, the dimension needs to be reduced. Our proposal reduces the dimension by selecting index words based on the topics included in a document set. We have applied the Sparse Non-negative Matrix Factorization to the document set to obtain these topics. The filtering is carried out based on a centroid of the learning document set. The centroid is regarded as the user-s interest. In addition, the centroid is represented with a document vector whose elements consist of the weight of the selected index words. Using the English test collection MEDLINE, thus, we confirm the effectiveness of our proposal. Hence, our proposed selection can confirm the improvement of the recommendation accuracy from the other previous methods when selecting the appropriate number of index words. In addition, we discussed the selected index words by our proposal and we found our proposal was able to select the index words covered some minor topics included in the document set.

Keywords: Information Filtering, Sparse NMF, Index wordSelection, User Profile, Chi-squared Measure

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1456
578 A Proposed Approach for Emotion Lexicon Enrichment

Authors: Amr Mansour Mohsen, Hesham Ahmed Hassan, Amira M. Idrees

Abstract:

Document Analysis is an important research field that aims to gather the information by analyzing the data in documents. As one of the important targets for many fields is to understand what people actually want, sentimental analysis field has been one of the vital fields that are tightly related to the document analysis. This research focuses on analyzing text documents to classify each document according to its opinion. The aim of this research is to detect the emotions from text documents based on enriching the lexicon with adapting their content based on semantic patterns extraction. The proposed approach has been presented, and different experiments are applied by different perspectives to reveal the positive impact of the proposed approach on the classification results.

Keywords: Document analysis, sentimental analysis, emotion detection, WEKA tool, NRC Lexicon.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1456
577 Measuring Teachers- Beliefs about Mathematics: A Fuzzy Set Approach

Authors: M.A. Lazim, M.T.Abu Osman

Abstract:

This paper deals with the application of a fuzzy set in measuring teachers- beliefs about mathematics. The vagueness of beliefs was transformed into standard mathematical values using a fuzzy preferences model. The study employed a fuzzy approach questionnaire which consists of six attributes for measuring mathematics teachers- beliefs about mathematics. The fuzzy conjoint analysis approach based on fuzzy set theory was used to analyze the data from twenty three mathematics teachers from four secondary schools in Terengganu, Malaysia. Teachers- beliefs were recorded in form of degrees of similarity and its levels of agreement. The attribute 'Drills and practice is one of the best ways of learning mathematics' scored the highest degree of similarity at 0. 79860 with level of 'strongly agree'. The results showed that the teachers- beliefs about mathematics were varied. This is shown by different levels of agreement and degrees of similarity of the measured attributes.

Keywords: belief, membership function, degree of similarity, conjoint analysis

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2341
576 Ultra High Speed Approach for Document Skew Detection and Correction Based On Centre of Gravity

Authors: Seyyed Yasser Hashemi

Abstract:

Skew detection and correction (SDC) has a direct effect in efficiency and exactitude of documents’ segmentation and analysis and thus is considered as a very important step in documents’ analysis field. Skew is a major problem in documents’ analysis for every language. For Arabic/Persian document scripts this problem is more severe because of special features of these languages. In this paper an efficient and fast algorithm for Document Skew Detection (DSD) based on the concept of segmentation and Center of Gravity (COG) is proposed. This algorithm is examined for 150 Arabic/Persian and English documents and SDC process are done successfully for 93 percent of documents with error rate of less than 1°. This algorithm shows better results for English documents compared to Arabic/Persian documents. The proposed method is also represents favorable results for handwritten, printed and also complicated documents such as newspapers and journals even with very low quality and resolution.

Keywords: Arabic/Persian document, Baseline, Centre of gravity, Document segmentation, Skew detection and correction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1911
575 Advanced Information Extraction with n-gram based LSI

Authors: Ahmet Güven, Ö. Özgür Bozkurt, Oya Kalıpsız

Abstract:

Number of documents being created increases at an increasing pace while most of them being in already known topics and little of them introducing new concepts. This fact has started a new era in information retrieval discipline where the requirements have their own specialties. That is digging into topics and concepts and finding out subtopics or relations between topics. Up to now IR researches were interested in retrieving documents about a general topic or clustering documents under generic subjects. However these conventional approaches can-t go deep into content of documents which makes it difficult for people to reach to right documents they were searching. So we need new ways of mining document sets where the critic point is to know much about the contents of the documents. As a solution we are proposing to enhance LSI, one of the proven IR techniques by supporting its vector space with n-gram forms of words. Positive results we have obtained are shown in two different application area of IR domain; querying a document database, clustering documents in the document database.

Keywords: Document clustering, Information Extraction, Information Retrieval, LSI, n-gram.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1803
574 A Keyword-Based Filtering Technique of Document-Centric XML using NFA Representation

Authors: Changwoo Byun, Kyounghan Lee, Seog Park

Abstract:

XML is becoming a de facto standard for online data exchange. Existing XML filtering techniques based on a publish/subscribe model are focused on the highly structured data marked up with XML tags. These techniques are efficient in filtering the documents of data-centric XML but are not effective in filtering the element contents of the document-centric XML. In this paper, we propose an extended XPath specification which includes a special matching character '%' used in the LIKE operation of SQL in order to solve the difficulty of writing some queries to adequately filter element contents using the previous XPath specification. We also present a novel technique for filtering a collection of document-centric XMLs, called Pfilter, which is able to exploit the extended XPath specification. We show several performance studies, efficiency and scalability using the multi-query processing time (MQPT).

Keywords: XML Data Stream, Document-centric XML, Filtering Technique, Value-based Predicates.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1760
573 On Solution of Interval Valued Intuitionistic Fuzzy Assignment Problem Using Similarity Measure and Score Function

Authors: Gaurav Kumar, Rakesh Kumar Bajaj

Abstract:

The primary objective of the paper is to propose a new method for solving assignment problem under uncertain situation. In the classical assignment problem (AP), zpqdenotes the cost for assigning the qth job to the pth person which is deterministic in nature. Here in some uncertain situation, we have assigned a cost in the form of composite relative degree Fpq instead of  and this replaced cost is in the maximization form. In this paper, it has been solved and validated by the two proposed algorithms, a new mathematical formulation of IVIF assignment problem has been presented where the cost has been considered to be an IVIFN and the membership of elements in the set can be explained by positive and negative evidences. To determine the composite relative degree of similarity of IVIFS the concept of similarity measure and the score function is used for validating the solution which is obtained by Composite relative similarity degree method. Further, hypothetical numeric illusion is conducted to clarify the method’s effectiveness and feasibility developed in the study. Finally, conclusion and suggestion for future work are also proposed.

Keywords: Assignment problem, Interval-valued Intuitionistic Fuzzy Sets, Similarity Measures, score function.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3012
572 Similarity Detection in Collaborative Development of Object-Oriented Formal Specifications

Authors: Fathi Taibi, Fouad Mohammed Abbou, Md. Jahangir Alam

Abstract:

The complexity of today-s software systems makes collaborative development necessary to accomplish tasks. Frameworks are necessary to allow developers perform their tasks independently yet collaboratively. Similarity detection is one of the major issues to consider when developing such frameworks. It allows developers to mine existing repositories when developing their own views of a software artifact, and it is necessary for identifying the correspondences between the views to allow merging them and checking their consistency. Due to the importance of the requirements specification stage in software development, this paper proposes a framework for collaborative development of Object- Oriented formal specifications along with a similarity detection approach to support the creation, merging and consistency checking of specifications. The paper also explores the impact of using additional concepts on improving the matching results. Finally, the proposed approach is empirically evaluated.

Keywords: Collaborative Development, Formal methods, Object-Oriented, Similarity detection

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1469
571 Flocking Behaviors for Multiple Groups with Heterogeneous Agents

Authors: Jae Moon Lee

Abstract:

Most of researches for conventional simulations were studied focusing on flocks with a single species. While there exist the flocking behaviors with a single species in nature, the flocking behaviors are frequently observed with multi-species. This paper studies on the flocking simulation for heterogeneous agents. In order to simulate the flocks for heterogeneous agents, the conventional method uses the identifier of flock, while the proposed method defines the feature vector of agent and uses the similarity between agents by comparing with those feature vectors. Based on the similarity, the paper proposed the attractive force and repulsive force and then executed the simulation by applying two forces. The results of simulation showed that flock formation with heterogeneous agents is very natural in both cases. In addition, it showed that unlike the existing method, the proposed method can not only control the density of the flocks, but also be possible for two different groups of agents to flock close to each other if they have a high similarity.

Keywords: Flocking behavior, heterogeneous agents, similarity, simulation

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1599