Search results for: Jaccard’s similarity measure.
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1367

Search results for: Jaccard’s similarity measure.

1367 A New Similarity Measure Based On Edge Counting

Authors: T. Slimani, B. Ben Yaghlane, K. Mellouli

Abstract:

In the field of concepts, the measure of Wu and Palmer [1] has the advantage of being simple to implement and have good performances compared to the other similarity measures [2]. Nevertheless, the Wu and Palmer measure present the following disadvantage: in some situations, the similarity of two elements of an IS-A ontology contained in the neighborhood exceeds the similarity value of two elements contained in the same hierarchy. This situation is inadequate within the information retrieval framework. To overcome this problem, we propose a new similarity measure based on the Wu and Palmer measure. Our objective is to obtain realistic results for concepts not located in the same way. The obtained results show that compared to the Wu and Palmer approach, our measure presents a profit in terms of relevance and execution time.

Keywords: Hierarchy, IS-A ontology, Semantic Web, Similarity Measure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1488
1366 Effects of Introducing Similarity Measures into Artificial Bee Colony Approach for Optimization of Vehicle Routing Problem

Authors: P. Shunmugapriya, S. Kanmani, P. Jude Fredieric, U. Vignesh, J. Reman Justin, K. Vivek

Abstract:

Vehicle Routing Problem (VRP) is a complex combinatorial optimization problem and it is quite difficult to find an optimal solution consisting of a set of routes for vehicles whose total cost is minimum. Evolutionary and swarm intelligent (SI) algorithms play a vital role in solving optimization problems. While the SI algorithms perform search, the diversity between the solutions they exploit is very important. This is because of the need to avoid early convergence and to get an appropriate balance between the exploration and exploitation. Therefore, it is important to check how far the solutions are diverse. In this paper, we measure the similarity between solutions, which ABC exploits while optimizing VRP. The similar solutions found are discarded at the end of the iteration and only unique solutions are passed on to the next iteration. The bees of discarded solutions become scouts and they start searching for new solutions. This process is continued and results show that the solution is optimized at lesser number of iterations but with the overhead of computing similarity in all the iterations. The problem instance from Solomon benchmarked dataset has been used for evaluating the presented methodology.

Keywords: ABC algorithm, vehicle routing problem, optimization, Jaccard’s similarity measure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 845
1365 A Similarity Measure for Clustering and its Applications

Authors: Guadalupe J. Torres, Ram B. Basnet, Andrew H. Sung, Srinivas Mukkamala, Bernardete M. Ribeiro

Abstract:

This paper introduces a measure of similarity between two clusterings of the same dataset produced by two different algorithms, or even the same algorithm (K-means, for instance, with different initializations usually produce different results in clustering the same dataset). We then apply the measure to calculate the similarity between pairs of clusterings, with special interest directed at comparing the similarity between various machine clusterings and human clustering of datasets. The similarity measure thus can be used to identify the best (in terms of most similar to human) clustering algorithm for a specific problem at hand. Experimental results pertaining to the text categorization problem of a Portuguese corpus (wherein a translation-into-English approach is used) are presented, as well as results on the well-known benchmark IRIS dataset. The significance and other potential applications of the proposed measure are discussed.

Keywords: Clustering Algorithms, Clustering Applications, Similarity Measures, Text Clustering

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1571
1364 Similarity Measure Functions for Strategy-Based Biometrics

Authors: Roman V. Yampolskiy, Venu Govindaraju

Abstract:

Functioning of a biometric system in large part depends on the performance of the similarity measure function. Frequently a generalized similarity distance measure function such as Euclidian distance or Mahalanobis distance is applied to the task of matching biometric feature vectors. However, often accuracy of a biometric system can be greatly improved by designing a customized matching algorithm optimized for a particular biometric application. In this paper we propose a tailored similarity measure function for behavioral biometric systems based on the expert knowledge of the feature level data in the domain. We compare performance of a proposed matching algorithm to that of other well known similarity distance functions and demonstrate its superiority with respect to the chosen domain.

Keywords: Behavioral Biometrics, Euclidian Distance, Matching, Similarity Measure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1649
1363 Application of a Similarity Measure for Graphs to Web-based Document Structures

Authors: Matthias Dehmer, Frank Emmert Streib, Alexander Mehler, Jürgen Kilian, Max Mühlhauser

Abstract:

Due to the tremendous amount of information provided by the World Wide Web (WWW) developing methods for mining the structure of web-based documents is of considerable interest. In this paper we present a similarity measure for graphs representing web-based hypertext structures. Our similarity measure is mainly based on a novel representation of a graph as linear integer strings, whose components represent structural properties of the graph. The similarity of two graphs is then defined as the optimal alignment of the underlying property strings. In this paper we apply the well known technique of sequence alignments for solving a novel and challenging problem: Measuring the structural similarity of generalized trees. In other words: We first transform our graphs considered as high dimensional objects in linear structures. Then we derive similarity values from the alignments of the property strings in order to measure the structural similarity of generalized trees. Hence, we transform a graph similarity problem to a string similarity problem for developing a efficient graph similarity measure. We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based document structures.

Keywords: Graph similarity, hierarchical and directed graphs, hypertext, generalized trees, web structure mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1892
1362 A New Similarity Measure on Intuitionistic Fuzzy Sets

Authors: Binyamin Yusoff, Imran Taib, Lazim Abdullah, Abd Fatah Wahab

Abstract:

Intuitionistic fuzzy sets as proposed by Atanassov, have gained much attention from past and latter researchers for applications in various fields. Similarity measures between intuitionistic fuzzy sets were developed afterwards. However, it does not cater the conflicting behavior of each element evaluated. We therefore made some modification to the similarity measure of IFS by considering conflicting concept to the model. In this paper, we concentrate on Zhang and Fu-s similarity measures for IFSs and some examples are given to validate these similarity measures. A simple modification to Zhang and Fu-s similarity measures of IFSs was proposed to find the best result according to the use of degree of indeterminacy. Finally, we mark up with the application to real decision making problems.

Keywords: Intuitionistic fuzzy sets, similarity measures, multicriteriadecision making.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2847
1361 Measuring the Structural Similarity of Web-based Documents: A Novel Approach

Authors: Matthias Dehmer, Frank Emmert Streib, Alexander Mehler, Jürgen Kilian

Abstract:

Most known methods for measuring the structural similarity of document structures are based on, e.g., tag measures, path metrics and tree measures in terms of their DOM-Trees. Other methods measures the similarity in the framework of the well known vector space model. In contrast to these we present a new approach to measuring the structural similarity of web-based documents represented by so called generalized trees which are more general than DOM-Trees which represent only directed rooted trees.We will design a new similarity measure for graphs representing web-based hypertext structures. Our similarity measure is mainly based on a novel representation of a graph as strings of linear integers, whose components represent structural properties of the graph. The similarity of two graphs is then defined as the optimal alignment of the underlying property strings. In this paper we apply the well known technique of sequence alignments to solve a novel and challenging problem: Measuring the structural similarity of generalized trees. More precisely, we first transform our graphs considered as high dimensional objects in linear structures. Then we derive similarity values from the alignments of the property strings in order to measure the structural similarity of generalized trees. Hence, we transform a graph similarity problem to a string similarity problem. We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based documents.

Keywords: Graph similarity, hierarchical and directed graphs, hypertext, generalized trees, web structure mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2557
1360 Quick Similarity Measurement of Binary Images via Probabilistic Pixel Mapping

Authors: Adnan A. Y. Mustafa

Abstract:

In this paper we present a quick technique to measure the similarity between binary images. The technique is based on a probabilistic mapping approach and is fast because only a minute percentage of the image pixels need to be compared to measure the similarity, and not the whole image. We exploit the power of the Probabilistic Matching Model for Binary Images (PMMBI) to arrive at an estimate of the similarity. We show that the estimate is a good approximation of the actual value, and the quality of the estimate can be improved further with increased image mappings. Furthermore, the technique is image size invariant; the similarity between big images can be measured as fast as that for small images. Examples of trials conducted on real images are presented.

Keywords: Big images, binary images, similarity, matching.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 919
1359 Measuring Text-Based Semantics Relatedness Using WordNet

Authors: Madiha Khan, Sidrah Ramzan, Seemab Khan, Shahzad Hassan, Kamran Saeed

Abstract:

Measuring semantic similarity between texts is calculating semantic relatedness between texts using various techniques. Our web application (Measuring Relatedness of Concepts-MRC) allows user to input two text corpuses and get semantic similarity percentage between both using WordNet. Our application goes through five stages for the computation of semantic relatedness. Those stages are: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms against each keyword), Measuring Similarity (using keywords and synonyms, similarity is measured) and Visualization (graphical representation of similarity measure). Hence the user can measure similarity on basis of features as well. The end result is a percentage score and the word(s) which form the basis of similarity between both texts with use of different tools on same platform. In future work we look forward for a Web as a live corpus application that provides a simpler and user friendly tool to compare documents and extract useful information.

Keywords: GraphViz representation, semantic relatedness, similarity measurement, WordNet similarity.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 836
1358 A Context-Sensitive Algorithm for Media Similarity Search

Authors: Guang-Ho Cha

Abstract:

This paper presents a context-sensitive media similarity search algorithm. One of the central problems regarding media search is the semantic gap between the low-level features computed automatically from media data and the human interpretation of them. This is because the notion of similarity is usually based on high-level abstraction but the low-level features do not sometimes reflect the human perception. Many media search algorithms have used the Minkowski metric to measure similarity between image pairs. However those functions cannot adequately capture the aspects of the characteristics of the human visual system as well as the nonlinear relationships in contextual information given by images in a collection. Our search algorithm tackles this problem by employing a similarity measure and a ranking strategy that reflect the nonlinearity of human perception and contextual information in a dataset. Similarity search in an image database based on this contextual information shows encouraging experimental results.

Keywords: Context-sensitive search, image search, media search, similarity ranking, similarity search.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 639
1357 Combining Similarity and Dissimilarity Measurements for the Development of QSAR Models Applied to the Prediction of Antiobesity Activity of Drugs

Authors: Irene Luque Ruiz, Manuel Urbano Cuadrado, Miguel Ángel Gómez-Nieto

Abstract:

In this paper we study different similarity based approaches for the development of QSAR model devoted to the prediction of activity of antiobesity drugs. Classical similarity approaches are compared regarding to dissimilarity models based on the consideration of the calculation of Euclidean distances between the nonisomorphic fragments extracted in the matching process. Combining the classical similarity and dissimilarity approaches into a new similarity measure, the Approximate Similarity was also studied, and better results were obtained. The application of the proposed method to the development of quantitative structure-activity relationships (QSAR) has provided reliable tools for predicting of inhibitory activity of drugs. Acceptable results were obtained for the models presented here.

Keywords: Graph similarity, Nonisomorphic dissimilarity, Approximate similarity, Drugs activity prediction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1557
1356 New Graph Similarity Measurements based on Isomorphic and Nonisomorphic Data Fusion and their Use in the Prediction of the Pharmacological Behavior of Drugs

Authors: Irene Luque Ruiz, Manuel Urbano Cuadrado, Miguel Ángel Gómez-Nieto

Abstract:

New graph similarity methods have been proposed in this work with the aim to refining the chemical information extracted from molecules matching. For this purpose, data fusion of the isomorphic and nonisomorphic subgraphs into a new similarity measure, the Approximate Similarity, was carried out by several approaches. The application of the proposed method to the development of quantitative structure-activity relationships (QSAR) has provided reliable tools for predicting several pharmacological parameters: binding of steroids to the globulin-corticosteroid receptor, the activity of benzodiazepine receptor compounds, and the blood brain barrier permeability. Acceptable results were obtained for the models presented here.

Keywords: Graph similarity, Nonisomorphic dissimilarity, Approximate similarity, Drug activity prediction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1650
1355 A New Edit Distance Method for Finding Similarity in Dna Sequence

Authors: Patsaraporn Somboonsak, Mud-Armeen Munlin

Abstract:

The P-Bigram method is a string comparison methods base on an internal two characters-based similarity measure. The edit distance between two strings is the minimal number of elementary editing operations required to transform one string into the other. The elementary editing operations include deletion, insertion, substitution two characters. In this paper, we address the P-Bigram method to sole the similarity problem in DNA sequence. This method provided an efficient algorithm that locates all minimum operation in a string. We have been implemented algorithm and found that our program calculated that smaller distance than one string. We develop PBigram edit distance and show that edit distance or the similarity and implementation using dynamic programming. The performance of the proposed approach is evaluated using number edit and percentage similarity measures.

Keywords: Edit distance, String Matching, String Similarity

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3317
1354 Agglomerative Hierarchical Clustering Using the Tθ Family of Similarity Measures

Authors: Salima Kouici, Abdelkader Khelladi

Abstract:

In this work, we begin with the presentation of the Tθ family of usual similarity measures concerning multidimensional binary data. Subsequently, some properties of these measures are proposed. Finally the impact of the use of different inter-elements measures on the results of the Agglomerative Hierarchical Clustering Methods is studied.

Keywords: Binary data, similarity measure, Tθ measures, Agglomerative Hierarchical Clustering.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3445
1353 Incorporating Semantic Similarity Measure in Genetic Algorithm : An Approach for Searching the Gene Ontology Terms

Authors: Razib M. Othman, Safaai Deris, Rosli M. Illias, Hany T. Alashwal, Rohayanti Hassan, FarhanMohamed

Abstract:

The most important property of the Gene Ontology is the terms. These control vocabularies are defined to provide consistent descriptions of gene products that are shareable and computationally accessible by humans, software agent, or other machine-readable meta-data. Each term is associated with information such as definition, synonyms, database references, amino acid sequences, and relationships to other terms. This information has made the Gene Ontology broadly applied in microarray and proteomic analysis. However, the process of searching the terms is still carried out using traditional approach which is based on keyword matching. The weaknesses of this approach are: ignoring semantic relationships between terms, and highly depending on a specialist to find similar terms. Therefore, this study combines semantic similarity measure and genetic algorithm to perform a better retrieval process for searching semantically similar terms. The semantic similarity measure is used to compute similitude strength between two terms. Then, the genetic algorithm is employed to perform batch retrievals and to handle the situation of the large search space of the Gene Ontology graph. The computational results are presented to show the effectiveness of the proposed algorithm.

Keywords: Gene Ontology, Semantic similarity measure, Genetic algorithm, Ontology search

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1490
1352 Theoretical Analysis of the Effect of Accounting for Special Methods in Similarity-Based Cohesion Measurement

Authors: Jehad Al Dallal

Abstract:

Class cohesion is an important object-oriented software quality attributes, and it refers to the degree of relatedness of class attributes and methods. Several class cohesion measures are proposed in the literature, and the impact of considering the special methods (i.e., constructors, destructors, and access and delegation methods) in cohesion calculation is not thoroughly theoretically studied for most of them. In this paper, we address this issue for three popular similarity-based class cohesion measures. For each of the considered measures we theoretically study the impact of including or excluding special methods on the values that are obtained by applying the measure. This study is based on analyzing the definitions and formulas that are proposed for the measures. The results show that including/excluding special methods has a considerable effect on the obtained cohesion values and that this effect varies from one measure to another. The study shows the importance of considering the types of methods that have to be accounted for when proposing a similarity-based cohesion measure.

Keywords: Object-oriented class, software quality, class cohesion measure, class cohesion, special methods.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1670
1351 Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

Authors: S.Aranganayagi, K.Thangavel

Abstract:

K-Modes is an extension of K-Means clustering algorithm, developed to cluster the categorical data, where the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching or mismatching measure. Weight of attribute values contribute much in clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency of attribute values in the cluster and in the data set. The new weighted measure is experimented with the data sets obtained from the UCI data repository. The results are compared with K-Modes and K-representative, which show that the new measure generates clusters with high purity.

Keywords: Clustering, categorical data, K-Modes, weighted dissimilarity measure

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3689
1350 Arabic Word Semantic Similarity

Authors: Faaza A, Almarsoomi, James D, O'Shea, Zuhair A, Bandar, Keeley A, Crockett

Abstract:

This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset. It is the first of its kind for Arabic which was particularly developed to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component to numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and best possible available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS and hopefully it will be considered as a reference basis from which to evaluate and compare different methodologies in the field.

Keywords: Arabic categories, benchmark dataset, semantic similarity, word pair, stimulus Arabic words

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3106
1349 On Solution of Interval Valued Intuitionistic Fuzzy Assignment Problem Using Similarity Measure and Score Function

Authors: Gaurav Kumar, Rakesh Kumar Bajaj

Abstract:

The primary objective of the paper is to propose a new method for solving assignment problem under uncertain situation. In the classical assignment problem (AP), zpqdenotes the cost for assigning the qth job to the pth person which is deterministic in nature. Here in some uncertain situation, we have assigned a cost in the form of composite relative degree Fpq instead of  and this replaced cost is in the maximization form. In this paper, it has been solved and validated by the two proposed algorithms, a new mathematical formulation of IVIF assignment problem has been presented where the cost has been considered to be an IVIFN and the membership of elements in the set can be explained by positive and negative evidences. To determine the composite relative degree of similarity of IVIFS the concept of similarity measure and the score function is used for validating the solution which is obtained by Composite relative similarity degree method. Further, hypothetical numeric illusion is conducted to clarify the method’s effectiveness and feasibility developed in the study. Finally, conclusion and suggestion for future work are also proposed.

Keywords: Assignment problem, Interval-valued Intuitionistic Fuzzy Sets, Similarity Measures, score function.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3012
1348 Another Approach of Similarity Solution in Reversed Stagnation-point Flow

Authors: Vai Kuong Sin, Chon Kit Chio

Abstract:

In this paper, the two-dimensional reversed stagnationpoint flow is solved by means of an anlytic approach. There are similarity solutions in case the similarity equation and the boundary condition are modified. Finite analytic method are applied to obtain the similarity velocity function.

Keywords: reversed stagnation-point flow, similarity solutions, asymptotic solution

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1749
1347 Image Similarity: A Genetic Algorithm Based Approach

Authors: R. C. Joshi, Shashikala Tapaswi

Abstract:

The paper proposes an approach using genetic algorithm for computing the region based image similarity. The image is denoted using a set of segmented regions reflecting color and texture properties of an image. An image is associated with a family of image features corresponding to the regions. The resemblance of two images is then defined as the overall similarity between two families of image features, and quantified by a similarity measure, which integrates properties of all the regions in the images. A genetic algorithm is applied to decide the most plausible matching. The performance of the proposed method is illustrated using examples from an image database of general-purpose images, and is shown to produce good results.

Keywords: Image Features, color descriptor, segmented classes, texture descriptors, genetic algorithm.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2326
1346 Graph Cuts Segmentation Approach Using a Patch-Based Similarity Measure Applied for Interactive CT Lung Image Segmentation

Authors: Aicha Majda, Abdelhamid El Hassani

Abstract:

Lung CT image segmentation is a prerequisite in lung CT image analysis. Most of the conventional methods need a post-processing to deal with the abnormal lung CT scans such as lung nodules or other lesions. The simplest similarity measure in the standard Graph Cuts Algorithm consists of directly comparing the pixel values of the two neighboring regions, which is not accurate because this kind of metrics is extremely sensitive to minor transformations such as noise or other artifacts problems. In this work, we propose an improved version of the standard graph cuts algorithm based on the Patch-Based similarity metric. The boundary penalty term in the graph cut algorithm is defined Based on Patch-Based similarity measurement instead of the simple intensity measurement in the standard method. The weights between each pixel and its neighboring pixels are Based on the obtained new term. The graph is then created using theses weights between its nodes. Finally, the segmentation is completed with the minimum cut/Max-Flow algorithm. Experimental results show that the proposed method is very accurate and efficient, and can directly provide explicit lung regions without any post-processing operations compared to the standard method.

Keywords: Graph cuts, lung CT scan, lung parenchyma segmentation, patch based similarity metric.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 743
1345 Multi-Objective Optimal Threshold Selection for Similarity Functions in Siamese Networks for Semantic Textual Similarity Tasks

Authors: Kriuk Boris, Kriuk Fedor

Abstract:

This paper presents a comparative study of fundamental similarity functions for Siamese networks in semantic textual similarity (STS) tasks. We evaluate various similarity functions using the STS Benchmark dataset, analyzing their performance and stability. Additionally, we present a multi-objective approach for optimal threshold selection. Our findings provide insights into the effectiveness of different similarity functions and offer a straightforward method for threshold selection optimization, contributing to the advancement of Siamese network architectures in STS applications.

Keywords: Siamese networks, Semantic textual similarity, Similarity functions, STS Benchmark dataset, Threshold selection.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 77
1344 A Review on Important Aspects of Information Retrieval

Authors: Yogesh Gupta, Ashish Saini, A.K. Saxena

Abstract:

Information retrieval has become an important field of study and research under computer science due to explosive growth of information available in the form of full text, hypertext, administrative text, directory, numeric or bibliographic text. The research work is going on various aspects of information retrieval systems so as to improve its efficiency and reliability. This paper presents a comprehensive study, which discusses not only emergence and evolution of information retrieval but also includes different information retrieval models and some important aspects such as document representation, similarity measure and query expansion.

Keywords: Information Retrieval, query expansion, similarity measure, query expansion, vector space model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3340
1343 Improving Similarity Search Using Clustered Data

Authors: Deokho Kim, Wonwoo Lee, Jaewoong Lee, Teresa Ng, Gun-Ill Lee, Jiwon Jeong

Abstract:

This paper presents a method for improving object search accuracy using a deep learning model. A major limitation to provide accurate similarity with deep learning is the requirement of huge amount of data for training pairwise similarity scores (metrics), which is impractical to collect. Thus, similarity scores are usually trained with a relatively small dataset, which comes from a different domain, causing limited accuracy on measuring similarity. For this reason, this paper proposes a deep learning model that can be trained with a significantly small amount of data, a clustered data which of each cluster contains a set of visually similar images. In order to measure similarity distance with the proposed method, visual features of two images are extracted from intermediate layers of a convolutional neural network with various pooling methods, and the network is trained with pairwise similarity scores which is defined zero for images in identical cluster. The proposed method outperforms the state-of-the-art object similarity scoring techniques on evaluation for finding exact items. The proposed method achieves 86.5% of accuracy compared to the accuracy of the state-of-the-art technique, which is 59.9%. That is, an exact item can be found among four retrieved images with an accuracy of 86.5%, and the rest can possibly be similar products more than the accuracy. Therefore, the proposed method can greatly reduce the amount of training data with an order of magnitude as well as providing a reliable similarity metric.

Keywords: Visual search, deep learning, convolutional neural network, machine learning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 825
1342 Towards Clustering of Web-based Document Structures

Authors: Matthias Dehmer, Frank Emmert Streib, Jürgen Kilian, Andreas Zulauf

Abstract:

Methods for organizing web data into groups in order to analyze web-based hypertext data and facilitate data availability are very important in terms of the number of documents available online. Thereby, the task of clustering web-based document structures has many applications, e.g., improving information retrieval on the web, better understanding of user navigation behavior, improving web users requests servicing, and increasing web information accessibility. In this paper we investigate a new approach for clustering web-based hypertexts on the basis of their graph structures. The hypertexts will be represented as so called generalized trees which are more general than usual directed rooted trees, e.g., DOM-Trees. As a important preprocessing step we measure the structural similarity between the generalized trees on the basis of a similarity measure d. Then, we apply agglomerative clustering to the obtained similarity matrix in order to create clusters of hypertext graph patterns representing navigation structures. In the present paper we will run our approach on a data set of hypertext structures and obtain good results in Web Structure Mining. Furthermore we outline the application of our approach in Web Usage Mining as future work.

Keywords: Clustering methods, graph-based patterns, graph similarity, hypertext structures, web structure mining

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1506
1341 Alphanumeric Hand-Prints Classification: Similarity Analysis between Local Decisions

Authors: G. Dimauro, S. Impedovo, M.G. Lucchese, R. Modugno, G. Pirlo

Abstract:

This paper presents the analysis of similarity between local decisions, in the process of alphanumeric hand-prints classification. From the analysis of local characteristics of handprinted numerals and characters, extracted by a zoning method, the set of classification decisions is obtained and the similarity among them is investigated. For this purpose the Similarity Index is used, which is an estimator of similarity between classifiers, based on the analysis of agreements between their decisions. The experimental tests, carried out using numerals and characters from the CEDAR and ETL database, respectively, show to what extent different parts of the patterns provide similar classification decisions.

Keywords: Handwriting Recognition, Optical Character Recognition, Similarity Index, Zoning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1308
1340 Approximately Similarity Measurement of Web Sites Using Genetic Algorithms and Binary Trees

Authors: Doru Anastasiu Popescu, Dan Rădulescu

Abstract:

In this paper, we determine the similarity of two HTML web applications. We are going to use a genetic algorithm in order to determine the most significant web pages of each application (we are not going to use every web page of a site). Using these significant web pages, we will find the similarity value between the two applications. The algorithm is going to be efficient because we are going to use a reduced number of web pages for comparisons but it will return an approximate value of the similarity. The binary trees are used to keep the tags from the significant pages. The algorithm was implemented in Java language.

Keywords: Tag, HTML, web page, genetic algorithm, similarity value, binary tree.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1309
1339 Enhancing Word Meaning Retrieval Using FastText and NLP Techniques

Authors: Sankalp Devanand, Prateek Agasimani, V. S. Shamith, Rohith Neeraje

Abstract:

Machine translation has witnessed significant advancements in recent years, but the translation of languages with distinct linguistic characteristics, such as English and Sanskrit, remains a challenging task. This research presents the development of a dedicated English to Sanskrit machine translation model, aiming to bridge the linguistic and cultural gap between these two languages. Using a variety of natural language processing (NLP) approaches including FastText embeddings, this research proposes a thorough method to improve word meaning retrieval. Data preparation, part-of-speech tagging, dictionary searches, and transliteration are all included in the methodology. The study also addresses the implementation of an interpreter pattern and uses a word similarity task to assess the quality of word embeddings. The experimental outcomes show how the suggested approach may be used to enhance word meaning retrieval tasks with greater efficacy, accuracy, and adaptability. Evaluation of the model's performance is conducted through rigorous testing, comparing its output against existing machine translation systems. The assessment includes quantitative metrics such as BLEU scores, METEOR scores, Jaccard Similarity etc.

Keywords: Machine translation, English to Sanskrit, natural language processing, word meaning retrieval, FastText embeddings.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 120
1338 Hybrid Reliability-Similarity-Based Approach for Supervised Machine Learning

Authors: Walid Cherif

Abstract:

Data mining has, over recent years, seen big advances because of the spread of internet, which generates everyday a tremendous volume of data, and also the immense advances in technologies which facilitate the analysis of these data. In particular, classification techniques are a subdomain of Data Mining which determines in which group each data instance is related within a given dataset. It is used to classify data into different classes according to desired criteria. Generally, a classification technique is either statistical or machine learning. Each type of these techniques has its own limits. Nowadays, current data are becoming increasingly heterogeneous; consequently, current classification techniques are encountering many difficulties. This paper defines new measure functions to quantify the resemblance between instances and then combines them in a new approach which is different from actual algorithms by its reliability computations. Results of the proposed approach exceeded most common classification techniques with an f-measure exceeding 97% on the IRIS Dataset.

Keywords: Data mining, knowledge discovery, machine learning, similarity measurement, supervised classification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1527