Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1932

Search results for: similarity search

1932 A Context-Sensitive Algorithm for Media Similarity Search

Authors: Guang-Ho Cha

Abstract:

This paper presents a context-sensitive media similarity search algorithm. One of the central problems in media search is the semantic gap between the low-level features computed automatically from media data and the human interpretation of them. The gap arises because the notion of similarity is usually based on high-level abstraction, whereas the low-level features sometimes fail to reflect human perception. Many media search algorithms have used the Minkowski metric to measure similarity between image pairs. However, such metrics cannot adequately capture the characteristics of the human visual system or the nonlinear relationships in the contextual information given by the images in a collection. Our search algorithm tackles this problem by employing a similarity measure and a ranking strategy that reflect the nonlinearity of human perception and the contextual information in a dataset. Similarity search in an image database based on this contextual information shows encouraging experimental results.
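
For readers unfamiliar with the baseline the abstract argues against, the Minkowski metric between two feature vectors can be sketched as follows. This is a minimal illustration only; the function name and the example vectors are assumptions and do not come from the paper, whose own context-sensitive measure is not reproduced here.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length feature vectors.

    p=1 gives the Manhattan distance and p=2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical low-level feature vectors for two images.
features_a = [0.0, 0.0]
features_b = [3.0, 4.0]
euclidean = minkowski_distance(features_a, features_b, p=2)
manhattan = minkowski_distance(features_a, features_b, p=1)
```

Because the distance depends only on the two vectors being compared, it ignores the rest of the collection, which is the limitation the abstract points to.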

Keywords: context-sensitive search, image search, similarity ranking, similarity search

Procedia PDF Downloads 224
1931 2D Fingerprint Performance for PubChem Chemical Database

Authors: Fatimah Zawani Abdullah, Shereena Mohd Arif, Nurul Malim

Abstract:

The study of molecular similarity search in chemical databases is increasingly widespread, especially in the area of drug discovery. Similarity search is an application in the field of chemoinformatics that measures the similarity between a molecular structure, known as the query, and the structures of chemical compounds in a database. Similarity search is also one of the approaches in virtual screening, which involves computational techniques and scoring the probabilities of activity. The main objective of this work is to determine which of the six fingerprints selected in this study performs best on the PubChem chemical dataset. This paper discusses the similarity searching process conducted using six types of descriptors (ECFP4, ECFC4, FCFP4, FCFC4, SRECFC4 and SRFCFC4) on 15 activity classes of the PubChem dataset, using the Tanimoto coefficient to calculate the similarity between the query structures and each database structure. The results suggest that ECFP4 performs best with the Tanimoto coefficient on the PubChem dataset.
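
The Tanimoto coefficient used for ranking can be sketched on binary fingerprints as below. The sketch assumes each fingerprint is given as the set of its "on" bit positions; the compound names and bit positions are invented for illustration and are not PubChem data, and count-based fingerprints such as ECFC4 would need the generalized form of the coefficient.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


def rank_by_similarity(query_fp, database):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in database.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: sets of bit positions that are switched on.
query = {1, 2, 3}
database = {
    "compound_a": {2, 3, 4},
    "compound_b": {1, 2, 3, 5},
    "compound_c": {7, 8},
}
ranking = rank_by_similarity(query, database)
```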

Keywords: 2D fingerprints, Tanimoto, PubChem, similarity searching, chemoinformatics

Procedia PDF Downloads 211
1930 Improving Similarity Search Using Clustered Data

Authors: Deokho Kim, Wonwoo Lee, Jaewoong Lee, Teresa Ng, Gun-Ill Lee, Jiwon Jeong

Abstract:

This paper presents a method for improving object search accuracy using a deep learning model. A major obstacle to obtaining accurate similarity with deep learning is the huge amount of data required to train pairwise similarity scores (metrics), which is impractical to collect. Thus, similarity scores are usually trained with a relatively small dataset that comes from a different domain, which limits the accuracy of the similarity measurement. For this reason, this paper proposes a deep learning model that can be trained with a significantly smaller amount of data: clustered data in which each cluster contains a set of visually similar images. To measure similarity distance with the proposed method, visual features of two images are extracted from intermediate layers of a convolutional neural network with various pooling methods, and the network is trained with pairwise similarity scores, which are defined as zero for images in the same cluster. The proposed method outperforms state-of-the-art object similarity scoring techniques in an evaluation on finding exact items, achieving an accuracy of 86.5% compared with 59.9% for the state-of-the-art technique. That is, an exact item can be found among four retrieved images with an accuracy of 86.5%, and the remaining retrieved images may still be similar products. Therefore, the proposed method can reduce the amount of training data by an order of magnitude while providing a reliable similarity metric.

Keywords: visual search, deep learning, convolutional neural network, machine learning

Procedia PDF Downloads 140
1929 Semantic Search Engine Based on Query Expansion with Google Ranking and Similarity Measures

Authors: Ahmad Shahin, Fadi Chakik, Walid Moudani

Abstract:

Our study elaborates a potential solution for a search engine that uses semantic technology to retrieve information and display it meaningfully. Semantic search engines are not widely used on the web, as the majority are still in the beta stage or under construction. Current semantic search applications face many problems; the major one is analyzing and calculating the meaning of a query in order to retrieve relevant information. Another problem is the ontology-based index and its updates. Ranking results according to concept meaning and its relation to the query is a further challenge. In this paper, we offer a light meta-engine (QESM) that uses Google search, and therefore Google’s index, and adapts the returned results by adding multi-query expansion. The goal was to find a reliable ranking algorithm that involves semantics and uses concepts and meanings to rank results. First, the engine finds synonyms of each query term entered by the user based on a lexical database. Then, query expansion is applied to generate different semantically analogous sentences, which are produced by randomly combining the found synonyms and the original query terms. Our model suggests the use of semantic similarity measures between two sentences. In practice, we used this method to calculate the semantic similarity between each query and the description of each page’s content generated by Google. The generated sentences are sent to the Google engine one by one, and the results are ranked again all together with the adapted ranking method (QESM). Finally, our system places the Google pages with higher similarities at the top of the results. We conducted experiments with 6 different queries. We observed that most results ranked by QESM were altered relative to Google’s original results. On the queries tested, QESM frequently achieves better accuracy than Google; in the worst cases, it behaves like Google.

Keywords: semantic search engine, Google indexing, query expansion, similarity measures

Procedia PDF Downloads 348
1928 A Similarity Measure for Classification and Clustering in Image Based Medical and Text Based Banking Applications

Authors: K. P. Sandesh, M. H. Suman

Abstract:

Text processing plays an important role in information retrieval, data mining, and web search. Measuring the similarity between documents is an important operation in the text processing field. In this project, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: (1) the feature appears in both documents; (2) the feature appears in only one document; and (3) the feature appears in neither document. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, especially in the banking and health sectors. The results show that the performance obtained by the proposed measure is better than that achieved by the other measures.

Keywords: document classification, document clustering, entropy, accuracy, classifiers, clustering algorithms

Procedia PDF Downloads 419
1927 Algorithms for Fast Computation of Pan Matrix Profiles of Time Series Under Unnormalized Euclidean Distances

Authors: Jing Zhang, Daniel Nikovski

Abstract:

We propose an approximation algorithm called LINKUMP to compute the Pan Matrix Profile (PMP) under the unnormalized l∞ distance (useful for value-based similarity search) using a double-ended queue and linear interpolation. The algorithm has time and space complexities comparable to those of the state-of-the-art algorithm for typical PMP computation under the normalized l₂ distance (useful for shape-based similarity search). We validate its efficiency and effectiveness through extensive numerical experiments and a real-world anomaly detection application.

Keywords: pan matrix profile, unnormalized euclidean distance, double-ended queue, discord discovery, anomaly detection

Procedia PDF Downloads 12
1926 3D Objects Indexing Using Spherical Harmonic for Optimum Measurement Similarity

Authors: S. Hellam, Y. Oulahrir, F. El Mounchid, A. Sadiq, S. Mbarki

Abstract:

In this paper, we propose a method for three-dimensional (3-D) model indexing based on a new descriptor that uses spherical harmonics. The purpose of the method is to minimize both the processing time on the database of object models and the time needed to search for objects similar to the query object. We first define the new descriptor using a new division of the 3-D object within a sphere. We then define a new distance, which is used to search for similar objects in the database.

Keywords: 3D indexation, spherical harmonic, similarity of 3D objects, measurement similarity

Procedia PDF Downloads 344
1925 Nazca: A Context-Based Matching Method for Searching Heterogeneous Structures

Authors: Karine B. de Oliveira, Carina F. Dorneles

Abstract:

Structure-level matching is the problem of combining elements of a structure, which can be represented as entities, classes, XML elements, web forms, and so on. This is a challenge due to the large number of distinct representations of semantically similar structures. This paper describes a structure-based matching method applied to search for different representations in data sources, considering the similarity between elements of two structures and the data source context. Using real data sources, we conducted an experimental study comparing our approach with our baseline implementation and with another important schema matching approach. We demonstrate that our proposal reaches higher precision than the baseline.

Keywords: context, data source, index, matching, search, similarity, structure

Procedia PDF Downloads 275
1924 3D Model Completion Based on Similarity Search with Slim-Tree

Authors: Alexis Aldo Mendoza Villarroel, Ademir Clemente Villena Zevallos, Cristian Jose Lopez Del Alamo

Abstract:

With the advancement of technology, it is now possible to scan entire objects and obtain their digital representation as point clouds or polygon meshes. However, some objects may be broken or have missing parts; thus, several methods addressing this problem have been proposed based on Geometric Deep Learning, such as GCNN, ACNN, and PointNet, among others. This article proposes an approach from a different paradigm: metric data structures are used to index global descriptors in the spectral domain and allow a set of similar models to be retrieved in polynomial time. The Iterative Closest Point algorithm is then used to recover the missing parts of the incomplete model from the geometry and topology of the retrieved model with the smallest Hausdorff distance.
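
The final selection step relies on the Hausdorff distance between point sets. A brute-force sketch is given below, assuming each model is a small list of 3-D points; the function names and sample coordinates are hypothetical, and the paper's indexing with a Slim-Tree and its alignment step are not shown.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest neighbour in B."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


def closest_model(incomplete, candidates):
    """Pick the candidate name with the smallest Hausdorff distance to the query."""
    return min(candidates, key=lambda name: hausdorff(incomplete, candidates[name]))


# Hypothetical point clouds (x, y, z).
incomplete = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidates = {
    "model_near": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
    "model_far": [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
}
best = closest_model(incomplete, candidates)
```

The brute-force version costs O(|A|·|B|) distance evaluations per pair, which is acceptable only for small point sets.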

Keywords: 3D reconstruction method, point cloud completion, shape completion, similarity search

Procedia PDF Downloads 45
1923 Clustering of Association Rules of ISIS & Al-Qaeda Based on Similarity Measures

Authors: Tamanna Goyal, Divya Bansal, Sanjeev Sofat

Abstract:

For terrorist attacks that threaten the world, early detection, distinction, and prediction are effective diagnostic techniques, and many data mining and statistical approaches exist to support functionally accurate and precise analysis of terrorism data. The computational extraction of derived patterns is a non-trivial task that comprises specific domain discovery by means of sophisticated algorithm design and analysis. This paper proposes an approach for similarity extraction that obtains the useful attributes from the available datasets of terrorist attacks, applies a feature selection technique based on statistical impurity measures, and then applies clustering techniques on the basis of similarity measures. The associative dependencies between the attacks are analyzed on the basis of the degree of participation of attributes in the rules. To compute the similarity among the discovered rules, we applied a weighted similarity measure. Finally, the rules are grouped using hierarchical clustering. We applied the approach to an open source dataset to determine the usability and efficiency of our technique, and a literature search was also carried out to support the efficiency and accuracy of our results.

Keywords: association rules, clustering, similarity measure, statistical approaches

Procedia PDF Downloads 227
1922 Pattern Recognition Search: An Advancement Over Interpolation Search

Authors: Shahpar Yilmaz, Yasir Nadeem, Syed A. Mehdi

Abstract:

Searching for a record in a dataset is a frequent task in any data structure-related application. Hence, a fast and efficient search algorithm is important for yielding the quickest results and enhancing the overall productivity of the company. Interpolation search is one such technique used to search through a sorted set of elements. This paper proposes a new algorithm, an advancement over interpolation search, for searching a sorted array. Pattern Recognition Search or PR Search (PRS), like interpolation search, is a pattern-based divide-and-conquer algorithm whose objective is to reduce the sample size in order to quicken the process. It does so by treating the array as a perfect arithmetic progression and thereby deducing the key element’s position. We highlight some of the key drawbacks of interpolation search, which are accounted for in the Pattern Recognition Search.
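
As background, the interpolation search that PRS builds on can be sketched as below. This is the classic baseline for a sorted list of integers, not the PRS algorithm itself, whose details the abstract does not give; the function name and sample data are illustrative.

```python
def interpolation_search(arr, key):
    """Return the index of key in the sorted integer list arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= key <= arr[hi]:
        if arr[lo] == arr[hi]:
            # All remaining values are equal, so avoid dividing by zero.
            return lo if arr[lo] == key else -1
        # Estimate the position as if values grew linearly between lo and hi.
        pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == key:
            return pos
        if arr[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


evenly_spaced = [10, 20, 30, 40, 50]
skewed = [1, 3, 7, 8, 20, 100]
found_even = interpolation_search(evenly_spaced, 40)
found_skewed = interpolation_search(skewed, 20)
missing = interpolation_search(evenly_spaced, 35)
```

On evenly spaced data the first probe lands on the key, while on skewed data the probe position creeps forward one step at a time, which is the kind of drawback the abstract refers to.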

Keywords: array, complexity, index, sorting, space, time

Procedia PDF Downloads 72
1921 Approximately Similarity Measurement of Web Sites Using Genetic Algorithms and Binary Trees

Authors: Doru Anastasiu Popescu, Dan Rădulescu

Abstract:

In this paper, we determine the similarity of two HTML web applications. We use a genetic algorithm to determine the most significant web pages of each application, rather than using every web page of a site. Using these significant web pages, we compute the similarity value between the two applications. The algorithm is efficient because it uses a reduced number of web pages for comparisons, but it returns an approximate value of the similarity. Binary trees are used to store the tags from the significant pages. The algorithm was implemented in the Java language.

Keywords: Tag, HTML, web page, genetic algorithm, similarity value, binary tree

Procedia PDF Downloads 277
1920 Measuring Text-Based Semantics Relatedness Using WordNet

Authors: Madiha Khan, Sidrah Ramzan, Seemab Khan, Shahzad Hassan, Kamran Saeed

Abstract:

Measuring semantic similarity between texts means calculating the semantic relatedness between them using various techniques. Our web application (Measuring Relatedness of Concepts, MRC) allows the user to input two text corpora and obtain the semantic similarity percentage between them using WordNet. The application goes through five stages to compute semantic relatedness: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms for each keyword), Measuring Similarity (similarity is measured using keywords and synonyms) and Visualization (graphical representation of the similarity measure). Hence the user can also measure similarity on the basis of features. The end result is a percentage score and the word(s) that form the basis of similarity between the two texts, obtained with different tools on the same platform. In future work, we plan a Web-as-a-live-corpus application that provides a simpler, user-friendly tool to compare documents and extract useful information.

Keywords: Graphviz representation, semantic relatedness, similarity measurement, WordNet similarity

Procedia PDF Downloads 123
1919 Quick Similarity Measurement of Binary Images via Probabilistic Pixel Mapping

Authors: Adnan A. Y. Mustafa

Abstract:

In this paper we present a quick technique to measure the similarity between binary images. The technique is based on a probabilistic mapping approach and is fast because only a minute percentage of the image pixels need to be compared to measure the similarity, and not the whole image. We exploit the power of the Probabilistic Matching Model for Binary Images (PMMBI) to arrive at an estimate of the similarity. We show that the estimate is a good approximation of the actual value, and the quality of the estimate can be improved further with increased image mappings. Furthermore, the technique is image size invariant; the similarity between big images can be measured as fast as that for small images. Examples of trials conducted on real images are presented.

Keywords: big images, binary images, image matching, image similarity

Procedia PDF Downloads 109
1918 Study on the Self-Location Estimate by the Evolutional Triangle Similarity Matching Using Artificial Bee Colony Algorithm

Authors: Yuji Kageyama, Shin Nagata, Tatsuya Takino, Izuru Nomura, Hiroyuki Kamata

Abstract:

In a previous study, a technique for estimating self-location using a lunar image was proposed. In this paper, we consider improving the conventional method with FPGA implementation in mind. Specifically, we introduce the Artificial Bee Colony algorithm to reduce search time. In addition, we use fixed-point arithmetic to enable high-speed operation on an FPGA.

Keywords: SLIM, Artificial Bee Colony Algorithm, location estimate, evolutional triangle similarity

Procedia PDF Downloads 424
1917 Review and Suggestions of the Similarity between Employee and Its Workplace

Authors: Gi Ryung Song, Kyoung Seok Kim

Abstract:

This study reviewed the literature on the similarity of various characteristics, such as values, personality, or demographics, between an employee and other elements of the organization, for example the employee's leader, job, and organization. We divided the body of this study into two parts. The first part organizes and presents recent studies, in which three issues appear: statistical ways of measuring similarity, supervisor-subordinate similarity, and person-organization fit together with person-job fit. In the second part, based on these three issues, we suggest three propositions about points that the recent studies missed or did not address. The first proposition concerns the direction of similarity, which can also be interpreted as a causal relation between the employee and the workplace environment. Second, we suggest considering the elimination of common variance buried in one’s characteristics or profiles. The third proposition concerns the similarity of extra-role behavior between the individual and the organization, where we treat the organization’s level of extra-role behavior as a kind of culture. Seen this way, similarity between the individual’s and the organization’s extra-role behavior indicates the individual’s congruence with the organizational culture.

Keywords: similarity, person-organization fit, supervisor-subordinate similarity, literature review

Procedia PDF Downloads 201
1916 On Privacy-Preserving Search in the Encrypted Domain

Authors: Chun-Shien Lu

Abstract:

Privacy-preserving query has recently received considerable attention in the signal processing and multimedia community. It is also a critical step in wireless sensor networks for the retrieval of sensitive data. The purposes of privacy-preserving query in signal processing and in sensor networks are the same, but the similarities and differences between the adopted technologies have not been fully explored. In this paper, we first review the recently developed methods of privacy-preserving query, and then describe in a comprehensive manner what the two areas can learn from each other.

Keywords: encryption, privacy-preserving, search, security

Procedia PDF Downloads 139
1915 Arabic Quran Search Tool Based on Ontology

Authors: Mohammad Alqahtani, Eric Atwell

Abstract:

This paper reviews and classifies most of the important types of search techniques that have been applied to the holy Quran. It then addresses the limitations of these techniques. Additionally, this paper surveys most existing Quranic ontologies and their deficiencies. Finally, it describes a new search tool called: A semantic search tool for Al Quran based on Qur’anic ontologies. This tool aims to overcome the limitations of the existing Quranic search applications.

Keywords: holy Quran, natural language processing (NLP), semantic search, information retrieval (IR), ontology

Procedia PDF Downloads 481
1914 Similarity Based Membership of Elements to Uncertain Concept in Information System

Authors: M. Kamel El-Sayed

Abstract:

The degree of membership of an element in an uncertain concept has been determined in many ways, using equivalence and symmetry relations in information systems. In the case of similarity, these methods did not take into account the degree of symmetry between elements. In this paper, we use a new definition for finding the membership based on the degree of symmetry. We provide an example to clarify the suggested method and compare it with previous methods. This method opens the door to more accurate decisions in information systems.

Keywords: information system, uncertain concept, membership function, similarity relation, degree of similarity

Procedia PDF Downloads 131
1913 Agglomerative Hierarchical Clustering Using the Tθ Family of Similarity Measures

Authors: Salima Kouici, Abdelkader Khelladi

Abstract:

In this work, we begin with the presentation of the Tθ family of usual similarity measures concerning multidimensional binary data. Subsequently, some properties of these measures are proposed. Finally, the impact of the use of different inter-elements measures on the results of the Agglomerative Hierarchical Clustering Methods is studied.

Keywords: binary data, similarity measure, Tθ measures, agglomerative hierarchical clustering

Procedia PDF Downloads 362
1912 Comparative Analysis of Dissimilarity Detection between Binary Images Based on Equivalency and Non-Equivalency of Image Inversion

Authors: Adnan A. Y. Mustafa

Abstract:

Image matching is a fundamental problem that arises frequently in many aspects of robot and computer vision. It can become a time-consuming process when matching images to a database consisting of hundreds of images, especially if the images are big. One approach to reducing the time complexity of the matching process is to reduce the search space in a pre-matching stage, by simply removing dissimilar images quickly. The Probabilistic Matching Model for Binary Images (PMMBI) showed that dissimilarity detection between binary images can be accomplished quickly by random pixel mapping and is size invariant. The model is based on the gamma binary similarity distance that recognizes an image and its inverse as containing the same scene and hence considers them to be the same image. However, in many applications, an image and its inverse are not treated as being the same but rather dissimilar. In this paper, we present a comparative analysis of dissimilarity detection between PMMBI based on the gamma binary similarity distance and a modified PMMBI model based on a similarity distance that does distinguish between an image and its inverse as being dissimilar.

Keywords: binary image, dissimilarity detection, probabilistic matching model for binary images, image mapping

Procedia PDF Downloads 61
1911 Empirical Study of Partitions Similarity Measures

Authors: Abdelkrim Alfalah, Lahcen Ouarbya, John Howroyd

Abstract:

This paper investigates and compares the performance of four existing distance and similarity measures between partitions. The partition measures considered are the Rand Index (RI), Adjusted Rand Index (ARI), Variation of Information (VI), and Normalised Variation of Information (NVI). This work investigates the ability of these partition measures to capture three predefined intuitions: the variation within randomly generated partitions, the sensitivity to small perturbations, and the independence from the dataset scale. The Adjusted Rand Index is shown to perform well overall with regard to these three intuitions.
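
Of the four measures, the plain Rand Index is the simplest to write down: it is the fraction of item pairs on which two partitions agree (both place the pair together, or both place it apart). The sketch below assumes each partition is given as a list of cluster labels; the label lists are invented for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree (needs >= 2 items)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    if len(labels_a) < 2:
        raise ValueError("at least two items are required")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Same grouping under different label names: full agreement.
same_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A crossing partition agrees on only 2 of the 6 pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The Rand Index is not corrected for chance agreement, which is what the Adjusted Rand Index adds.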

Keywords: clustering, comparing partitions, similarity measure, partition distance, partition metric, similarity between partitions, clustering comparison

Procedia PDF Downloads 89
1910 Improving Research by the Integration of a Collaborative Dimension in an Information Retrieval (IR) System

Authors: Amel Hannech, Mehdi Adda, Hamid Mcheick

Abstract:

In computer science, finding useful information remains one of the most active and important research topics. The most popular applications of information retrieval (IR) are search engines, which address users' specific needs and aim to locate useful information on the web. However, these search engines have limitations related to the relevancy of the results and the ease of exploring those results. In this context, we proposed in previous work a Multi-Space Search Engine model based on a multidimensional interpretation universe. In the present paper, we integrate an additional dimension that offers users new research experiences. The added component creates user profiles and calculates the similarity between them, which then allows collaborative filtering to be used in retrieving search results. To evaluate the effectiveness of the proposed model, a prototype was developed. The experiments showed that the additional dimension improved the relevancy of results by predicting the items of interest to users based on their experiences and the experiences of other similar users. The personalization service allows users to approve pertinent items, which enriches their profiles and further improves research.

Keywords: information retrieval, v-facets, user behavior analysis, user profiles, topical ontology, association rules, data personalization

Procedia PDF Downloads 193
1909 User Modeling from the Perspective of Improvement in Search Results: A Survey of the State of the Art

Authors: Samira Karimi-Mansoub, Rahem Abri

Abstract:

Currently, users expect high-quality, personalized information from search results. To satisfy users’ needs, personalized approaches to web search have been proposed. These approaches can provide the most appropriate answer to a user’s needs by using the user’s context and incorporating information about the query, combining several search technologies. Carrying out personalized web search requires applying different techniques across the whole user search process. Personalized approaches can be deployed in a number of ways, such as personalized web search, personalized recommendation, and personalized summarization and filtering systems, but the common feature of all approaches across domains is that user modeling is used to provide personalized information from the Web. The most important task in personalized approaches is therefore user model mining. User modeling applications and technologies can be used in various domains, depending on how the collected user information is extracted. In addition, the techniques used to create the user model differ in each of these applications. Since previous studies have not provided a complete survey of this field, our purpose is to present a survey of the applications and techniques of user modeling from the viewpoint of improving search results, based on the existing literature.

Keywords: filtering systems, personalized web search, user modeling, user search behavior

Procedia PDF Downloads 183
1908 Tool for Determining the Similarity between Two Web Applications

Authors: Doru Anastasiu Popescu, Raducanu Dragos Ionut

Abstract:

This paper presents a tool that measures the similarity between two websites. The websites are composed only of webpages created with HTML. The tool uses three ways of calculating the similarity between two websites, based on previously published results. The first way compares all the webpages within a website, the second compares a webpage with all the pages within the second website, and the third compares two webpages. The Java programming language and technologies such as Spring, Jsoup, and log4j were used to implement the tool.

Keywords: Java, Jsoup, HTML, spring

Procedia PDF Downloads 301
1907 The Application of Pareto Local Search to the Single-Objective Quadratic Assignment Problem

Authors: Abdullah Alsheddy

Abstract:

This paper presents the use of Pareto optimality as a strategy to help (single-objective) local search escape local optima. Instead of local search, Pareto local search is applied to solve the quadratic assignment problem, which is multi-objectivized by adding a helper objective. The additional objective is defined as a function of the primary one with augmented penalties that are dynamically updated.

Keywords: Pareto optimization, multi-objectivization, quadratic assignment problem, local search

Procedia PDF Downloads 355
1906 SC-LSH: An Efficient Indexing Method for Approximate Similarity Search in High Dimensional Space

Authors: Sanaa Chafik, Imane Daoudi, Mounim A. El Yacoubi, Hamid El Ouardi

Abstract:

Locality Sensitive Hashing (LSH) is one of the most promising techniques for solving the nearest neighbour search problem in high dimensional space. Euclidean LSH is the most popular variation of LSH and has been successfully applied in many multimedia applications. However, Euclidean LSH has limitations that affect its structure and query performance. Its main limitation is large memory consumption: achieving good accuracy requires a large number of hash tables. In this paper, we propose a new hashing algorithm that overcomes the storage space problem and improves query time while keeping an accuracy similar to that achieved by the original Euclidean LSH. The experimental results on a real large-scale dataset show that the proposed approach achieves good performance and consumes less memory than Euclidean LSH.
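
To make the memory issue concrete, a single Euclidean LSH table can be sketched as below, using the standard hash of the form floor((a·v + b) / w) with Gaussian projection vectors a and uniform offsets b. This illustrates the baseline only, not the proposed SC-LSH; the class name, parameter values, and sample vectors are assumptions.

```python
import math
import random
from collections import defaultdict


class EuclideanLSHTable:
    """One hash table keyed by a tuple of quantized random projections."""

    def __init__(self, dim, num_hashes=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.bucket_width = bucket_width
        # One Gaussian projection vector and one uniform offset per hash function.
        self.projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                            for _ in range(num_hashes)]
        self.offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_hashes)]
        self.buckets = defaultdict(list)

    def _key(self, vector):
        return tuple(
            math.floor((sum(a * v for a, v in zip(proj, vector)) + b)
                       / self.bucket_width)
            for proj, b in zip(self.projections, self.offsets)
        )

    def insert(self, item_id, vector):
        self.buckets[self._key(vector)].append(item_id)

    def query(self, vector):
        """Return the ids stored in the bucket that the query vector falls into."""
        return list(self.buckets.get(self._key(vector), []))


table = EuclideanLSHTable(dim=3)
table.insert("img_1", [0.10, 0.20, 0.30])
table.insert("img_2", [50.0, -40.0, 30.0])
candidates = table.query([0.10, 0.20, 0.30])
```

A practical index keeps many such tables with independent seeds and merges their candidate lists, which is where the memory cost described in the abstract comes from.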

Keywords: approximate nearest neighbor search, content based image retrieval (CBIR), curse of dimensionality, locality sensitive hashing, multidimensional indexing, scalability

Procedia PDF Downloads 254
1905 Text Similarity in Vector Space Models: A Comparative Study

Authors: Omid Shahmirzadi, Adam Lugowski, Kenneth Younge

Abstract:

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models on this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when: 1) the target text is condensed; and 2) the similarity comparison is trivial. In other cases, TFIDF performs surprisingly well, in particular for longer and more technical texts or for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
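
The TFIDF baseline discussed above can be sketched in a few lines: each document becomes a sparse vector of term weights, and similarity is the cosine between two vectors. The whitespace tokenizer, the smoothed IDF formula, and the sample texts are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical patent-like snippets.
texts = [
    "rotor blade assembly for wind turbine",
    "wind turbine rotor blade with reinforced assembly",
    "polymer coating method",
]
vectors = tfidf_vectors(texts)
related = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```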

Keywords: big data, patent, text embedding, text similarity, vector space model

Procedia PDF Downloads 64
1904 Static vs. Stream Mining Trajectories Similarity Measures

Authors: Musaab Riyadh, Norwati Mustapha, Dina Riyadh

Abstract:

Trajectory similarity can be defined as the cost of transforming one trajectory into another based on a certain similarity method. It is the core of numerous mining tasks such as clustering, classification, and indexing. Various approaches have been suggested to measure similarity based on the geometric and dynamic properties of a trajectory, the overlap between trajectory segments, and the confined area between entire trajectories. In this article, these approaches are evaluated in terms of computational cost, memory usage, accuracy, and the amount of data needed in advance, in order to determine their suitability for stream mining applications. The evaluation results show that, because of the high-speed generation of data, stream mining applications favor similarity methods that have low computational cost and memory usage, require a single scan of the data, and are free of mathematical complexity.

Keywords: global distance measure, local distance measure, semantic trajectory, spatial dimension, stream data mining

Procedia PDF Downloads 71
1903 Interactive, Topic-Oriented Search Support by a Centroid-Based Text Categorisation

Authors: Mario Kubek, Herwig Unger

Abstract:

Centroid terms are single words that semantically and topically characterise text documents and so may serve as their very compact representation in automatic text processing. In the present paper, centroids are used to measure the relevance of text documents with respect to a given search query. Thus, a new graph-based paradigm for searching texts in large corpora is proposed and evaluated against keyword-based methods. The first, promising experimental results demonstrate the usefulness of the centroid-based search procedure. It is shown that the routing of search queries in interactive and decentralised search systems, in particular, can be greatly improved by applying this approach. A detailed discussion of further fields of application completes this contribution.

Keywords: search algorithm, centroid, query, keyword, co-occurrence, categorisation

Procedia PDF Downloads 198