Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 3138

Search results for: similarity measure

3138 A Similarity Measure for Classification and Clustering in Image Based Medical and Text Based Banking Applications

Authors: K. P. Sandesh, M. H. Suman

Abstract:

Text processing plays an important role in information retrieval, data mining, and web search. Measuring the similarity between documents is an important operation in the text processing field. In this project, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: (1) the feature appears in both documents; (2) the feature appears in only one document; and (3) the feature appears in neither document. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, especially in the banking and health sectors. The results show that the performance obtained by the proposed measure is better than that achieved by the other measures.

Keywords: document classification, document clustering, entropy, accuracy, classifiers, clustering algorithms

Procedia PDF Downloads 419
3137 Quick Similarity Measurement of Binary Images via Probabilistic Pixel Mapping

Authors: Adnan A. Y. Mustafa

Abstract:

In this paper we present a quick technique to measure the similarity between binary images. The technique is based on a probabilistic mapping approach and is fast because only a minute percentage of the image pixels need to be compared to measure the similarity, and not the whole image. We exploit the power of the Probabilistic Matching Model for Binary Images (PMMBI) to arrive at an estimate of the similarity. We show that the estimate is a good approximation of the actual value, and the quality of the estimate can be improved further with increased image mappings. Furthermore, the technique is image size invariant; the similarity between big images can be measured as fast as that for small images. Examples of trials conducted on real images are presented.
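
The idea of comparing only a small sample of pixels can be illustrated with a minimal sketch. This is not the PMMBI model from the abstract; it is a plain random-sampling estimate of the fraction of matching pixels, and the function names and the nested-list image representation are assumptions made for illustration.

```python
import random

def sampled_similarity(img_a, img_b, n_samples, seed=0):
    """Estimate the fraction of matching pixels from n_samples random positions.

    img_a and img_b are equally sized binary images given as lists of rows.
    The cost depends on n_samples, not on the image size.
    """
    rows, cols = len(img_a), len(img_a[0])
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_samples):
        r = rng.randrange(rows)
        c = rng.randrange(cols)
        if img_a[r][c] == img_b[r][c]:
            matches += 1
    return matches / n_samples

def exact_similarity(img_a, img_b):
    """Fraction of matching pixels computed over the whole image, for comparison."""
    total = sum(len(row) for row in img_a)
    same = sum(a == b for row_a, row_b in zip(img_a, img_b)
               for a, b in zip(row_a, row_b))
    return same / total
```

Increasing `n_samples` tightens the estimate at the price of more comparisons, which mirrors the trade-off described in the abstract.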

Keywords: big images, binary images, image matching, image similarity

Procedia PDF Downloads 109
3136 A Context-Sensitive Algorithm for Media Similarity Search

Authors: Guang-Ho Cha

Abstract:

This paper presents a context-sensitive media similarity search algorithm. One of the central problems in media search is the semantic gap between the low-level features computed automatically from media data and the human interpretation of them. This gap arises because the notion of similarity is usually based on high-level abstraction, whereas the low-level features sometimes fail to reflect human perception. Many media search algorithms have used the Minkowski metric to measure similarity between image pairs. However, such functions cannot adequately capture the characteristics of the human visual system or the nonlinear relationships in the contextual information given by the images in a collection. Our search algorithm tackles this problem by employing a similarity measure and a ranking strategy that reflect the nonlinearity of human perception and the contextual information in a dataset. Similarity search in an image database based on this contextual information shows encouraging experimental results.
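
For reference, the Minkowski metric that the abstract names as the common baseline can be sketched as follows. The feature vectors and the function name are illustrative assumptions; the authors' own context-sensitive measure is not reproduced here.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length feature vectors.

    p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

Because every feature dimension contributes independently and in the same way, a metric of this form has no access to the context of the other images in the collection, which is the limitation the abstract points to.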

Keywords: context-sensitive search, image search, similarity ranking, similarity search

Procedia PDF Downloads 224
3135 Measuring Text-Based Semantics Relatedness Using WordNet

Authors: Madiha Khan, Sidrah Ramzan, Seemab Khan, Shahzad Hassan, Kamran Saeed

Abstract:

Measuring semantic similarity between texts means calculating the semantic relatedness between them using various techniques. Our web application (Measuring Relatedness of Concepts, MRC) allows the user to input two text corpora and obtain a semantic similarity percentage between them using WordNet. The application goes through five stages to compute semantic relatedness: Preprocessing (extracts keywords from the content), Feature Extraction (classifies words into parts of speech), Synonyms Extraction (retrieves synonyms for each keyword), Measuring Similarity (measures similarity using keywords and synonyms), and Visualization (presents the similarity measure graphically). Hence, the user can also measure similarity on the basis of features. The end result is a percentage score together with the word(s) that form the basis of similarity between the two texts, obtained with different tools on the same platform. In future work, we look forward to a "Web as a live corpus" application that provides a simpler and more user-friendly tool for comparing documents and extracting useful information.

Keywords: Graphviz representation, semantic relatedness, similarity measurement, WordNet similarity

Procedia PDF Downloads 123
3134 Static vs. Stream Mining Trajectories Similarity Measures

Authors: Musaab Riyadh, Norwati Mustapha, Dina Riyadh

Abstract:

Trajectory similarity can be defined as the cost of transforming one trajectory into another based on a certain similarity method. It is the core of numerous mining tasks such as clustering, classification, and indexing. Various approaches have been suggested to measure similarity based on the geometric and dynamic properties of a trajectory, the overlap between trajectory segments, and the area confined between entire trajectories. In this article, these approaches are evaluated in terms of computational cost, memory usage, accuracy, and the amount of data needed in advance, in order to determine their suitability for stream mining applications. The evaluation results show that, because data is generated at high speed, stream mining applications favor similarity methods that have low computational cost and memory usage, require a single scan of the data, and are free of mathematical complexity.

Keywords: global distance measure, local distance measure, semantic trajectory, spatial dimension, stream data mining

Procedia PDF Downloads 72
3133 A Similarity/Dissimilarity Measure to Biological Sequence Alignment

Authors: Muhammad A. Khan, Waseem Shahzad

Abstract:

Protein sequences are analyzed in order to discover their structural and ancestral relationships. Sequence similarity helps to determine similar protein structure, similar function, and homology. Biological sequences composed of amino acid residues or nucleotides provide significant information through sequence alignment. In this paper, we present a new similarity/dissimilarity measure for sequence alignment based on the primary structure of a protein. The approach finds the distance between two given sequences using a novel sequence alignment algorithm and a mathematical model. The algorithm runs with a time complexity of O(n²). A distance matrix is generated to construct a phylogenetic tree of different species. The new similarity/dissimilarity measure outperforms other existing methods.
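
To give a feel for a quadratic-time sequence distance, here is a minimal sketch of the classic edit (Levenshtein) distance computed by dynamic programming. It is a stand-in chosen for illustration, not the authors' novel alignment algorithm or mathematical model, which the abstract does not spell out.

```python
def edit_distance(s, t):
    """Levenshtein distance between sequences s and t.

    Runs in O(len(s) * len(t)) time and keeps only two rows of the table.
    """
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]  # deleting the first i residues of s
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Applying such a function to every pair of sequences yields a symmetric distance matrix of the kind that tree-building methods take as input.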

Keywords: alignment, distance, homology, mathematical model, phylogenetic tree

Procedia PDF Downloads 95
3132 Clustering of Association Rules of ISIS & Al-Qaeda Based on Similarity Measures

Authors: Tamanna Goyal, Divya Bansal, Sanjeev Sofat

Abstract:

In world-threatening terrorist attacks, early detection, distinction, and prediction are effective diagnosis techniques, and many data mining and statistical approaches exist to ensure functionally accurate and precise analysis of terrorism data. The computational extraction of derived patterns is a non-trivial task which comprises specific domain discovery by means of sophisticated algorithm design and analysis. This paper proposes an approach for similarity extraction that obtains the useful attributes from the available datasets of terrorist attacks, applies a feature selection technique based on statistical impurity measures, and then applies clustering techniques based on similarity measures. The associative dependencies between the attacks are analyzed on the basis of the degree of participation of attributes in the rules. To compute the similarity among the discovered rules, we applied a weighted similarity measure. Finally, the rules are grouped using hierarchical clustering. We applied the approach to an open source dataset to determine the usability and efficiency of our technique, and a literature search was also carried out to support the efficiency and accuracy of our results.

Keywords: association rules, clustering, similarity measure, statistical approaches

Procedia PDF Downloads 227
3131 Agglomerative Hierarchical Clustering Using the Tθ Family of Similarity Measures

Authors: Salima Kouici, Abdelkader Khelladi

Abstract:

In this work, we begin by presenting the Tθ family of usual similarity measures for multidimensional binary data. Subsequently, some properties of these measures are proposed. Finally, the impact of using different inter-element measures on the results of Agglomerative Hierarchical Clustering methods is studied.

Keywords: binary data, similarity measure, Tθ measures, agglomerative hierarchical clustering

Procedia PDF Downloads 362
3130 Resume Ranking Using Custom Word2vec and Rule-Based Natural Language Processing Techniques

Authors: Subodh Chandra Shakya, Rajendra Sapkota, Aakash Tamang, Shushant Pudasaini, Sujan Adhikari, Sajjan Adhikari

Abstract:

Considerable effort has been made to measure the semantic similarity between text corpora in documents, and techniques have evolved to measure the similarity of two documents. One such state-of-the-art technique in the field of Natural Language Processing (NLP) is the word-to-vector model, which converts words into word embeddings and measures the similarity between the vectors. We found this to be quite useful for the task of resume ranking. This research paper therefore implements the word2vec model along with other Natural Language Processing techniques to rank resumes against a particular job description and so automate the hiring process. The paper presents the proposed system and the findings made while building it.

Keywords: chunking, document similarity, information extraction, natural language processing, word2vec, word embedding

Procedia PDF Downloads 49
3129 2D Fingerprint Performance for PubChem Chemical Database

Authors: Fatimah Zawani Abdullah, Shereena Mohd Arif, Nurul Malim

Abstract:

The study of molecular similarity search in chemical databases is increasingly widespread, especially in the area of drug discovery. Similarity search is an application in the field of chemoinformatics that measures the similarity between a molecular structure, known as the query, and the structures of the chemical compounds in a database. Similarity search is also one of the approaches in virtual screening, which involves computational techniques and scoring the probabilities of activity. The main objective of this work is to determine the best of the six fingerprints selected in this study using the PubChem chemical dataset. This paper discusses the similarity searching process conducted using six types of descriptors (ECFP4, ECFC4, FCFP4, FCFC4, SRECFC4, and SRFCFC4) on 15 activity classes of the PubChem dataset, using the Tanimoto coefficient to calculate the similarity between the query structures and each of the database structures. The results suggest that ECFP4 performs best when used with the Tanimoto coefficient on the PubChem dataset.
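
The Tanimoto coefficient used for the ranking can be sketched on binary fingerprints represented as sets of "on" bit positions. The representation, the function names, and the convention for two empty fingerprints are assumptions made for illustration; generating real ECFP4-style fingerprints requires a chemistry toolkit and is not shown.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints are identical
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def rank_database(query_fp, database):
    """Return (compound_id, score) pairs sorted by decreasing similarity.

    database maps a hypothetical compound id to its fingerprint.
    """
    scores = [(cid, tanimoto(query_fp, fp)) for cid, fp in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A similarity search then amounts to scoring the query against every database structure and inspecting the top of the ranked list.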

Keywords: 2D fingerprints, Tanimoto, PubChem, similarity searching, chemoinformatics

Procedia PDF Downloads 212
3128 Empirical Study of Partitions Similarity Measures

Authors: Abdelkrim Alfalah, Lahcen Ouarbya, John Howroyd

Abstract:

This paper investigates and compares the performance of four existing distance and similarity measures between partitions. The partition measures considered are the Rand Index (RI), Adjusted Rand Index (ARI), Variation of Information (VI), and Normalised Variation of Information (NVI). This work investigates the ability of these partition measures to capture three predefined intuitions: the variation within randomly generated partitions, the sensitivity to small perturbations, and the independence from the dataset scale. It is shown that the Adjusted Rand Index performed well overall with regard to these three intuitions.
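
As a concrete reference point, the plain Rand Index, the simplest of the four measures listed, can be sketched by counting the element pairs on which two partitions agree. Partitions are given here as label lists, which is an assumed representation; the chance-corrected ARI and the information-based VI and NVI are not shown.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand Index between two partitions of the same elements.

    A pair of elements counts as an agreement when both partitions put the
    pair together or both put it apart.
    """
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agreements += same_a == same_b
        pairs += 1
    return agreements / pairs
```

The index depends only on which elements are grouped together, so renaming the cluster labels leaves it unchanged.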

Keywords: clustering, comparing partitions, similarity measure, partition distance, partition metric, similarity between partitions, clustering comparison

Procedia PDF Downloads 89
3127 Graph Cuts Segmentation Approach Using a Patch-Based Similarity Measure Applied for Interactive CT Lung Image Segmentation

Authors: Aicha Majda, Abdelhamid El Hassani

Abstract:

Lung CT image segmentation is a prerequisite in lung CT image analysis. Most conventional methods need post-processing to deal with abnormal lung CT scans, such as those containing lung nodules or other lesions. The simplest similarity measure in the standard graph cuts algorithm directly compares the pixel values of two neighboring regions, which is not accurate because this kind of metric is extremely sensitive to minor perturbations such as noise or other artifacts. In this work, we propose an improved version of the standard graph cuts algorithm based on a patch-based similarity metric. The boundary penalty term in the graph cuts algorithm is defined using the patch-based similarity measurement instead of the simple intensity measurement of the standard method. The weights between each pixel and its neighboring pixels are based on the new term obtained. The graph is then created using these weights between its nodes. Finally, the segmentation is completed with the min-cut/max-flow algorithm. Experimental results show that the proposed method is very accurate and efficient and, unlike the standard method, can directly provide explicit lung regions without any post-processing operations.

Keywords: graph cuts, lung CT scan, lung parenchyma segmentation, patch-based similarity metric

Procedia PDF Downloads 92
3126 Improving Similarity Search Using Clustered Data

Authors: Deokho Kim, Wonwoo Lee, Jaewoong Lee, Teresa Ng, Gun-Ill Lee, Jiwon Jeong

Abstract:

This paper presents a method for improving object search accuracy using a deep learning model. A major obstacle to providing accurate similarity with deep learning is that training pairwise similarity scores (metrics) requires a huge amount of data, which is impractical to collect. Thus, similarity scores are usually trained with a relatively small dataset that comes from a different domain, which limits the accuracy of the similarity measurement. For this reason, this paper proposes a deep learning model that can be trained with a significantly smaller amount of data: clustered data in which each cluster contains a set of visually similar images. To measure the similarity distance with the proposed method, visual features of two images are extracted from intermediate layers of a convolutional neural network with various pooling methods, and the network is trained with pairwise similarity scores, which are defined as zero for images in the same cluster. The proposed method outperforms state-of-the-art object similarity scoring techniques in an evaluation on finding exact items, achieving an accuracy of 86.5% compared with 59.9% for the state-of-the-art technique. That is, an exact item can be found among four retrieved images with an accuracy of 86.5%, and the remaining retrieved images may still be similar products. Therefore, the proposed method can reduce the amount of training data by an order of magnitude while providing a reliable similarity metric.

Keywords: visual search, deep learning, convolutional neural network, machine learning

Procedia PDF Downloads 140
3125 Destination Port Detection For Vessels: An Analytic Tool For Optimizing Port Authorities Resources

Authors: Lubna Eljabu, Mohammad Etemad, Stan Matwin

Abstract:

Port authorities face many challenges in congested ports when allocating their resources to provide a safe and secure loading/unloading procedure for cargo vessels. Selecting a destination port is the decision of a vessel master based on many factors such as weather, wavelength and changes of priorities. Having access to a tool which leverages AIS messages to monitor vessels' movements and accurately predict their next destination port promotes an effective resource allocation process for port authorities. In this research, we propose a method, namely Reference Route of Trajectory (RRoT), to assist port authorities in predicting inflow and outflow traffic in their local environment by monitoring Automatic Identification System (AIS) messages. Our RRoT method creates a reference route based on historical AIS messages. It utilizes some of the best trajectory similarity measures to identify the destination of a vessel using its recent movement. We evaluated five different similarity measures: Discrete Fréchet Distance (DFD), Dynamic Time Warping (DTW), Partial Curve Mapping (PCM), Area between two curves (Area), and Curve length (CL). Our experiments show that our method identifies the destination port with an accuracy of 98.97% and an F-measure of 99.08% using the Dynamic Time Warping (DTW) similarity measure.
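
The best-performing measure in the abstract, Dynamic Time Warping, can be sketched in a few lines. This is the textbook DTW recurrence on 2D points, not the RRoT pipeline itself; the `closest_route` helper, the port names, and the coordinate tuples are hypothetical and only show how a recent track could be matched against reference routes.

```python
import math

def dtw_distance(traj_a, traj_b):
    """Dynamic Time Warping distance between two trajectories of (x, y) points."""
    inf = float("inf")
    n, m = len(traj_a), len(traj_b)
    # cost[i][j] = best cumulative cost aligning the first i and first j points
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of traj_b
                                 cost[i][j - 1],      # repeat a point of traj_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def closest_route(recent_track, reference_routes):
    """Return the key (hypothetical port name) of the most similar reference route."""
    return min(reference_routes,
               key=lambda port: dtw_distance(recent_track, reference_routes[port]))
```

Because DTW allows points to be repeated during alignment, two tracks sampled at different rates along the same path still come out as close.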

Keywords: spatial temporal data mining, trajectory mining, trajectory similarity, resource optimization

Procedia PDF Downloads 47
3124 Approximately Similarity Measurement of Web Sites Using Genetic Algorithms and Binary Trees

Authors: Doru Anastasiu Popescu, Dan Rădulescu

Abstract:

In this paper, we determine the similarity of two HTML web applications. We use a genetic algorithm to determine the most significant web pages of each application, rather than using every web page of a site. Using these significant web pages, we find the similarity value between the two applications. The algorithm is efficient because it uses a reduced number of web pages for comparisons, but it returns an approximate value of the similarity. Binary trees are used to store the tags from the significant pages. The algorithm was implemented in Java.

Keywords: Tag, HTML, web page, genetic algorithm, similarity value, binary tree

Procedia PDF Downloads 277
3123 Hybrid Reliability-Similarity-Based Approach for Supervised Machine Learning

Authors: Walid Cherif

Abstract:

Data mining has seen major advances in recent years because of the spread of the internet, which generates a tremendous volume of data every day, and because of the immense advances in technologies that facilitate the analysis of these data. In particular, classification techniques are a subdomain of data mining that determines to which group each data instance in a given dataset belongs. Classification is used to sort data into different classes according to desired criteria. Generally, a classification technique is based on either statistics or machine learning, and each type has its own limits. Data are becoming increasingly heterogeneous; consequently, current classification techniques are encountering many difficulties. This paper defines new measure functions to quantify the resemblance between instances and then combines them in a new approach that differs from existing algorithms in its reliability computations. The proposed approach outperformed the most common classification techniques, with an F-measure exceeding 97% on the IRIS dataset.

Keywords: data mining, knowledge discovery, machine learning, similarity measurement, supervised classification

Procedia PDF Downloads 373
3122 Review and Suggestions of the Similarity between Employee and Its Workplace

Authors: Gi Ryung Song, Kyoung Seok Kim

Abstract:

This study reviewed the literature that focuses on the similarity of various characteristics, such as values, personality, or demographics, between an employee and other elements of the organization, for example the leader, the job, and the organization itself. We divided the body of this study into two parts. The first part organizes and presents recent studies, from which three issues emerged: statistical ways of measuring similarity, supervisor-subordinate similarity, and person-organization fit together with person-job fit. In the second part, based on these three issues, we suggested three propositions about points that recent studies have missed or have not addressed. The first proposition concerns the direction of similarity, which can also be interpreted as a causal relation between employees and their workplace environments. Second, we suggested considering the elimination of the common variance buried in one’s characteristics or profiles. The third proposition concerns the similarity of extra-role behavior between the individual and the organization, where we treated the organization’s level of extra-role behavior as a kind of organizational culture. In this view, the similarity between an individual’s extra-role behavior and that of the organization represents the individual’s congruence with the organizational culture.

Keywords: similarity, person-organization fit, supervisor-subordinate similarity, literature review

Procedia PDF Downloads 203
3121 Top-K Shortest Distance as a Similarity Measure

Authors: Andrey Lebedev, Ilya Dmitrenok, JooYoung Lee, Leonard Johard

Abstract:

The top-k shortest path routing problem is an extension of finding the shortest path in a given network. The shortest path is one of the most essential measures, as it reveals the relations between two nodes in a network. However, in many real-world networks whose diameters are small, the top-k shortest paths are more interesting, as they contain more information about the network topology. Many variations for computing top-k shortest paths have been studied. In this paper, we apply an efficient top-k shortest distance routing algorithm to the link prediction problem and test its efficacy. We compare the results with other baseline and state-of-the-art methods as well as with the shortest path. We also propose a top-k distance-based graph matching algorithm.

Keywords: graph matching, link prediction, shortest path, similarity

Procedia PDF Downloads 276
3120 Similarity Based Membership of Elements to Uncertain Concept in Information System

Authors: M. Kamel El-Sayed

Abstract:

The degree of membership of an element in an uncertain concept has been determined in many ways, using equivalence and symmetry relations in information systems. In the case of similarity, these methods did not take into account the degree of symmetry between elements. In this paper, we use a new definition for finding the membership based on the degree of symmetry. We provide an example to clarify the suggested method and compare it with previous methods. This method opens the door to more accurate decisions in information systems.

Keywords: information system, uncertain concept, membership function, similarity relation, degree of similarity

Procedia PDF Downloads 131
3119 Tool for Determining the Similarity between Two Web Applications

Authors: Doru Anastasiu Popescu, Raducanu Dragos Ionut

Abstract:

This paper presents a tool that measures the similarity between two websites. The websites are composed only of web pages created with HTML. The tool uses three ways of calculating the similarity between two websites, based on previously published results. The first way compares all the web pages within a website, the second compares a web page with all the pages within the second website, and the third compares two web pages. The Java programming language and technologies such as Spring, Jsoup, and log4j were used to implement the tool.

Keywords: Java, Jsoup, HTML, Spring

Procedia PDF Downloads 301
3118 Text Similarity in Vector Space Models: A Comparative Study

Authors: Omid Shahmirzadi, Adam Lugowski, Kenneth Younge

Abstract:

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when: 1) the target text is condensed; and 2) the similarity comparison is trivial. In other cases, TFIDF performs surprisingly well, in particular for longer and more technical texts or for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
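
The TFIDF baseline discussed in the abstract can be sketched with the standard library alone. The whitespace tokenizer and the smoothed IDF formula below are assumptions chosen for brevity, not the exact weighting the authors evaluated.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Documents that share no terms score exactly zero under this scheme, which is one reason embedding methods can help on short, condensed texts.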

Keywords: big data, patent, text embedding, text similarity, vector space model

Procedia PDF Downloads 64
3117 Discovering the Dimension of Abstractness: Structure-Based Model that Learns New Categories and Categorizes on Different Levels of Abstraction

Authors: Georgi I. Petkov, Ivan I. Vankov, Yolina A. Petrova

Abstract:

A structure-based model of category learning and categorization at different levels of abstraction is presented. The model compares different structures and expresses their similarity implicitly in the form of mappings. Based on this similarity, the model can either categorize different targets as members of categories that it already has or create new categories. The model is novel in using two threshold parameters to evaluate the structural correspondence. If the similarity between two structures exceeds the higher threshold, a new subordinate category is created. Conversely, if the similarity does not exceed the higher threshold but does exceed the lower one, the model creates a new category at a higher level of abstraction.

Keywords: analogy-making, categorization, learning of categories, abstraction, hierarchical structure

Procedia PDF Downloads 79
3116 Graph Similarity: Algebraic Model and Its Application to Nonuniform Signal Processing

Authors: Nileshkumar Vishnav, Aditya Tatu

Abstract:

A recent approach that represents graph signals and graph filters as polynomials is useful for graph signal processing. In this approach, the adjacency matrix plays a pivotal role, in contrast to the more common approach involving the graph Laplacian. In this work, we follow the adjacency-matrix-based approach and the corresponding algebraic signal model. We further expand the theory and introduce the concept of similarity of two graphs. The similarity of graphs is useful in that key properties (such as the filter response and the algebra related to the graph) are transferred from one graph to another. We demonstrate potential applications of the relation between two similar graphs, such as nonuniform filter design, DTMF detection, and signal reconstruction.

Keywords: graph signal processing, algebraic signal processing, graph similarity, isospectral graphs, nonuniform signal processing

Procedia PDF Downloads 259
3115 Map Matching Performance under Various Similarity Metrics for Heterogeneous Robot Teams

Authors: M. C. Akay, A. Aybakan, H. Temeltas

Abstract:

Aerial and ground robots offer different advantages in different missions. Aerial robots can move quickly and obtain a different view of the area, but they cannot carry heavy payloads. Unmanned ground vehicles (UGVs), on the other hand, are slow-moving, but they can carry heavier payloads than unmanned aerial vehicles (UAVs). In this context, we investigate the performance of various similarity metrics for providing a common map for a Heterogeneous Robot Team (HRT) in complex environments. Local 3D maps of the environment are gathered using Lidar odometry and the octree mapping technique. In order to obtain a common map for the HRT, informative theoretic similarity metrics are exploited. All of these similarity metrics gave acceptable simulation times and accurate results that can be used in different types of applications. For a heterogeneous multi-robot team, these methods can be used to match different types of maps.

Keywords: common maps, heterogeneous robot team, map matching, informative theoretic similarity metrics

Procedia PDF Downloads 83
3114 3D Objects Indexing Using Spherical Harmonic for Optimum Measurement Similarity

Authors: S. Hellam, Y. Oulahrir, F. El Mounchid, A. Sadiq, S. Mbarki

Abstract:

In this paper, we propose a method for three-dimensional (3-D) model indexing based on a new descriptor that uses spherical harmonics. The purpose of the method is to minimize the processing time on the database of object models and the time needed to search for objects similar to the query object. We start by defining the new descriptor using a new division of the 3-D object in a sphere. Then we define a new distance, which is used in the search for similar objects in the database.

Keywords: 3D indexation, spherical harmonic, similarity of 3D objects, measurement similarity

Procedia PDF Downloads 344
3113 Analytical Similarity Assessment of Bevacizumab Biosimilar Candidate MB02 Using Multiple State-of-the-Art Assays

Authors: Marie-Elise Beydon, Daniel Sacristan, Isabel Ruppen

Abstract:

MB02 (Alymsys®) is a candidate biosimilar to bevacizumab, which was developed against the reference product (RP) Avastin® sourced from both the European Union (EU) and United States (US). MB02 has been extensively characterized in comparison with Avastin® at the physicochemical and biological levels using sensitive, orthogonal, state-of-the-art analytical methods. MB02 has been demonstrated to be similar to the RP with regard to its primary and higher-order structure and its post- and co-translational profiles, such as glycosylation, charge, and size variants. Specific focus has been put on the characterization of Fab-related activities, such as binding to VEGF A 165, which directly reflect the bevacizumab mechanism of action. Fc-related functionality was also investigated, including binding to FcRn, which is indicative of antibodies' half-life. The data generated during the analytical similarity assessment demonstrate the high analytical similarity of MB02 to its RP.

Keywords: analytical similarity, bevacizumab, biosimilar, MB02

Procedia PDF Downloads 120
3112 A Word-to-Vector Formulation for Word Representation

Authors: Sandra Rizkallah, Amir F. Atiya

Abstract:

This work presents a novel word to vector representation that is based on embedding the words into a sphere, whereby the dot product of the corresponding vectors represents the similarity between any two words. Embedding the vectors into a sphere enabled us to take into consideration the antonymity between words, not only the synonymity, because of the suitability to handle the polarity nature of words. For example, a word and its antonym can be represented as a vector and its negative. Moreover, we have managed to extract an adequate vocabulary. The obtained results show that the proposed approach can capture the essence of the language, and can be generalized to estimate a correct similarity of any new pair of words.
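
The geometric idea in the abstract, words as points on a sphere with the dot product as similarity and an antonym as the negated vector, can be illustrated with a toy sketch. The three-dimensional vector below is made up for illustration and is not taken from the authors' trained embedding.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def similarity(u, v):
    """Dot product of two unit vectors; ranges from -1 (opposite) to 1 (same)."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embedding of a word, and its antonym as the negated vector.
hot = normalize([1.0, 2.0, 2.0])
cold = [-x for x in hot]
```

With unit vectors the dot product equals the cosine of the angle between the words, so a word scores 1 against itself and -1 against its negated counterpart.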

Keywords: natural language processing, word to vector, text similarity, text mining

Procedia PDF Downloads 168
3111 Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance

Authors: Loai AbdAllah, Mahmoud Kaiyal

Abstract:

Missing values in real-world datasets are a common problem. Many algorithms have been developed to deal with this problem; most of them replace the missing values with a fixed value computed from the observed values. In our work, we used a distance function based on the Bhattacharyya distance, which measures the similarity of two probability distributions, to measure the distance between objects with missing values. The proposed distance distinguishes between known and unknown values: the distance between two known values is the Mahalanobis distance, whereas when one of them is missing, the distance is computed based on the distribution of the known values for the coordinate that contains the missing value. This method was integrated with Wikaya, a digital health company developing a platform that helps to improve the prevention of chronic diseases such as diabetes and cancer. For Wikaya’s recommendation system to work, the distance between users needs to be measured. Since there are missing values in the collected data, a distance function that measures distances between incomplete user profiles is needed. To evaluate the accuracy of the proposed distance function in reflecting the actual similarity between different objects when some of them contain missing values, we integrated it within the framework of the k nearest neighbors (kNN) classifier, since its computation is based only on the similarity between objects. To validate this, we ran the algorithm over diabetes and breast cancer datasets, standard benchmark datasets from the UCI repository. Our experiments show that the kNN classifier using our proposed distance function outperforms kNN using other existing methods.
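
For orientation, the Bhattacharyya distance between two discrete probability distributions can be sketched as below. This shows only the textbook definition on which the proposed function is said to be based; how the authors combine it with the Mahalanobis distance for incomplete records is not given in the abstract and is not reproduced here.

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p and q are equal-length sequences of probabilities that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Bhattacharyya coefficient: overlap between the two distributions
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)
```

Identical distributions have a coefficient of 1 and therefore a distance of 0, while distributions with no common support are infinitely far apart.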

Keywords: missing values, incomplete data, distance, incomplete diabetes data

Procedia PDF Downloads 136
3110 Garment Industry Development in South East Asia and Competitiveness

Authors: P. Nayak, Shakeel Shaikh

Abstract:

In this paper, we analyse the apparel export performance of Southeast Asian nations (ASEAN) in the world market. The study covers the 2003-2012 period at the sector as well as product levels (6-digit HS), and the analysis is based on the HS 2002 nomenclature. We measure export similarity among Southeast Asian nations for the apparel sector (two-digit HS 61 and 62), and analyse the products' performance in the world market through the Revealed Comparative Advantage (RCA) technique. Coupled with RCA, price as a factor of competitiveness was examined using the available Unit Value Realizations (UVR). In addition, resource availability, or outsourcing from the region, was considered as an extension to the analysis of competitiveness between the nations. With the help of these methodologies, we examine the degree of competition between the exports of Southeast Asian nations in the world market. Our results show that Cambodia, Indonesia, Thailand, and Vietnam are the well-performing states within ASEAN. The paper further delves into the sustainability of the export-performing countries within ASEAN.
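
The two indices named in this abstract can be sketched as follows. The abstract does not state which formulas were used, so this example assumes the standard Balassa form of RCA and the Finger-Kreinin form of the export similarity index; the function names and the dictionary input format are hypothetical.

```python
def revealed_comparative_advantage(country_product, country_total,
                                   world_product, world_total):
    """Balassa RCA: the country's export share of a product divided by
    the world's export share of that product. Values above 1 suggest a
    revealed comparative advantage."""
    return (country_product / country_total) / (world_product / world_total)


def export_similarity_index(exports_a, exports_b):
    """Finger-Kreinin index (assumed form): sum over products of the
    smaller of the two countries' export shares. Ranges from 0
    (no overlap) to 1 (identical export structure).

    exports_a and exports_b map a product code to an export value.
    """
    total_a = sum(exports_a.values())
    total_b = sum(exports_b.values())
    products = set(exports_a) | set(exports_b)
    return sum(
        min(exports_a.get(k, 0.0) / total_a, exports_b.get(k, 0.0) / total_b)
        for k in products
    )
```

With these definitions, two countries exporting the same products in the same proportions score 1 on the similarity index regardless of their total export volumes, which is why the index is usually read as a measure of competition in structure rather than in size.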

Keywords: export competitiveness, export similarity index, revealed comparative advantage, unit value realisation

Procedia PDF Downloads 212
3109 Unsteady Similarity Solution for a Slender Dry Patch in a Thin Newtonian Fluid Film

Authors: S. S. Abas, Y. M. Yatim

Abstract:

In this paper, an unsteady, slender, symmetric dry patch in an infinitely wide and thin liquid film of Newtonian fluid draining under gravity down an inclined plane in the presence of a strong surface-tension effect is considered. A similarity transformation, namely a travelling-wave similarity solution, is used to reduce the governing partial differential equation to an ordinary differential equation, which is then solved numerically using a shooting method. The introduction of the surface-tension effect on the flow leads to a fourth-order ordinary differential equation. The solution obtained predicts that the dry patch has a quartic shape and that the free surface has a capillary ridge near the contact line, which decays in an oscillatory manner far from it.
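
The shooting method mentioned in the abstract can be illustrated on a much simpler problem. The sketch below does not use the authors' fourth-order thin-film equation, which the abstract does not give; it assumes a toy second-order boundary value problem, y'' = -y with y(0) = 0 and y(π/2) = 1, and all function names are hypothetical. The unknown initial slope is adjusted with a secant iteration until the integrated solution meets the far boundary condition.

```python
import math


def rk4_integrate(f, x0, state0, x1, steps=200):
    """Integrate the first-order system state' = f(x, state) from x0 to x1
    with the classical fourth-order Runge-Kutta scheme."""
    h = (x1 - x0) / steps
    x, y = x0, list(state0)
    for _ in range(steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        x += h
    return y


def shoot(f, x0, y0, x1, y1_target, s_a, s_b, tol=1e-9, max_iter=50):
    """Find the initial slope s such that y(x1) = y1_target, starting from
    two guesses s_a and s_b and refining them by the secant method."""
    def residual(s):
        return rk4_integrate(f, x0, [y0, s], x1)[0] - y1_target

    r_a, r_b = residual(s_a), residual(s_b)
    for _ in range(max_iter):
        if abs(r_b) < tol:
            return s_b
        if r_b == r_a:
            break  # secant step undefined; give up below
        s_a, s_b, r_a = s_b, s_b - r_b * (s_b - s_a) / (r_b - r_a), r_b
        r_b = residual(s_b)
    raise RuntimeError("shooting iteration did not converge")


def harmonic(x, state):
    """Toy problem y'' = -y written as a first-order system [y, y']."""
    y, dy = state
    return [dy, -y]
```

For this toy problem the exact solution is y = sin(x), so calling `shoot(harmonic, 0.0, 0.0, math.pi / 2, 1.0, 0.0, 2.0)` should return a slope close to 1. A fourth-order equation would typically require shooting on more than one unknown initial value.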

Keywords: dry patch, Newtonian fluid, similarity solution, surface-tension effect, travelling-wave, unsteady thin-film flow

Procedia PDF Downloads 224