A Distance Function for Data with Missing Values and Its Application
Authors: Loai AbdAllah, Ilan Shimshoni
Abstract:
Missing values are common in real-world data. Since the performance of many data mining algorithms depends critically on being given a good metric over the input space, in this paper we define a distance function for unlabeled datasets with missing values. Our new distance function is based on the Bhattacharyya distance, which measures the similarity of two probability distributions. Under this definition, the distance between two points with no missing attribute values is simply the Mahalanobis distance. When, on the other hand, one of the coordinates has a missing value, the distance is computed according to the distribution of the missing coordinate. Our distance is general and can be used as part of any algorithm that computes distances between data points. Because its performance depends strongly on the chosen distance measure, we chose the k nearest neighbor (kNN) classifier to evaluate the ability of our distance to accurately reflect object similarity. We experimented on standard numerical datasets from different fields taken from the UCI repository. On these datasets we simulated missing values and compared the performance of the kNN classifier using our distance to three other basic methods. Our experiments show that kNN using our distance function outperforms kNN using the other methods, while the runtime of our method is only slightly higher.
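The idea in the abstract can be illustrated with a minimal sketch: use a (diagonal) Mahalanobis contribution for coordinates observed in both points, and replace the contribution of a missing coordinate by its expected squared deviation under a per-coordinate Gaussian model estimated from the data. Note this is only an illustrative sketch in the spirit of the paper; the function name, the Gaussian/diagonal-covariance assumption, and the expectation formula are assumptions, not the authors' exact Bhattacharyya-based derivation.

```python
import numpy as np

def distance_with_missing(p, q, mean, var):
    """Illustrative distance for vectors with NaN-encoded missing values.

    Assumes a diagonal covariance, so the Mahalanobis distance reduces
    to a variance-scaled Euclidean distance, and a Gaussian model
    N(mean[i], var[i]) per coordinate (hypothetical modeling choice,
    not the paper's exact formula).
    """
    d2 = 0.0
    for i in range(len(p)):
        pi, qi = p[i], q[i]
        pi_nan, qi_nan = np.isnan(pi), np.isnan(qi)
        if pi_nan and qi_nan:
            # Both missing: expected squared distance of two independent
            # draws from N(mean[i], var[i]) is 2 * var[i]; scaled by var[i].
            d2 += 2.0
        elif pi_nan or qi_nan:
            # One missing: E[(X - obs)^2] = (mean[i] - obs)^2 + var[i]
            # for X ~ N(mean[i], var[i]); scaled by var[i].
            obs = qi if pi_nan else pi
            d2 += ((mean[i] - obs) ** 2 + var[i]) / var[i]
        else:
            # Both observed: diagonal Mahalanobis contribution.
            d2 += (pi - qi) ** 2 / var[i]
    return np.sqrt(d2)
```

With no missing entries the function reduces to the variance-scaled Euclidean distance, so it can be dropped into any algorithm (such as kNN) that accepts a pairwise distance function.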
Keywords: Missing values, Distance metric, Bhattacharyya distance.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1088404
References:
[1] Gustavo Batista and Maria Carolina Monard. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519–533, 2003.
[2] I. Kononenko, I. Bratko, and E. Roskar. Experiments in automatic learning of medical diagnostic rules. Technical Report, Jozef Stefan Institute, Ljubljana, Yugoslavia, 1984.
[3] Krzysztof J Cios and Lukasz A Kurgan. Trends in data mining and knowledge discovery. Advanced techniques in knowledge discovery and data mining, pages 1–26, 2005.
[4] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
[5] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[6] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[7] Jerzy Grzymala-Busse and Ming Hu. A comparison of several approaches to missing attribute values in data mining. In Rough sets and current trends in computing, pages 378–385. Springer, 2001.
[8] Matteo Magnani. Techniques for dealing with missing data in knowledge discovery tasks. Retrieved from http://magnanim.web.cs.unibo.it/index.html, 15(01):2007, 2004.
[9] Nambiraj Suguna and Keppana G Thanushkodi. Predicting missing attribute values using k-means clustering. Journal of Computer Science, 7(2):216–224, 2011.
[10] Shichao Zhang. Shell-neighbor method and its application in missing data imputation. Applied Intelligence, 35(1):123–133, 2011.