Missing values are common in real-world data. Since the performance of many data mining algorithms depends critically on being given a good metric over the input space, in this paper we define a distance function for unlabeled datasets with missing values. We build on the Bhattacharyya distance, which measures the similarity of two probability distributions. Under our definition, the distance between two points with no missing attribute values is simply the Mahalanobis distance; when one of the coordinates is missing, the distance is computed according to the distribution of the missing coordinate. Our distance is general and can be used as part of any algorithm that computes distances between data points. Because its performance depends strongly on the chosen distance measure, we chose the k-nearest-neighbor (kNN) classifier to evaluate how accurately the distance reflects object similarity. We experimented on standard numerical datasets from the UCI repository, drawn from different fields. On these datasets we simulated missing values and compared the performance of the kNN classifier using our distance against three other basic methods. Our experiments show that kNN using our distance function outperforms kNN using the other methods, while its runtime is only slightly higher.
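The abstract's exact Bhattacharyya-based formula is not given here, but the idea it describes can be sketched: use a (Mahalanobis-style) distance on fully observed coordinate pairs, and replace a missing coordinate's contribution by its expected squared difference under a distribution fitted to that coordinate. The sketch below is a simplified, diagonal-covariance illustration of that idea (Gaussian per-coordinate assumption, names `fit_stats` and `dist_missing` are hypothetical), not the paper's actual distance:

```python
import numpy as np

def fit_stats(X):
    """Per-coordinate mean and variance estimated from observed (non-NaN) entries."""
    mu = np.nanmean(X, axis=0)
    var = np.nanvar(X, axis=0)
    return mu, var

def dist_missing(x, y, mu, var):
    """Diagonal Mahalanobis-style distance tolerating missing (NaN) coordinates.

    For an observed pair the contribution is (x_j - y_j)^2 / var_j.
    If x_j is missing, we take the expected squared difference under
    N(mu_j, var_j): ((mu_j - y_j)^2 + var_j) / var_j.
    If both are missing (independent draws), the expectation is 2*var_j / var_j = 2.
    This is a simplified sketch, not the paper's exact Bhattacharyya-based rule.
    """
    total = 0.0
    for j in range(len(x)):
        xj, yj = x[j], y[j]
        if np.isnan(xj) and np.isnan(yj):
            total += 2.0
        elif np.isnan(xj):
            total += ((mu[j] - yj) ** 2 + var[j]) / var[j]
        elif np.isnan(yj):
            total += ((mu[j] - xj) ** 2 + var[j]) / var[j]
        else:
            total += (xj - yj) ** 2 / var[j]
    return float(np.sqrt(total))
```

Note that when no coordinates are missing this reduces to a variance-weighted Euclidean distance, so such a function can be dropped into any kNN implementation that accepts a custom metric.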
Publisher: World Academy of Science, Engineering and Technology. Open Science Index 82, 2013.