Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance
Abstract: Missing values are a common problem in real-world datasets. Many algorithms have been developed to deal with this problem; most of them replace the missing values with a fixed value computed from the observed values. In our work, we use a distance function based on the Bhattacharyya distance, which measures the similarity of two probability distributions, to measure the distance between objects with missing values. The proposed distance distinguishes between known and unknown values: the distance between two known values is the Mahalanobis distance, whereas when one of the values is missing, the distance is computed from the distribution of the known values for the coordinate that contains the missing value. This method was integrated with Wikaya, a digital health company developing a platform that helps improve the prevention of chronic diseases such as diabetes and cancer. For Wikaya’s recommendation system to work, distances between users need to be measured. Since the collected data contain missing values, a distance function is needed that measures distances between incomplete user profiles. To evaluate how accurately the proposed distance function reflects the actual similarity between objects when some of them contain missing values, we integrated it within the framework of the k nearest neighbors (kNN) classifier, since its computation is based only on the similarity between objects. To validate this, we ran the algorithm over diabetes and breast cancer datasets, standard benchmark datasets from the UCI repository. Our experiments show that the kNN classifier using our proposed distance function outperforms kNN using other existing methods.
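The idea of treating known and unknown values differently inside a kNN classifier can be illustrated with a minimal Python sketch. It is not the paper's Bhattacharyya-based formula, which the abstract does not spell out. As a stand-in, it assumes a diagonal covariance (so the Mahalanobis term reduces to a variance-scaled squared difference) and, when a value is missing, uses the expected squared difference under the empirical distribution of that coordinate. The function names `column_stats`, `coord_dist2`, `distance`, and `knn_predict`, and the toy data, are hypothetical.

```python
import math
from collections import Counter

def column_stats(rows):
    """Mean and population variance of the observed (non-None) values per coordinate."""
    stats = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        if not known:                      # coordinate never observed: carries no information
            stats.append((0.0, 0.0))
            continue
        mean = sum(known) / len(known)
        var = sum((v - mean) ** 2 for v in known) / len(known)
        stats.append((mean, var))
    return stats

def coord_dist2(x, y, mean, var):
    """Squared distance in one coordinate, scaled by that coordinate's variance."""
    if var == 0:
        return 0.0
    if x is not None and y is not None:
        # Both known: Mahalanobis term under the assumed diagonal covariance.
        return (x - y) ** 2 / var
    if x is None and y is None:
        # Both missing: E[(X - Y)^2] = 2 * var for independent draws (assumption).
        return 2.0
    known = x if x is not None else y
    # One missing: E[(known - Y)^2] = (known - mean)^2 + var (assumption).
    return ((known - mean) ** 2 + var) / var

def distance(a, b, stats):
    """Distance between two possibly incomplete objects."""
    return math.sqrt(sum(coord_dist2(x, y, m, v)
                         for x, y, (m, v) in zip(a, b, stats)))

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training objects closest to the query."""
    stats = column_stats(train_rows)
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: distance(pair[0], query, stats))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration); None marks a missing value.
train = [[1.0, 10.0], [1.2, None], [5.0, 50.0], [None, 52.0]]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, [1.1, None], k=3))
```

Because no value is imputed, an object with a missing coordinate is compared against the whole observed distribution of that coordinate rather than against a single substituted number, which is the contrast with fixed-value replacement that the abstract draws.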
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1340420