The Classification Performance in Parametric and Nonparametric Discriminant Analysis for a Class-Unbalanced Data of Diabetes Risk Groups

Authors: Lily Ingsrisawang, Tasanee Nacharoen

Abstract:

Unbalanced data sets are common in real-world applications. Because the class distribution is unequal, many researchers have found that existing classifiers tend to be biased towards the majority class. Nonparametric discriminant analysis based on k-nearest neighbors has been proposed as a method that classifies unbalanced classes with good performance. This study investigates the misclassification error rates of discriminant analysis methods on class-imbalanced data from three diabetes risk groups. Its purpose was to compare the classification performance of parametric and nonparametric discriminant analysis in a three-class classification of class-imbalanced data. The data came from a project on maintaining healthy conditions among 599 employees of a government hospital in Bangkok. The employees were divided into three diabetes risk groups: non-risk (90%), risk (5%), and diabetic (5%). The original data, comprising the variables diabetes risk group, age, gender, blood glucose, and BMI, were analyzed and also bootstrapped into 50 and 100 samples of 599 observations each, to provide additional estimates of the misclassification error rate. Each data set was examined for departure from multivariate normality and for equality of the covariance matrices of the three risk groups. Both the original data and the bootstrap samples showed non-normality and unequal covariance matrices. The parametric linear discriminant function, the quadratic discriminant function, and the nonparametric k-nearest neighbors discriminant function were run on the 50 and 100 bootstrap samples and applied to the original data. To search for the optimal classification rule, the prior probabilities were set to equal proportions (0.33:0.33:0.33) and to unequal proportions of (0.90:0.05:0.05), (0.80:0.10:0.10), and (0.70:0.15:0.15).
The results from the 50 and 100 bootstrap samples indicated that the k-nearest neighbors approach with k = 3 or k = 4, and with prior probabilities for non-risk:risk:diabetic set to 0.90:0.05:0.05 or 0.80:0.10:0.10, gave the smallest misclassification error rate. The k-nearest neighbors approach is therefore suggested for classifying three-class imbalanced data of diabetes risk groups.
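
The procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual k-nearest neighbors discriminant rule, in which the posterior probability of class t is proportional to prior_t * k_t / n_t (k_t of the k nearest neighbors belong to class t, which has n_t training observations). It also assumes plain Euclidean distance, and it fits on each bootstrap sample and scores the original data, which is only one reading of the abstract. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def knn_posteriors(train_x, train_y, x, k, priors):
    """Posterior class probabilities at x from the k-nearest-neighbor rule.

    Each class density is estimated as k_t / n_t (the volume term is the
    same for every class and cancels), then weighted by the class prior.
    """
    class_sizes = Counter(train_y)
    # Indices of the k training points closest to x (Euclidean distance,
    # an assumption of this sketch).
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    neighbor_counts = Counter(train_y[i] for i in nearest)
    scores = {t: priors[t] * neighbor_counts[t] / class_sizes[t]
              for t in class_sizes}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


def classify(train_x, train_y, x, k, priors):
    """Assign x to the class with the largest posterior probability."""
    post = knn_posteriors(train_x, train_y, x, k, priors)
    return max(post, key=post.get)


def error_rate(train_x, train_y, test_x, test_y, k, priors):
    """Proportion of test observations assigned to the wrong class."""
    wrong = sum(classify(train_x, train_y, x, k, priors) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_y)


def bootstrap_error_rate(xs, ys, k, priors, n_boot=50, seed=0):
    """Average error rate over n_boot bootstrap samples.

    Each sample draws len(xs) observations with replacement, serves as the
    training set, and is scored against the original data (an assumption).
    """
    rng = random.Random(seed)
    n = len(xs)
    rates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [xs[i] for i in idx]
        boot_y = [ys[i] for i in idx]
        rates.append(error_rate(boot_x, boot_y, xs, ys, k, priors))
    return sum(rates) / n_boot
```

As a hypothetical one-dimensional example, take training points 0, 1, 2 in class "a" and 10 in class "b", and classify the point 6.5 with k = 2. The two nearest neighbors are one from each class, so equal priors give class "b" a posterior of 0.75 (it has fewer training observations), whereas priors of 0.9:0.1 give class "a" a posterior of 0.75. This shows how the choice of prior probabilities can change the classification rule on unbalanced data.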

Keywords: Bootstrap, diabetes risk groups, error rate, k-nearest neighbors.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1106287

