Evaluation of the Impact of Dataset Characteristics for Classification Problems in Biological Applications
Authors: Kanthida Kusonmano, Michael Netzer, Bernhard Pfeifer, Christian Baumgartner, Klaus R. Liedl, Armin Graber
Abstract:
The availability of high-dimensional biological datasets, such as those from gene expression, proteomic, and metabolic experiments, can be leveraged for the diagnosis and prognosis of diseases. Many classification methods have been studied in this area to predict disease states and to separate predefined classes, such as patients with a specific disease versus healthy controls. However, most existing research focuses on a single dataset. A generic comparison of classifiers, which could guide biologists and bioinformaticians in selecting a suitable algorithm for new datasets, is lacking. In this study, we compare the performance of popular classifiers, namely Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest, on mock datasets. We mimic common biological scenarios by simulating various proportions of truly discriminating biomarkers and different effect sizes. The results show that SVM performs stably and reaches a higher area under the ROC curve (AUC) than the other methods. This may be explained by the ability of SVM to minimize the probability of error. Moreover, Decision Tree, which is well suited to diagnosis and prognosis, performs well in our experimental setup. Logistic Regression and Random Forest, however, depend strongly on the ratio of discriminators and perform better when the number of discriminators is higher.
Keywords: Classification, High dimensional data, Machine learning
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1327927
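The simulation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian features with unit variance, the mean shift used as "effect size", the sample sizes, and all function names (`make_mock_dataset`, `knn_scores`, `auc`, `evaluate`) are assumptions made for illustration. Only k-NN, one of the six classifiers compared, is implemented here, using the standard library alone; AUC is computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one.

```python
import random

def make_mock_dataset(n_per_class=40, n_features=50, frac_discriminators=0.1,
                      effect_size=1.0, seed=0):
    """Two-class Gaussian mock data (assumed model, not taken from the paper).

    The first round(n_features * frac_discriminators) features are shifted
    by effect_size in class 1; all remaining features are pure noise.
    """
    rng = random.Random(seed)
    n_disc = round(n_features * frac_discriminators)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(effect_size if (label == 1 and j < n_disc) else 0.0, 1.0)
                   for j in range(n_features)]
            X.append(row)
            y.append(label)
    return X, y

def knn_scores(X_train, y_train, X_test, k=5):
    """Score each test sample by the fraction of class-1 labels among its k nearest neighbours."""
    scores = []
    for x in X_test:
        # Squared Euclidean distance to every training sample, paired with its label.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, xt)), yt)
            for xt, yt in zip(X_train, y_train)
        )
        scores.append(sum(yt for _, yt in dists[:k]) / k)
    return scores

def auc(y_true, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(frac_discriminators, effect_size, seed=0):
    """Train on one mock sample, test on an independent one drawn with the same settings."""
    X_tr, y_tr = make_mock_dataset(frac_discriminators=frac_discriminators,
                                   effect_size=effect_size, seed=seed)
    X_te, y_te = make_mock_dataset(frac_discriminators=frac_discriminators,
                                   effect_size=effect_size, seed=seed + 1)
    return auc(y_te, knn_scores(X_tr, y_tr, X_te))

if __name__ == "__main__":
    # Sweep the two dataset characteristics named in the abstract.
    for frac in (0.02, 0.1, 0.5):
        for effect in (0.5, 1.0, 2.0):
            print(f"discriminators={frac:.2f} effect={effect:.1f} "
                  f"AUC={evaluate(frac, effect):.3f}")
```

Running the script prints one AUC per combination of discriminator proportion and effect size. Adding further classifiers to the sweep would allow the kind of side-by-side comparison the study reports.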