An Analysis of Classification of Imbalanced Datasets by Using Synthetic Minority Over-Sampling Technique

Ghada A. Alfattni

Abstract:

Analysing imbalanced datasets is one of the challenges that practitioners in the machine learning field face. Many studies have examined how effectively the synthetic minority over-sampling technique (SMOTE) addresses this issue. The aim of this study was to compare the effectiveness of SMOTE across different models on imbalanced datasets. Three classification models (Logistic Regression, Support Vector Machine and Nearest Neighbour) were tested on multiple datasets. The same datasets were then oversampled with SMOTE and passed to the three models again, so that the differences in performance could be compared. The experimental results show that the highest number of nearest neighbours gives lower error rates.
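
To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea as described by Chawla et al. (2002): each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is not the code used in this study. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the base sample is chosen at random instead of generating a fixed number of points per sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples

    # For each minority sample, store the indices of its k nearest neighbours.
    neighbours = []
    for i, p in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(p, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # random base sample (simplification)
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        base, neighbour = minority[i], minority[j]
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy imbalanced data, made up for illustration: 20 majority, 4 minority points.
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
minority = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints: 20 20
```

In a comparison like the one described above, a classifier would be trained once on the original data and once on the data with the synthetic minority points added, and the two error rates compared.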

Keywords: Imbalanced datasets, SMOTE, machine learning, logistic regression, support vector machine, nearest neighbour.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1124621
