Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features

Authors: Mehmet Hacibeyoglu, Ahmet Arslan, Sirzat Kahramanli


This study analyzes the effect of discretization on the classification of datasets containing continuous-valued features. Six datasets with continuous-valued features from the UCI repository are discretized with the entropy-based discretization method. The performance of the datasets with original features and with discretized features is compared using the k-nearest neighbors, Naive Bayes, C4.5, and CN2 data mining classification algorithms. As a result, the classification accuracies of the six datasets are improved on average by 1.71% to 12.31%.
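The entropy-based discretization method the abstract refers to selects cut points on a continuous feature that minimize the class entropy of the resulting intervals. A minimal Python sketch of finding a single binary cut point is given below; it is an illustrative simplification, not the paper's exact procedure (the full method applies this step recursively with an MDL stopping criterion), and the function name and example data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return the cut point on a continuous feature that minimizes the
    weighted average class entropy of the induced binary partition."""
    pairs = sorted(zip(values, labels))
    best_cut, best_ent = None, float("inf")
    for i in range(1, len(pairs)):
        # Candidate cuts lie midway between consecutive distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weighted average entropy of the two intervals.
        ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if ent < best_ent:
            best_cut, best_ent = cut, ent
    return best_cut, best_ent

# Example: low feature values belong to class 0, high values to class 1.
values = [1.0, 1.5, 2.0, 7.0, 7.5, 8.0]
labels = [0, 0, 0, 1, 1, 1]
cut, ent = best_cut_point(values, labels)  # cut = 4.5, both intervals pure
```

On this toy feature the best cut falls at 4.5, splitting the data into two pure intervals with zero weighted entropy; the classifiers named above then treat the two intervals as discrete feature values.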

Keywords: Data mining classification algorithms, entropy-based discretization method



