Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features

Authors: Mehmet Hacibeyoglu, Ahmet Arslan, Sirzat Kahramanli


This study analyzes the effect of discretization on the classification of datasets containing continuous-valued features. Six datasets with continuous-valued features from the UCI repository are discretized with the entropy-based discretization method. The performance of the datasets with original features and with discretized features is compared using the k-nearest neighbors, Naive Bayes, C4.5, and CN2 data mining classification algorithms. As a result, the classification accuracies of the six datasets are improved on average by 1.71% to 12.31%.
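The entropy-based discretization method the abstract refers to selects cut points on a continuous feature that minimize the class entropy of the resulting intervals. A minimal Python sketch of finding a single binary cut point is given below; it is an illustrative simplification, not the paper's exact procedure (the full method applies this step recursively with an MDL stopping criterion), and the function name and example data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return the cut point on a continuous feature that minimizes the
    weighted average class entropy of the induced binary partition."""
    pairs = sorted(zip(values, labels))
    best_cut, best_ent = None, float("inf")
    for i in range(1, len(pairs)):
        # Candidate cuts lie midway between consecutive distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weighted average entropy of the two intervals.
        ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if ent < best_ent:
            best_cut, best_ent = cut, ent
    return best_cut, best_ent

# Example: low feature values belong to class 0, high values to class 1.
values = [1.0, 1.5, 2.0, 7.0, 7.5, 8.0]
labels = [0, 0, 0, 1, 1, 1]
cut, ent = best_cut_point(values, labels)  # cut = 4.5, both intervals pure
```

On this toy feature the best cut falls at 4.5, splitting the data into two pure intervals with zero weighted entropy; the classifiers named above then treat the two intervals as discrete feature values.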

Keywords: Data mining classification algorithms, entropy-based discretization method



