Data Preprocessing for Supervised Learning

Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas


Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data are first and foremost. If the data contain much irrelevant and redundant information, or are noisy and unreliable, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be convenient if a single sequence of data pre-processing algorithms performed best on every data set, but this is not the case. Thus, we present the best-known algorithms for each step of data pre-processing so that practitioners can achieve the best performance on their own data set.
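To make the notion of a pre-processing sequence concrete, the following is a minimal Python sketch, not taken from the paper, that chains three of the steps named above: cleaning (mean imputation of missing values), a very crude form of feature selection (dropping constant columns), and min-max normalization. The function names, the toy data, and the order of the steps are all assumptions made for illustration; as the abstract notes, no single sequence is best for every data set.

```python
from statistics import mean

def impute_missing(rows):
    """Data cleaning (assumed strategy): replace None with the column mean."""
    cols = list(zip(*rows))
    # Raises StatisticsError if a column has no observed values at all.
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def drop_constant_features(rows):
    """Feature selection (crude filter): drop columns that never vary."""
    cols = list(zip(*rows))
    keep = [j for j, col in enumerate(cols) if max(col) != min(col)]
    return [[row[j] for j in keep] for row in rows], keep

def min_max_normalize(rows):
    """Normalization: rescale each column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - min(col) for col in cols]
    # A constant column (span 0) is mapped to 0.0 to avoid division by zero.
    return [[(v - lows[j]) / spans[j] if spans[j] else 0.0
             for j, v in enumerate(row)]
            for row in rows]

def preprocess(rows):
    """One hypothetical ordering of the steps; other orderings are possible."""
    cleaned = impute_missing(rows)
    selected, kept = drop_constant_features(cleaned)
    return min_max_normalize(selected), kept

# Toy data set: three instances, three numeric features, one missing value.
raw = [
    [1.0, 10.0, 5.0],
    [2.0, None, 5.0],
    [3.0, 30.0, 5.0],
]
training_set, kept_columns = preprocess(raw)
print(training_set)   # the final training set after pre-processing
print(kept_columns)   # indices of the original features that were retained
```

Here the third feature is constant and is therefore discarded, while the missing value in the second feature is filled in before scaling. Swapping the order of the steps, or replacing any one of them with a different algorithm, can change the resulting training set, which is why the choice has to be made per data set.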

Keywords: Data mining, feature selection, data cleaning.


