Feature Selection and Predictive Modeling of Housing Data Using Random Forest
Author: Bharatendra Rai
Abstract:
Predictive data analysis and modeling with machine learning techniques becomes challenging in the presence of too many explanatory variables, or features. Too many features are known not only to slow algorithms down but also to reduce model prediction accuracy. This study uses a housing dataset with 79 quantitative and qualitative features that describe various aspects people consider when buying a new house. The Boruta algorithm, which performs feature selection through a wrapper approach built around random forest, is applied to this dataset. The feature selection process yields 49 confirmed features, which are then used to develop predictive random forest models. The study also explores five different data partitioning ratios, and their impact on model accuracy is captured using the coefficient of determination (R-squared) and the root mean square error (RMSE).
Keywords: Housing data, feature selection, random forest, Boruta algorithm, root mean square error.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1130301
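The pipeline the abstract describes can be sketched in a few lines of Python. The block below is a minimal illustration under stated assumptions, not the author's code: it assumes a CSV file named housing.csv with a SalePrice target column, uses the open-source BorutaPy reimplementation of the Boruta algorithm in place of the R package, and picks five illustrative train/test ratios; the file name, column name, split ratios, and hyperparameters are all placeholders rather than values from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from boruta import BorutaPy

# Load the housing data; "housing.csv" and the "SalePrice" target are
# placeholders for the actual dataset used in the study.
df = pd.read_csv("housing.csv")
X = pd.get_dummies(df.drop(columns=["SalePrice"]))  # encode qualitative features
X = X.fillna(0)  # crude imputation, for illustration only
y = df["SalePrice"].values

# Step 1: Boruta feature selection, a wrapper built around random forest.
rf_selector = RandomForestRegressor(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf_selector, n_estimators="auto", random_state=42)
boruta.fit(X.values, y)  # BorutaPy expects numpy arrays
confirmed = X.columns[boruta.support_]
print(f"{confirmed.size} confirmed features")

# Step 2: fit random forest models on the confirmed features for several
# train/test partitioning ratios, recording R-squared and RMSE for each.
for train_frac in (0.5, 0.6, 0.7, 0.8, 0.9):  # illustrative ratios only
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[confirmed].values, y, train_size=train_frac, random_state=42
    )
    model = RandomForestRegressor(n_estimators=500, random_state=42, n_jobs=-1)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"train={train_frac:.0%}  R2={r2_score(y_te, pred):.3f}  RMSE={rmse:,.0f}")
```

Looping over the partition ratios with a fixed random seed keeps the comparison of R-squared and RMSE across splits attributable to the ratio itself rather than to resampling noise.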