Feature Selection with Kohonen Self Organizing Classification Algorithm

Francesco Maiorana

Abstract:

This paper presents a one-dimensional Self Organizing Map (SOM) algorithm for feature selection. The algorithm first classifies the input dataset in a similarity space. From this classification, a set of positive and negative features is computed for each class, and these sets are selected as the result of the procedure. The procedure is evaluated on an in-house dataset from a Knowledge Discovery from Text (KDT) application and on a set of publicly available datasets used in international feature selection competitions. These datasets come from KDT, drug discovery, and other applications. The correct classification available for the training and validation datasets is used to optimize the parameters for positive and negative feature extraction. The process becomes feasible for large, sparse datasets, such as those obtained in KDT applications, by combining compression techniques for storing the similarity matrix with speed-up techniques for the Kohonen algorithm that exploit the sparsity of the input matrix. With these improvements, and by using the grid, the methodology can be applied to massive datasets.
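
As a rough illustration of the procedure outlined in the abstract, the following Python sketch builds a similarity space, trains a one-dimensional Kohonen map on it, and derives positive and negative features for each resulting class. The abstract does not specify the similarity measure, the training schedule, or the rule that defines positive and negative features, so the cosine similarity, the linear decay of learning rate and neighbourhood width, and the mean-difference margins (`pos_margin`, `neg_margin`) are all assumptions made for illustration, as are the function names. Dense arrays are used for brevity; the compression and sparsity speed-ups mentioned in the abstract are not reproduced.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Row-wise cosine similarity (an assumed choice of similarity measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T


def train_som_1d(S, n_units, n_epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional Kohonen map on the rows of S."""
    rng = np.random.default_rng(seed)
    n_samples = S.shape[0]
    # Initialise unit weights from randomly chosen input rows.
    idx = rng.choice(n_samples, size=n_units, replace=n_units > n_samples)
    W = S[idx].astype(float)
    positions = np.arange(n_units)
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1.0 - frac)                          # linearly decaying learning rate
        sigma = max(n_units / 2.0 * (1.0 - frac), 0.5)   # shrinking neighbourhood width
        for i in rng.permutation(n_samples):
            # Best matching unit: the unit whose weights are closest to the sample.
            bmu = np.argmin(np.linalg.norm(W - S[i], axis=1))
            # Gaussian neighbourhood along the one-dimensional chain of units.
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (S[i] - W)
    return W


def assign_units(S, W):
    """Label each row of S with the index of its best matching unit."""
    dists = np.linalg.norm(S[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def select_features(X, labels, pos_margin=0.2, neg_margin=0.2):
    """Hypothetical rule: a feature is positive (negative) for a class when its
    mean inside the class is above (below) its mean outside by the given margin."""
    selected = {}
    for c in np.unique(labels):
        inside = X[labels == c].mean(axis=0)
        rest = X[labels != c]
        outside = rest.mean(axis=0) if len(rest) else np.zeros(X.shape[1])
        diff = inside - outside
        selected[int(c)] = {
            "positive": np.flatnonzero(diff >= pos_margin),
            "negative": np.flatnonzero(diff <= -neg_margin),
        }
    return selected


# Toy document-by-feature matrix (made-up data, for illustration only).
X_demo = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

S_demo = cosine_similarity_matrix(X_demo)
W_demo = train_som_1d(S_demo, n_units=2)
labels_demo = assign_units(S_demo, W_demo)
features_demo = select_features(X_demo, labels_demo)
```

In this sketch the two margins play the role of the parameters that, according to the abstract, are tuned using the known classification of the training and validation sets; how that tuning is actually carried out is not described here and is left out.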

Keywords: Clustering algorithm, Data mining, Feature selection, Grid, Kohonen Self Organizing Map.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1078959

