A Hybrid Approach for Selection of Relevant Features for Microarray Datasets
Authors: R. K. Agrawal, Rajni Bala
Abstract:
Developing an accurate classifier for high-dimensional microarray datasets is challenging because only a small number of samples is available. It is therefore important to determine a set of relevant genes that classify the data well. Traditional gene selection methods often select the top-ranked genes according to their discriminatory power. These genes are often correlated with each other, which results in redundancy. In this paper, we propose a hybrid method that combines feature ranking with a wrapper method (a genetic algorithm with a multiclass SVM) to identify a set of relevant genes that classify the data more accurately. A new fitness function for the genetic algorithm is defined that focuses on selecting the smallest set of genes that provides maximum accuracy. Experiments have been carried out on four well-known datasets. The proposed method provides better results than those reported in the literature in terms of both classification accuracy and number of genes selected.
Keywords: Gene selection, genetic algorithm, microarray datasets, multi-class SVM.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1071300
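The abstract describes the fitness function only by its goal: the smallest gene subset that still gives maximum accuracy. The sketch below is one plausible form under that description, not the authors' formula. The names `subset_fitness`, `toy_accuracy`, and `size_weight` are hypothetical, and `toy_accuracy` is a stand-in for the cross-validated accuracy of a multiclass SVM trained on the selected genes.

```python
from typing import Callable, Sequence


def subset_fitness(
    mask: Sequence[int],
    accuracy_fn: Callable[[Sequence[int]], float],
    size_weight: float = 0.01,
) -> float:
    """Score a gene subset encoded as a 0/1 mask over the ranked candidate genes.

    Assumed form (not taken from the paper): accuracy minus a small penalty
    proportional to the fraction of candidate genes that are selected.
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty subset cannot classify anything
    size_penalty = size_weight * n_selected / len(mask)
    return accuracy_fn(mask) - size_penalty


def toy_accuracy(mask: Sequence[int]) -> float:
    """Stand-in evaluator: pretend genes 0 and 3 together separate the classes."""
    return 1.0 if mask[0] and mask[3] else 0.6


# Three chromosomes a genetic algorithm might hold in its population.
candidates = [
    (1, 1, 1, 1, 1, 1),  # all genes: accurate but redundant
    (1, 0, 0, 1, 0, 0),  # only the two informative genes
    (0, 1, 1, 0, 0, 0),  # small but uninformative
]

best = max(candidates, key=lambda m: subset_fitness(m, toy_accuracy))
print(best)  # (1, 0, 0, 1, 0, 0)
```

Keeping `size_weight` small means the penalty only breaks ties between subsets of similar accuracy, so a smaller subset is preferred without trading away classification performance.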