Novel Hybrid Method for Gene Selection and Cancer Prediction

Authors: Liping Jing, Michael K. Ng, Tieyong Zeng


Microarray data profile gene expression on a whole-genome scale and therefore provide a good way to study associations between gene expression and the occurrence or progression of cancer. A growing number of researchers have recognized that microarray data are helpful for predicting cancer samples. However, the number of genes is much larger than the sample size, which makes this task very difficult. Identifying the significant genes that cause cancer has therefore become an urgent, popular, and difficult research topic. Many feature selection algorithms proposed in the past focus on improving cancer prediction accuracy at the expense of ignoring the correlations between features. In this work, a novel framework (named SGS) is presented for stable gene selection and efficient cancer prediction. The proposed framework first applies a clustering algorithm to find gene groups in which the genes are highly correlated, then selects the significant genes in each group with the Bayesian Lasso and the important gene groups with the group Lasso, and finally builds a prediction model on the shrunken gene space with an efficient classification algorithm (such as SVM, 1NN, or regression). Experimental results on real-world data show that the proposed framework often outperforms existing feature selection and prediction methods such as SAM, IG, and Lasso-type prediction models.

Keywords: Gene Selection, Cancer Prediction, Lasso, Clustering, Classification.
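
The three stages described in the abstract (group correlated genes, select genes within and across groups, classify on the reduced gene space) can be illustrated with a minimal sketch. This is not the authors' SGS implementation. It assumes synthetic data, a greedy correlation-threshold grouping in place of the unspecified clustering algorithm, and a plain Lasso solved by coordinate descent in place of both the Bayesian Lasso and the group Lasso. All function names, the threshold of 0.7, and the penalty of 0.2 are hypothetical choices made for illustration.

```python
import numpy as np

def make_data(n=120, n_groups=5, group_size=4, seed=0):
    """Synthetic expression data: genes in one group share a latent factor."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, n_groups))
    noise = 0.3 * rng.normal(size=(n, n_groups * group_size))
    X = np.repeat(latent, group_size, axis=1) + noise
    # Only groups 0 and 1 drive the class label (coded as +1 / -1).
    y = np.where(latent[:, 0] - latent[:, 1] > 0, 1.0, -1.0)
    return X, y

def correlation_groups(X, threshold=0.7):
    """Greedy stand-in for clustering: a gene joins the first group whose
    seed gene it correlates with (in absolute value) above the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    seeds, labels = [], np.empty(X.shape[1], dtype=int)
    for j in range(X.shape[1]):
        for g, s in enumerate(seeds):
            if corr[j, s] >= threshold:
                labels[j] = g
                break
        else:  # no existing group matched: start a new one
            seeds.append(j)
            labels[j] = len(seeds) - 1
    return labels

def lasso_cd(X, y, lam, n_iter=100):
    """Plain Lasso, min (1/2n)||y - Xb||^2 + lam*||b||_1, by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                      # residual y - Xb
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]       # remove gene j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]       # add the updated contribution back
    return b

def select_genes(X, y, labels, lam=0.2):
    """Stage 1: Lasso inside each group. Stage 2: Lasso across the survivors."""
    keep = []
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        b = lasso_cd(X[:, idx], y, lam)
        keep.extend(idx[b != 0])
    keep = np.array(keep, dtype=int)
    b = lasso_cd(X[:, keep], y, lam)
    return keep[b != 0]

def one_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

def run_pipeline(seed=0, n_train=80):
    X, y = make_data(seed=seed)
    X_tr, y_tr, X_te, y_te = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    # Standardize with training statistics only, so the test set stays unseen.
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    Z_tr, Z_te = (X_tr - mu) / sd, (X_te - mu) / sd
    labels = correlation_groups(Z_tr)
    genes = select_genes(Z_tr, y_tr - y_tr.mean(), labels)
    pred = one_nn_predict(Z_tr[:, genes], y_tr, Z_te[:, genes])
    return labels, genes, float((pred == y_te).mean())

if __name__ == "__main__":
    labels, genes, acc = run_pipeline()
    print("gene groups found:", len(np.unique(labels)))
    print("selected genes:", genes.tolist())
    print("1NN test accuracy:", acc)
```

Grouping and selection are fitted on the training split only, so the held-out accuracy is not inflated by selecting genes on the test samples. Replacing the plain Lasso with a Bayesian Lasso inside each group and a group Lasso across groups would bring the sketch closer to the framework the abstract describes.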


