Evaluation of Clustering Based on Preprocessing in Gene Expression Data
Authors: Seo Young Kim, Toshimitsu Hamasaki
Abstract:
Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and organizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we describe and compare the performance of several clustering methods under different data preprocessing strategies, including normalization and noise removal. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results show that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise and clarity of the datasets.
Keywords: Gene expression, clustering, data preprocessing.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1079642
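The kind of comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' procedure: the simulated data, the per-gene standardization used as the "normalization" step, the plain k-means implementation used as one commonly used clustering method, and the Rand index used as the validation measure. The function names (standardize_genes, kmeans, rand_index, simulate) are hypothetical.

```python
import numpy as np


def standardize_genes(x):
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant genes at 0 instead of dividing by 0
    return (x - mu) / sd


def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per sample (row)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def rand_index(a, b):
    """Fraction of sample pairs on which two partitions agree (1.0 = identical)."""
    a = np.asarray(a)
    b = np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())


def simulate(seed=0):
    """Hypothetical data: 40 samples in 2 groups, 30 genes."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], 20)
    x = rng.normal(0.0, 1.0, size=(40, 30))
    x[truth == 1, :10] += 2.0  # 10 genes that separate the groups
    x[:, 25:] *= 50.0          # 5 uninformative genes on a much larger scale
    return x, truth


x, truth = simulate()
for name, data in [("raw", x), ("standardized", standardize_genes(x))]:
    labels = kmeans(data, k=2)
    print(name, round(rand_index(truth, labels), 3))
```

Comparing the two printed Rand index values gives a rough sense of how strongly a preprocessing choice can change the agreement between a clustering and the known grouping. A fuller study would vary the noise level, the clustering method, and the validation measure.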