An Improved K-Means Algorithm for Gene Expression Data Clustering
Authors: Billel Kenidra, Mohamed Benmohammed
Abstract:
Data mining techniques for clustering are a subject of active research and assist in biological pattern recognition and the extraction of new knowledge from raw data. Clustering is the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Several clustering methods are based on partitional clustering. This category attempts to decompose the dataset directly into a set of disjoint clusters, yielding an integer number of clusters that optimizes a given criterion function. The criterion function may emphasize a local or a global structure of the data, and its optimization is an iterative relocation procedure. The K-Means algorithm is one of the most widely used partitional clustering techniques. Because K-Means is extremely sensitive to the initial choice of centers, and a poor choice may lead to a local optimum that is far inferior to the global optimum, we propose a strategy for initializing the K-Means centers. The improved K-Means algorithm is compared with the original K-Means, and the results show that efficiency is significantly improved.
Keywords: Microarray data mining, biological pattern recognition, partitional clustering, k-means algorithm, centroid initialization.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1317204
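The sensitivity to initial centers described in the abstract can be illustrated with a minimal sketch of the standard K-Means iterative relocation procedure (Lloyd's algorithm). This is not the initialization strategy proposed in the paper, which the abstract does not specify. The function names `sq_dist` and `kmeans`, the four-point toy dataset, and the two sets of starting centers are assumptions chosen only for illustration, and the criterion function is assumed to be the within-cluster sum of squared errors (SSE).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, init_centers, max_iter=100):
    """Standard K-Means (Lloyd) from the given initial centers.

    Returns (centers, labels, sse), where sse is the within-cluster
    sum of squared errors (the assumed criterion function).
    """
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                # Relocate the center to the mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old center if the cluster is empty.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # no center moved: converged
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse


# Hypothetical toy data: corners of a 4 x 1 rectangle, k = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting centers taken from opposite short sides.
_, _, sse_good = kmeans(points, [(0, 0), (4, 0)])
# Starting centers taken from the same short side.
_, _, sse_bad = kmeans(points, [(0, 0), (0, 1)])

print(sse_good)  # 1.0
print(sse_bad)   # 16.0
```

Both runs converge, but the second stops at a local optimum whose SSE is sixteen times that of the first, which is the failure mode that motivates a careful choice of initial centers.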