A Minimum Spanning Tree-Based Method for Initializing the K-Means Clustering Algorithm
Authors: J. Yang, Y. Ma, X. Zhang, S. Li, Y. Zhang
Abstract:
The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima for the reason that it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of minimum spanning tree (MST) is presented. The set of vertices in MST with same degree are regarded as a whole which is used to find the skeleton data points. Furthermore, a distance measure between the skeleton data points with consideration of degree and Euclidean distance is presented. Finally, MST-based initialization method for the k-means algorithm is presented, and the corresponding time complexity is analyzed as well. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.
Keywords: Degree, initial cluster center, k-means, minimum spanning tree.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1128109
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1552References:
[1] G. Gan, C. Ma, J. Wu, Data clustering: Theory, algorithms, and applications, ASA-SIAM series on statistics and applied probability, SIAM, Philadelphia: McGraw-Hill, 2007.
[2] N. Abdelhamid, A. Ayesh, F. Thabtah, “Phishing detection based associative classification data mining,” Expert Syst. Appl., vol.41, no.13, pp. 5948–5959, 2014.
[3] O.A. Abbas, “Comparisons between data clustering algorithms,” Int. Arab J. Inform. Technol., vol.5, no.3, pp. 320–325, 2008.
[4] F. Höppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy cluster analysis: methods for classification, data analysis and image recognition, New York: John Wiley & Sons, USA, 1999.
[5] E. Boundaillier, G. Hebrail, “Interactive interpretation of hierarchical clustering,” Intell. Data Anal., vol.2, no.1, pp. 229–244, 1997.
[6] E. Forgy, “Cluster analysis of multivariate data: efficiency vs. interpretability of classification,” Biometrics, vol.21, no.3, pp. 41–52, 1965.
[7] J.B. Macqueen, “Some methods for classification and analysis of multivariate observation,” in Proc. of Berkeley Symposium on Mathematical Statistics and Probability, California, 1967, pp. 281–297.
[8] T. Gonzalez, “Clustering to minimize the maximum intercluster distance,” Theor. Comput. Sci., vol.38, no.2, pp. 293–306, 1985.
[9] Shehroz S. Khan, Amir Ahmad, “Cluster center initialization algorithm for k-means clustering,” Pattern Recognition Letters, vol.25, no.11, pp. 1293–1302, 2004.
[10] Redmond, S. J., & Heneghan, C. “A method for initialising the k-means clustering algorithm using kd-trees,” Pattern Recognition Letters, vol.28, no.8, pp. 965–973, 2007.
[11] Arthur, D., & Vassilvitskii, S. “k-means++: The advantages of careful seeding,” in: Proc. of the 18th annual ACM-SIAM symposium on discrete algorithms, New Orleans, Louisiana, 2007, pp.1027–1035.
[12] Cao, F., Liang, J., & Bai, L. “A new initialization method for categorical data clustering,” Expert Syst. Appl., vol.36, no.7, pp.10223–10228, 2009.
[13] Damodar Reddy, Devender Mishra, Prasanta K. Jana, MST-based cluster initialization for k-means, Berlin: Springer-verlag, 2010, pp.329-338.
[14] Lan Huang, Shixian Du, Yu Zhang, Yaolong Ju, Zhuo Li, “K-means initial clustering center optimal algorithm based on kruskal,” Journal of Information & Computational Science, vol.9, no.9, pp.2387–2392, 2012.
[15] M. Mahajan, P. Nimbor, K. Varadarajan, “The planar k-means problem is NP-hard,” Theoretical Computer Science, vol.442, no.8, pp.13–21, 2012.
[16] Zahn, C.T. “Graph-theoretical methods for detecting and describing gestalt clusters,” IEEE Trans. on Computers, vol.20, no.1, pp.68–86, 1971.