Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 31107
K-Means for Spherical Clusters with Large Variance in Sizes

Authors: A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey, M. A. Ramadan


Data clustering is an important data exploration technique with many applications in data mining. The k-means algorithm is well known for its efficiency in clustering large data sets. However, this algorithm is suitable for spherical shaped clusters of similar sizes and densities. The quality of the resulting clusters decreases when the data set contains spherical shaped with large variance in sizes. In this paper, we introduce a competent procedure to overcome this problem. The proposed method is based on shifting the center of the large cluster toward the small cluster, and recomputing the membership of small cluster points, the experimental results reveal that the proposed algorithm produces satisfactory results.

Keywords: Cluster Analysis, Data Clustering, k-means

Digital Object Identifier (DOI):

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2993


[1] R. Agrawal , J. Gehrke , D. Gunopulos, and P. Raghavan, "Automatic SubspaceClustering of High Dimensional Data for Data Mining Applications," Proc.ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA, 1998, pp. 94-105.
[2] M R. Anderberg Cluster Analysis for Applications, Academic Press, 1973.
[3] M. Ankerst , M. Breunig , H.P. Kriegel , and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," Proc. ACM SIGMOD Int. Con. Management of Data Mining, 1999, pp. 49-60.
[4] L. Bottou, and Y. Bengio, "Convergence properties of the k-mean algorithm," The MIT press, Cambridge, MA, 1995, pp. 585-592.
[5] P. S. Bradley , and U. M. Fayyad, "Refining Initial Points for K-Means Clustering," Proc. of the 15th International Conference on Machine Learning (ICML98), J. Shavlik (ed.). Morgan Kaufmann, San Francisco, 1998, pp. 91-99.
[6] S. Deelers, and S. Auwatanamongkol, "Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance," PWASET vol 26, 2007, pp. 323- 328.
[7] R.O. Duda , and P.E. Hart, Pattern Classification and Scene Analysis. John Wiley Sons, New York, 1973.
[8] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, Portland, 1996, pp.226-231.
[9] A. M. Fahim, A. M. Salem ,F. A. Torkey, and M. Ramadan, "An efficient enhanced k-means clustering algorithm," Journal of Zhejiang University SCIENCE A, vol 7(10), 2006, pp. 1626-1633.
[10] U.M. Fayyad , G. Piatetsky-Shapiro , P. Smyth , and R. Uthurusamy Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[11] A. Gersho, and R.M. Gray Vector Quantization and Signal Compression, Kluwer Academic, Boston, 1992.
[12] S. Guha , R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithms for Large Databases," Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA, 1998, pp. 73-84.
[13] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen, and P. Fraenti, "Improving K-Means by Outlier Removal," SCIA 2005, LNCS 3540, 2005, pp. 978-987.
[14] V. Hautamaeki , I. Kaerkkaeinen, and P. Fraenti, "Outlier detection using k-nearest neighbourgraph," In: 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, United Kingdom, 2004, pp. 430-433.
[15] A. Hinneburg, and D. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise," Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York City, NY. 1998.
[16] Z. Huang, "A fast clustering algorithm to cluster very large categorical data sets in data mining," Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Dept. of Computer Science, The University of British Columbia, Canada, 1997, pp. 1-8.
[17] A. K. Jain, and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[18] L. Kaufman, and P. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley, 1990.
[19] J.B. MacQueen, "Some methods for classification and analysis of multivariate observations. Proc. 5th Symp. Mathematical Statistics and Probability, Berkelely, CA, Vol(1), 1967, pp. 281297.
[20] R.T. Ng, and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int. Conf. on Very Large Data Bases. Morgan Kaufmann Publishers, San Francisco, CA, 1994, pp. 144-155.
[21] D. T. Pham , S. S. Dimov, and C. D. Nguyen, "Selection of k in K-means clustering," Mechanical Engineering Science, vol(219), 2004, pp. 103- 119.
[22] S. Ray, and R. H. Turi, "Determination of number of clusters in k-means clustering and application in colour image segmentation," In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp.137-143.
[23] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "Wave-Cluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. 24th Int. Conf. on Very Large Data Bases. New York, 1998, pp. 428-439.
[24] R. Sibson, "SLINK: an optimally efficient algorithm for the single-link cluster method," The Comp. Journal, 16(1), 1973, pp. 30-34.
[25] T. Zhang, R. Ramakrishnan, and M. Linvy, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM Press, New York, 1996, pp. 103-114.