An Efficient and Generic Hybrid Framework for High Dimensional Data Clustering

Dharmveer Singh Rajput; P. K. Singh; Mahua Bhattacharya

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33156

An Efficient and Generic Hybrid Framework for High Dimensional Data Clustering

Authors: Dharmveer Singh Rajput , P. K. Singh, Mahua Bhattacharya

Abstract:

Clustering in high dimensional space is a difficult problem which is recurrent in many fields of science and engineering, e.g., bioinformatics, image processing, pattern reorganization and data mining. In high dimensional space some of the dimensions are likely to be irrelevant, thus hiding the possible clustering. In very high dimensions it is common for all the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Hence, performance of the clustering algorithm decreases. In this paper, we propose an algorithmic framework which combines the (reduct) concept of rough set theory with the k-means algorithm to remove the irrelevant dimensions in a high dimensional space and obtain appropriate clusters. Our experiment on test data shows that this framework increases efficiency of the clustering process and accuracy of the results.

Keywords: High dimensional clustering, sub-space, k-means, rough set, discernibility matrix.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1328704

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1965

References:

[1] K. Jain, M. N. Murty, and P. J. Flynn. "Data clustering: a review". ACM Computing Surveys (CSUR) 31(3):264-323, 1999.
[2] M. K. Jiawei Han. "Data Mining: Concepts and Techniques", Chapter 8, pages 335-393. Morgan Kaufmann Publishers, 2001.
[3] Carlotta Domeniconi, Dimitris Papadopoulos, Dimitrios Gunopulos, Sheng Ma, "Subspace Clustering of High Dimensional Data", 517-521, SIAM, 1998.
[4] Kun Niu, Shubo Zhang, and Junliang Chen, "Subspace clustering through attribute clustering", Front. Electr. Electron. Eng. China 2008, 3(1): 44- 48.
[5] A. K. Jain and R. C. Dubes. "Algorithms for clustering data". Prentice- Hall, Inc., 1988.
[6] M. Zait and H. Messatfa. "A comparative study of clustering methods". Future Generation Computer Systems, 13(2-3):149-159, November 1997.
[7] J. Ghosh. Handbook of Data Mining, chapter Scalable Clustering Methods for Data Mining. Lawrence Erlbaum Assoc, 2003.
[8] J. Han, M. Kamber, and A. K. H. Tung. "Geographic Data Mining and Knowledge Discovery", chapter Spatial clustering methods in data mining: A survey, pages 188-217. Taylor and Francis, 2001.
[9] I. H. Witten and E. Frank. Data Mining: Practical Machine Leaning Tools and Techniques with Java Implementations, chapter 6.6, pages 210-228. Morgan Kaufmann, 2000.
[10] C. C. Aggarwal and P. S. Yu. "Finding generalized projected clusters in high dimensional spaces". In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 70-81. ACM Press, 2000.
[11] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[12] L. Ert├äoz, M. Steinbach, and V. Kumar. "Finding clusters of deferent sizes, shapes, and densities in noisy, high dimensional data". In Proceedings of the 2003 SIAM International Conference on Data Mining, 2003.
[13] Lance Parsons, Ehtesham Haque, Huan Liu. "Subspace Clustering for High Dimensional Data: A Review"; Supported in part by grants from Prop 301 (No. ECR A601) and CEINT 2004.
[14] Skowron, A., Pawlak, Z., Komorowski, J., & Polkowski, L. "A Rough set perspective on data and knowledge". Handbook of data mining and knowledge discovery, pp. 134-149, Oxford University Press, 2002.
[15] Anil K. Jain; "Data Clustering: 50 Years Beyond K-Means"; To appear in Pattern Recognition Letters, 2009.
[16] Drineas, P., Frieze, A., Kannan, R., Vempala, S., & Vinay, V. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3), 9-33, 1999.
[17] R.C. Belwal, J. Varshney, S. Ahmed Khan, A. Sharma & M. Bhattacharya, " Hiding Sensitive Association Rules Efficiently By Introducing New Variable Hiding Counter IEEE International Conference on Service, Operations & Logistics and Informatics, proceedings, Vol. I, pp: 130-134, 12th - 15th Oct. Beijing, China 2008.
[18] G. Liu, J. Li, K. Sim, and L. Wong. "Distance based subspace clustering with flexible dimension partitioning". In Proc. IEEE ICDE, pages 1250- 1254, 2007.
[19] Sunita Jahirabadkar, and Parag Kulkarni, ISC - Intelligent Subspace Clustering, A Density based Clustering approach for High Dimensional Dataset, World Academy of Science, Engineering and Technology 55 2009.
[20] Hans-Peter Kriegel, Peer Krger, Matthias Renz, Sebastian Wurst, "A Generic Framework for Efficient Subspace Clustering of High- Dimensional Data ", In proc. 5th IEEE International Conference of Data Mining (ICDM), Houston, TX, 2005.
[21] McQueen J (1967) some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281-297.
[22] http://www.uni-koeln.de/themen/statistik/data/cluster/milk.dat.
[23] Dunn, 1974. Dunn, J. (1974) well separated clusters and optimal fuzzy partitions. Journal of Cybernetics ,4, 95-104.
[24] Davies & Bouldin, 1979. Davies, D.L., Bouldin, D.W., (2000) A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell., 1(4), 224- 227.
[25] Arun Jagota. Novelty detection on a very large number of memories stored in a Hopfield-style network. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-91), 1991.