Clustering Categorical Data Using Hierarchies (CLUCDUH)

Authors: Gökhan Silahtaroğlu

Abstract:

Clustering large populations is an important problem when the data contain noise and clusters of different shapes. A good clustering algorithm should be efficient yet sensitive enough to detect such clusters. Besides space complexity, time complexity gains importance as the data size grows. Using hierarchies, we developed a new algorithm that splits on attributes according to the values they take, choosing the dimension for each split so as to divide the database into parts that are as nearly equal as possible. At each node we calculate certain descriptive statistical features of the data residing there, and by pruning we generate the natural clusters with a complexity of O(n).

Keywords: Clustering, tree, split, pruning, entropy, gini.
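
The splitting step described in the abstract can be illustrated with a minimal sketch. It is not the published CLUCDUH procedure: it assumes that "roughly equal parts" can be approximated by picking the categorical attribute whose value distribution has the highest Shannon entropy, and the function names (`value_entropy`, `most_balanced_attribute`, `split_by_attribute`) are hypothetical. Pruning and the per-node statistics are not shown.

```python
from collections import Counter, defaultdict
from math import log2


def value_entropy(records, attr):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    counts = Counter(r[attr] for r in records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def most_balanced_attribute(records, attrs):
    """Assumed criterion: the attribute with the highest value entropy,
    i.e. the one whose values divide the records most evenly."""
    return max(attrs, key=lambda a: value_entropy(records, a))


def split_by_attribute(records, attr):
    """Partition the records into one child node per distinct value of attr."""
    parts = defaultdict(list)
    for r in records:
        parts[r[attr]].append(r)
    return dict(parts)
```

For four records in which a hypothetical attribute `color` splits 3 to 1 and `size` splits 2 to 2, `most_balanced_attribute` selects `size`. Raw entropy favours attributes with many distinct values, so a practical criterion might normalise it or use a Gini-style measure instead.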

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1329320

