DCBOR: A Density Clustering Based on Outlier Removal

A. M. Fahim; G. Saake; A. M. Salem; F. A. Torkey; M. A. Ramadan

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 32804

DCBOR: A Density Clustering Based on Outlier Removal

Authors: A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey, M. A. Ramadan

Abstract:

Data clustering is an important data exploration technique with many applications in data mining. We present an enhanced version of the well known single link clustering algorithm. We will refer to this algorithm as DCBOR. The proposed algorithm alleviates the chain effect by removing the outliers from the given dataset. So this algorithm provides outlier detection and data clustering simultaneously. This algorithm does not need to update the distance matrix, since the algorithm depends on merging the most k-nearest objects in one step and the cluster continues grow as long as possible under specified condition. So the algorithm consists of two phases; at the first phase, it removes the outliers from the input dataset. At the second phase, it performs the clustering process. This algorithm discovers clusters of different shapes, sizes, densities and requires only one input parameter; this parameter represents a threshold for outlier points. The value of the input parameter is ranging from 0 to 1. The algorithm supports the user in determining an appropriate value for it. We have tested this algorithm on different datasets contain outlier and connecting clusters by chain of density points, and the algorithm discovers the correct clusters. The results of our experiments demonstrate the effectiveness and the efficiency of DCBOR.

Keywords: Data Clustering, Clustering Algorithms, Handling Noise, Arbitrary Shape of Clusters.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1074549

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1887

References:

[1] M. Ankerst , M. M. Breunig and H-P. Kriegel , "OPTICS: Ordering Points to Identify the Clustering Structure". in Proc. ACM SIGMOD, 1999, pp. 49-60.
[2] M. Emre Celebi , Y. Alp Aslandogan , and P. R. Bergstresser "Mining biomedical images with density-based clustering." In ITCC -05: Proceedings of the International Conference on Information Technology: Coding and Computing, volume I, pages 163-168, Washington, DC, USA, 2005. IEEE Computer Society.
[3] M. Ester , H.-P. Kriegel , J. Sander and X. Xu , "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". in Proc. Knowledge Discovery and Data Mining, 1996, pp. 226- 231.
[4] L. Ertoz , M. Steinbach and V. Kumar , "A new shared nearest neighbor clustering algorithm and its applications", AHPCRC, Tech. Rep. 134, 2002.
[5] S. Guha, R. Rastogi , and K. Shim, "CURE: An Efficient Clustering Algorithms for Large Databases." Proc. ACM SIGMOD Int. Conf. on Management of Data. Seattle, WA, 1998, pp. 73-84.
[6] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen , and P. Fraenti "Improving K-Means by Outlier Removal", LNCS Springer Berlin / Heidelberg, may 2005, pp. 978-987.
[7] A. Hinneburg and D. A. Keim , "An Efficient Approach to Clustering in Large Multimedia Databases with Noise," in Proc. Knowledge Discovery and Data Mining, 1998, pp. 58-65.
[8] A. K. Jain and, R. C.Dubes "Algorithms for Clustering Data." Prentice Hall, 1988.
[9] A. K. Jain , M. N. Murty , and P. J. Flynn , "Data Clustering: A Review", ACM Computing Surveys, vol. 31, no 3, pp. 264-323, Sep. 1999.
[10] G. Karypis , E. H. Han and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling." Computer,32, pp. 68- 75, 1999.
[11] A. McCallum , K. Nigam and L. H. Ungar "Efficient Clustering of HighDimensional Data Sets with Application to Reference Matching." In Proceedings of KDD-2000. pp.169-178.
[12] R. T. Ng and J. Han "Efficient and Effective Clustering Methods for Spatial Data Mining". Proc. 20th Int. Conf. on Very Large Data Bases. Morgan Kaufmann Publishers, San Francisco, CA, 1994, p. 144-155.
[13] J. Sander , M. Ester , H-P. Kriegel , and X. Xu "Density-based clustering in spatial databases: The algorithm gdbscan and its applications." Data Mining and Knowledge Discovery, 2(2):169-194, 1998.
[14] R. Sibson, SLINK: an optimally efficient algorithm for the single-link cluster method. The Comp. Journal, 1973, 16(1), p. 30-34.
[15] T. Zhang , R. Ramakrishnan and M. Linvy BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM Press, New York, 1996, p. 103-114.