Class Outliers Mining: Distance-Based Approach

Nabil M. Hewahi; Motaz K. Saad

Abstract:

In large datasets, identifying exceptional or rare cases with respect to a group of similar cases is a very significant problem. The traditional problem (Outlier Mining) is to find exceptional or rare cases in a dataset irrespective of their class labels; such cases are considered rare events with respect to the whole dataset. In this research, we pose the problem of Class Outliers Mining and present a method for finding those outliers. The general definition of this problem is "given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels". We introduce a novel definition of outlier, the Class Outlier, and propose the Class Outlier Factor (COF), which measures the degree to which a data object is a Class Outlier. Our work includes a proposal of a new algorithm for mining Class Outliers, experimental results on real-world datasets from various domains, and a comparison study with other related methods.
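
To make the problem statement concrete, the following is a minimal sketch of the general idea only; it is not the COF formula or the algorithm proposed in the paper, neither of which is given in this abstract. It assumes Euclidean distance and a simple illustrative score: the fraction of an object's k nearest neighbors that share its class label. The function name, the parameter k, and the toy data are all hypothetical. An object with a low score sits among objects of another class and is therefore a candidate class outlier, even if it would not stand out with respect to the whole dataset.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def same_class_neighbor_fraction(points, labels, k):
    """For each point, return the fraction of its k nearest neighbors
    (Euclidean distance) that carry the same class label.

    Illustrative score only; this is NOT the paper's Class Outlier Factor.
    """
    fractions = []
    for i, p in enumerate(points):
        # Rank every other point by its distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        nearest = others[:k]
        same = sum(1 for j in nearest if labels[j] == labels[i])
        fractions.append(same / k)
    return fractions


# Hypothetical toy data: two tight clusters. The point at index 3 lies
# inside the first cluster but carries the label of the second one.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

fractions = same_class_neighbor_fraction(points, labels, k=3)
# The lowest score marks the strongest class outlier candidate.
suspect = min(range(len(points)), key=lambda i: fractions[i])
print(suspect, fractions[suspect])
```

In this toy data the mislabeled point is not far from its neighbors, so a conventional distance-based outlier test that ignores labels would have no reason to flag it; only the class labels make it suspicious.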

Keywords: Class Outliers, Distance-Based Approach, Outliers Mining.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1078088

