http://dx.doi.org/10.5351/KJAS.2021.34.3.507

A survey on unsupervised subspace outlier detection methods for high dimensional data  

Ahn, Jaehyeong (Department of Applied Statistics, Konkuk University)
Kwon, Sunghoon (Department of Applied Statistics, Konkuk University)
Publication Information
The Korean Journal of Applied Statistics, v.34, no.3, 2021, pp. 507-521
Abstract
Detecting outliers in high-dimensional data poses the challenge of screening the variables, since the relevant information is often contained in only a few of them. When many irrelevant variables are included, the distances between observations tend to become similar, so that all observations appear equally outlying. Subspace outlier detection methods overcome this problem by measuring the degree of outlierness of each observation on relevant subsets of the entire set of variables. In this paper, we survey recent subspace outlier detection techniques, classifying them into three major types according to their subspace selection method, and summarize the techniques of each type in terms of how the relevant subspaces are selected and how the degree of outlierness is measured. In addition, we introduce computing tools for implementing subspace outlier detection and present results from a simulation study and a real data analysis.
Keywords
outlier detection; high-dimensional data; subspace outlier detection
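
As an illustration of the subspace idea summarized in the abstract, the following Python sketch scores observations by averaging local outlier factor (LOF) scores over randomly drawn variable subspaces, in the spirit of feature-bagging ensembles. It is not any particular method compared in the paper; the function name, the subspace sizes, and the use of scikit-learn's LocalOutlierFactor are illustrative assumptions.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_lof(X, n_subspaces=20, n_neighbors=20, random_state=0):
    # Average LOF scores computed on random variable subspaces
    # (a minimal subspace-ensemble sketch, not the paper's methods;
    # assumes X has at least two variables and more than n_neighbors rows).
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        # draw a random subspace containing between d//2 and d-1 variables
        k = rng.integers(d // 2, d)
        cols = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        lof.fit(X[:, cols])
        # negate scikit-learn's convention so that larger means more outlying
        scores += -lof.negative_outlier_factor_
    return scores / n_subspaces

# Usage: rank observations by their averaged subspace LOF score.
# scores = feature_bagging_lof(X)
# ranking = np.argsort(scores)[::-1]   # most outlying observations first

Observations that receive a high averaged score across many random subspaces are flagged as outliers even when their outlierness is visible in only a few of the variables.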