A survey on unsupervised subspace outlier detection methods for high dimensional data

  • Ahn, Jaehyeong (Department of Applied Statistics, Konkuk University)
  • Kwon, Sunghoon (Department of Applied Statistics, Konkuk University)
  • Received: 2021.02.16
  • Accepted: 2021.03.05
  • Published: 2021.06.30

Abstract

Detecting outliers in high-dimensional data poses the challenging problem of screening variables, since the information relevant to outlier detection is often contained in only a few of them. When many irrelevant variables are included in the data, the distances between observations tend to become similar, a phenomenon known as the concentration effect, which in turn makes the degrees of outlierness of all observations nearly alike. Subspace outlier detection methods overcome this problem by measuring the degree of outlierness of each observation on relevant subsets of the full variable set. In this paper, we survey recent subspace outlier detection techniques, classifying them into three major types according to how the subspaces are selected, and we summarize the techniques of each type in terms of how they select relevant subspaces and how they measure the degree of outlierness. In addition, we introduce computing tools for implementing subspace outlier detection techniques and present results from a brief simulation study of the concentration effect and a real data analysis.
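
The concentration effect can be demonstrated in a few lines of code. The sketch below is an illustration under our own assumptions, not the paper's simulation study: it tracks the ratio of the largest to the smallest pairwise distance among uniform random points as pure-noise dimensions are added. A ratio near 1 means distance-based outlierness scores can no longer distinguish observations.

```python
# Illustrative sketch of the concentration effect (not the paper's
# simulation study): as irrelevant dimensions accumulate, the farthest
# and nearest pairwise distances become nearly equal, so all
# observations appear about equally outlying.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))   # n points drawn uniformly from [0, 1]^d
    dist = pdist(X)                # all n*(n-1)/2 pairwise Euclidean distances
    print(f"d = {d:4d}: max/min distance ratio = {dist.max() / dist.min():.2f}")
```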

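To illustrate how scoring on variable subsets sidesteps this problem, the toy sketch below follows the random-subspace (feature bagging) idea: it computes local outlier factor (LOF) scores in several randomly drawn variable subsets and averages them. The function name, subspace-size rule, and averaging scheme are our own illustrative choices, not a method from the paper.

```python
# Toy random-subspace (feature bagging) outlier scorer; all names and
# parameter choices here are illustrative assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def subspace_lof_scores(X, n_subspaces=20, n_neighbors=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        k = rng.integers(d // 2, d + 1)            # random subspace size
        cols = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X[:, cols])
        scores += -lof.negative_outlier_factor_    # higher = more outlying
    return scores / n_subspaces

# Example: 95 noise variables hide outliers that live in 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 100))
X[:5, :5] += 6.0                                   # plant 5 outliers
print(np.argsort(subspace_lof_scores(X))[-5:])     # typically rows 0..4
```

Applying the same LOF call once to all 100 variables would tend to give the planted rows scores close to those of the noise rows, which is exactly the full-space concentration problem the subspace approach is designed to avoid.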

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. NRF-2020R1F1A1A01071036).
