Performance Comparison of Anomaly Detection Algorithms: In Terms of Anomaly Type and Data Properties

  • Jaeung Kim (Graduate School of Business IT, Kookmin University) ;
  • Seung Ryul Jeong (Graduate School of Business IT, Kookmin University) ;
  • Namgyu Kim (Graduate School of Business IT, Kookmin University)
  • Submitted: 2023.08.26
  • Reviewed: 2023.09.08
  • Published: 2023.09.30

Abstract

With the growing emphasis on anomaly detection across various fields, diverse anomaly detection algorithms have been developed for different data types and anomaly patterns. However, the performance of these algorithms is generally evaluated only on publicly available benchmark datasets, and how each algorithm performs on particular types of anomalies remains unexamined. Selecting an appropriate anomaly detection algorithm for a specific analytical context is therefore difficult. In this paper, we first characterize anomaly types and various data properties, and on that basis propose guidance for selecting an appropriate anomaly detection algorithm. Specifically, this study compares the performance of anomaly detection algorithms on four types of anomalies: local, global, dependency, and clustered anomalies. Through further analysis, we examine the impact of label level, data quantity, and dimensionality on algorithm performance. The experimental results show that the best-performing algorithm differs by anomaly type, and that certain algorithms maintain stable performance even when no information about the anomaly type is available. We also identify anomaly types for which unsupervised anomaly detection algorithms perform worse than supervised and semi-supervised learning algorithms. Finally, we find that the performance of most algorithms is more strongly affected by the anomaly type when the amount of data is relatively small or large, and that in higher-dimensional settings the algorithms detect local and global anomalies well but perform poorly on clustered anomalies.
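As a minimal sketch of the kind of comparison the study describes, the snippet below injects scattered "global" anomalies into synthetic two-cluster data and scores two unsupervised detectors by ROC-AUC. This is a hypothetical illustration using scikit-learn's IsolationForest and LocalOutlierFactor, not the paper's actual benchmark, algorithms, or datasets.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Normal data: two well-separated Gaussian clusters in 2-D.
normal = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(8.0, 1.0, size=(200, 2))])

# Injected "global" anomalies: points scattered far outside both clusters.
anomalies = rng.uniform(low=-12.0, high=20.0, size=(20, 2))

X = np.vstack([normal, anomalies])
y = np.r_[np.zeros(len(normal)), np.ones(len(anomalies))]  # 1 = anomaly

# IsolationForest: score_samples is higher for inliers, so negate it
# to obtain an anomaly score before computing ROC-AUC.
iso = IsolationForest(random_state=0).fit(X)
auc_iso = roc_auc_score(y, -iso.score_samples(X))

# LOF: negative_outlier_factor_ is also higher for inliers; negate likewise.
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
auc_lof = roc_auc_score(y, -lof.negative_outlier_factor_)

print(f"IsolationForest ROC-AUC: {auc_iso:.3f}")
print(f"LOF ROC-AUC:             {auc_lof:.3f}")
```

Repeating such a run while varying the anomaly generator (local, dependency, or clustered injections), the labeled fraction, the sample size, or the dimensionality mirrors the axes of comparison the abstract outlines.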
