
Selecting the optimal threshold based on impurity index in imbalanced classification

  • Jang, Shuin (Department of Statistics, Sookmyung Women's University) ;
  • Yeo, In-Kwon (Department of Statistics, Sookmyung Women's University)
  • Received : 2021.06.15
  • Accepted : 2021.07.15
  • Published : 2021.10.31

Abstract


In this paper, we propose a method of adjusting the classification threshold using impurity indices in classification analysis of imbalanced data. Suppose the minority category is labeled Positive and the majority category Negative in imbalanced binomial data. When categories are assigned with the conventional threshold of 0.5, the specificity tends to be high on imbalanced data while the sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important, and we explore how to achieve this by adjusting the threshold. Existing studies have tuned thresholds based on measures such as the G-mean and the F1-score; in this paper, we instead propose selecting the optimal threshold using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also introduce a way to obtain a unique value when multiple optimal thresholds arise. An empirical analysis of datasets commonly used as imbalanced-classification benchmarks shows, in terms of classification performance measures, what improvements are achieved compared with the results based on the 0.5 threshold.
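The exact procedure is given in the paper body; the following is only a rough sketch of the idea for the Gini and entropy criteria. All function names, the candidate-threshold grid, and the tie-breaking rule (averaging tied thresholds) are illustrative assumptions rather than the authors' specification, and the CHAID-style chi-square criterion would analogously be maximized instead of minimized.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a group of 0/1 labels (CART-style).
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def entropy(labels):
    # Shannon entropy of a group of 0/1 labels (C4.5-style).
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def weighted_impurity(y, scores, t, impurity):
    # Split observations into predicted-Positive / predicted-Negative at
    # threshold t, then average the two groups' impurities weighted by size.
    pred_pos = scores >= t
    n = len(y)
    return (pred_pos.sum() / n) * impurity(y[pred_pos]) \
         + ((~pred_pos).sum() / n) * impurity(y[~pred_pos])

def best_threshold(y, scores, impurity=gini, grid=None):
    # Choose the threshold minimizing the weighted impurity of the split.
    # The paper discusses handling multiple optima; averaging the tied
    # thresholds is just one possible rule, shown here for illustration.
    if grid is None:
        grid = np.unique(scores)
    vals = np.array([weighted_impurity(y, scores, t, impurity) for t in grid])
    ties = grid[vals == vals.min()]
    return ties.mean() if len(ties) > 1 else ties[0]
```

Unlike tuning by G-mean or F1-score, this criterion scores how purely a candidate threshold separates the two observed classes, so it can be computed from the predicted scores and labels alone.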

Keywords

References

  1. Akosa J (2017). Predictive accuracy: a misleading performance measure for highly imbalanced data, SAS Papers, 942, 1-12.
  2. Blake C and Merz C (1998). UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine.
  3. Breiman L, Friedman JH, Olshen RA, and Stone CJ (1984). Classification and Regression Trees, Chapman & Hall, New York.
  4. Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP (2002). SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357. https://doi.org/10.1613/jair.953
  5. Chawla NV, Lazarevic A, Hall LO, and Bowyer KW (2003). SMOTEBoost: improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 107-119.
  6. Collell G, Prelec D, and Patil KR (2018). A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing, 275, 330-340. https://doi.org/10.1016/j.neucom.2017.08.035
  7. Espindola RP and Ebecken N (2005). On extending f-measure and G-Mean metrics to multi-class problems, WIT Transactions on Information and Communication Technologies, 35, 25-34. https://doi.org/10.2495/DATA050031
  8. Kass GV (1980). An exploratory technique for investigating large quantities of categorical data, Applied Statistics, 29, 119-127. https://doi.org/10.2307/2986296
  9. Kim D, Kang S, and Song J (2015). Classification analysis for unbalanced data, The Korean Journal of Applied Statistics, 28, 495-509. https://doi.org/10.5351/KJAS.2015.28.3.495
  10. Kim HY and Lee W (2017). On sampling algorithms for imbalanced binary data: performance comparison and some caveats, The Korean Journal of Applied Statistics, 30, 681-690. https://doi.org/10.5351/KJAS.2017.30.5.681
  11. Longadge R, Dongre SS, and Malik L (2013). Class imbalance problem in data mining: review, International Journal of Computer Science and Network, 2.
  12. Quinlan JR (1986). Induction of decision trees, Machine Learning, 1, 81-106. https://doi.org/10.1007/BF00116251
  13. Quinlan JR (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
  14. Voigt T, Fried R, Backes M, and Rhode W (2014). Threshold optimization for classification in imbalanced data in a problem of gamma-ray astronomy, Advances in Data Analysis and Classification, 8, 195-216. https://doi.org/10.1007/s11634-014-0167-5
  15. Woods K, Doss C, Bowyer K, Solka J, Priebe C, and Kegelmeyer P (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography, International Journal of Pattern Recognition and Artificial Intelligence, 7, 1417-1436. https://doi.org/10.1142/S0218001493000698
  16. Yu H, Mu C, Sun C, Yang W, Yang X, and Zuo X (2015). Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data, Knowledge-Based Systems, 76, 67-78. https://doi.org/10.1016/j.knosys.2014.12.007
  17. Zou Q, Xie S, Lin Z, Wu M, and Ju Y (2016). Finding the best classification threshold in imbalanced classification, Big Data Research, 5, 2-8. https://doi.org/10.1016/j.bdr.2015.12.001