Comparison of resampling methods for dealing with imbalanced data in binary classification problem

  • Park, Geun U (Division of Biostatistics, Department of Biomedical Systems Informatics, Yonsei University College of Medicine)
  • Jung, Inkyung (Division of Biostatistics, Department of Biomedical Systems Informatics, Yonsei University College of Medicine)
  • Received : 2019.02.15
  • Accepted : 2019.04.02
  • Published : 2019.06.30

Abstract

A class imbalance problem arises in binary data when one class outnumbers the other by a large proportion. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods among the approaches for dealing with imbalance in the classification problem, seeking a way to detect the minority class in the data more effectively. Through simulation, a total of 20 methods, covering over-sampling, under-sampling, and combined over- and under-sampling, were compared. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. The simulation results showed that random under-sampling (RUS) had the highest sensitivity among methods achieving an accuracy over 0.5. The next most sensitive method was the over-sampling method ADASYN (adaptive synthetic sampling approach). This indicates that RUS is well suited for detecting minority-class values. Results from applying the methods to several real data sets were similar to those of the simulation.

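As a rough illustration of the comparison described above (not the authors' code), the sketch below resamples simulated imbalanced data with RUS and ADASYN using the imbalanced-learn toolbox cited in the references, fits a logistic regression, and reports the four performance measures. The sample size and the 95:5 imbalance ratio are illustrative assumptions, not the paper's simulation design.

    # Sketch only: the simulated data and 95:5 imbalance ratio are assumptions.
    from collections import Counter
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (recall_score, accuracy_score,
                                 roc_auc_score, f1_score)
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)   # class 1 is the rare class
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [("RUS", RandomUnderSampler(random_state=0)),
                          ("ADASYN", ADASYN(random_state=0))]:
        # Resample the training data only; the test set keeps its imbalance.
        X_res, y_res = sampler.fit_resample(X_tr, y_tr)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        pred = clf.predict(X_te)
        print(name, Counter(y_res),
              "sensitivity=%.3f" % recall_score(y_te, pred),
              "accuracy=%.3f" % accuracy_score(y_te, pred),
              "AUC=%.3f" % roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
              "F1=%.3f" % f1_score(y_te, pred))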

Keywords


Figure 2.1. Changes in data set after applying various over-sampling methods.
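
The exact set of over-sampling methods shown in Figure 2.1 is given in the article body; as a hedged sketch, these are the corresponding samplers available in the imbalanced-learn toolbox (reference 10), all sharing the same fit_resample interface.

    # Sketch: the over-sampling variants shipped with imbalanced-learn;
    # whether Figure 2.1 covers exactly this set is an assumption.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import (RandomOverSampler, SMOTE,
                                        BorderlineSMOTE, SVMSMOTE, ADASYN)

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    for s in (RandomOverSampler(random_state=0), SMOTE(random_state=0),
              BorderlineSMOTE(random_state=0), SVMSMOTE(random_state=0),
              ADASYN(random_state=0)):
        X_res, y_res = s.fit_resample(X, y)   # minority class is grown to balance
        print(type(s).__name__, Counter(y_res))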


Figure 2.2. Changes in data set after applying various CNN-based under-sampling methods.
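
Of the CNN (condensed nearest neighbor) based methods compared here (Hart, 1968; Gates, 1972; Tomek, 1976b), the imbalanced-learn toolbox provides CondensedNearestNeighbour and OneSidedSelection (CNN combined with Tomek links); a hedged sketch follows, noting that the paper's other CNN modifications have no direct counterpart in that toolbox.

    # Sketch: CNN-family under-sampling as available in imbalanced-learn.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import CondensedNearestNeighbour, OneSidedSelection

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    for s in (CondensedNearestNeighbour(random_state=0),
              OneSidedSelection(random_state=0)):
        X_res, y_res = s.fit_resample(X, y)   # majority class is condensed
        print(type(s).__name__, Counter(y_res))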


Figure 2.3. Changes in data set after applying various under-sampling methods.
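
For reference, a minimal sketch of common under-sampling methods as implemented in imbalanced-learn; whether Figure 2.3 covers exactly this set is an assumption.

    # Sketch: under-sampling variants shipped with imbalanced-learn.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import (RandomUnderSampler, NearMiss, TomekLinks,
                                         EditedNearestNeighbours,
                                         NeighbourhoodCleaningRule)

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    for s in (RandomUnderSampler(random_state=0), NearMiss(version=1),
              NearMiss(version=2), NearMiss(version=3), TomekLinks(),
              EditedNearestNeighbours(), NeighbourhoodCleaningRule()):
        X_res, y_res = s.fit_resample(X, y)   # majority class is reduced
        print(type(s).__name__, Counter(y_res))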


Figure 2.4. Changes in data set after applying two combined methods.
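
The two combined methods of Batista et al. (2004), SMOTE followed by ENN cleaning and SMOTE followed by Tomek-link cleaning, are available in imbalanced-learn; a hedged sketch (the toy data are assumptions):

    # Sketch: combined over- and under-sampling via imbalanced-learn.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTEENN, SMOTETomek

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    for s in (SMOTEENN(random_state=0), SMOTETomek(random_state=0)):
        X_res, y_res = s.fit_resample(X, y)   # over-sample, then clean the result
        print(type(s).__name__, Counter(y_res))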


Figure 3.1. Sensitivity, accuracy, AUC, and F1-score of logistic regression for simulation 3.


Figure 3.2. Sensitivity, accuracy, AUC, and F1-score of SVM for simulation 3.


Figure 3.3. Sensitivity, accuracy, AUC, and F1-score of random forest for simulation 3.


Figure 3.4. An example of the original data set and the changed data set after applying the NM2 method when the rare class values were distributed at two extremes.
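
The behavior in Figure 3.4 can be reproduced with a toy example (an assumption, not the paper's data): NM2 (NearMiss version 2) keeps the majority samples whose average distance to the farthest minority samples is smallest, so when the rare class sits at both extremes the retained majority points concentrate in the middle.

    # Toy illustration (assumed data) of the NM2 behavior in Figure 3.4.
    import numpy as np
    from imblearn.under_sampling import NearMiss

    rng = np.random.default_rng(0)
    X_maj = rng.uniform(-10, 10, size=(200, 1))      # majority spread over the range
    X_min = np.r_[rng.normal(-9, 0.5, (10, 1)),      # rare class at both extremes
                  rng.normal(9, 0.5, (10, 1))]
    X = np.r_[X_maj, X_min]
    y = np.r_[np.zeros(200, dtype=int), np.ones(20, dtype=int)]

    X_res, y_res = NearMiss(version=2).fit_resample(X, y)
    # Retained majority points cluster near the center, away from the rare class.
    print(X_res[y_res == 0].min(), X_res[y_res == 0].max())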


Figure 4.1. Sensitivity, accuracy, AUC, and F1-score of logistic regression, SVM, and random forest for solar flare m0 data.

Table 2.1. Misclassification table

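The misclassification table is the usual 2x2 table of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). The performance measures reported in the paper follow from it by the standard definitions; the counts below are hypothetical.

    # Standard definitions computed from a hypothetical 2x2 table.
    TP, FN, FP, TN = 30, 20, 40, 910   # illustrative counts only

    sensitivity = TP / (TP + FN)                 # recall of the rare class
    accuracy = (TP + TN) / (TP + FN + FP + TN)
    precision = TP / (TP + FP)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    print(sensitivity, accuracy, f1)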

Table 4.1. Example data sets


References

  1. Batista, G. E., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 6, 20-29. https://doi.org/10.1145/1007730.1007735
  2. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357. https://doi.org/10.1613/jair.953
  3. Gates, G. (1972). The reduced nearest neighbor rule (Corresp.), IEEE Transactions on Information Theory, 18, 431-433. https://doi.org/10.1109/TIT.1972.1054809
  4. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. (2017). Learning from class-imbalanced data: review of methods and applications, Expert Systems with Applications, 73, 220-239. https://doi.org/10.1016/j.eswa.2016.12.035
  5. Han, H., Wang, W. Y., and Mao, B. H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Springer, Berlin, Heidelberg.
  6. Hart, P. (1968). The condensed nearest neighbor rule (Corresp.), IEEE Transactions on Information Theory, 14, 515-516. https://doi.org/10.1109/TIT.1968.1054155
  7. He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
  8. He, H. and Garcia, E. A. (2009). Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 21, 1263-1284. https://doi.org/10.1109/TKDE.2008.239
  9. Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe (pp. 63-66).
  10. Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning, Journal of Machine Learning Research, 18, 1-5.
  11. Mani, I. and Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of Workshop on Learning from Imbalanced Datasets II, ICML (Vol. 126), Washington.
  12. Moon, S. Y. (2018). Performance comparison of classification methods based on the random forest in class imbalanced data (Master's thesis), Korea University, Seoul.
  13. Prati, R. C., Batista, G. E., and Monard, M. C. (2009). Data mining with imbalanced class distributions: concepts and methods. In Proceedings of the 4th Indian International Conference on Artificial Intelligence (pp. 359-376), Tumkur, Karnataka.
  14. Smith, M. R., Martinez, T., and Giraud-Carrier, C. (2014). An instance level analysis of data complexity, Machine Learning, 95, 225-256. https://doi.org/10.1007/s10994-013-5422-z
  15. Tomek, I. (1976a). An experiment with the edited nearest-neighbor rule, IEEE Transactions on Systems, Man, and Cybernetics, 6, 448-452. https://doi.org/10.1109/TSMC.1976.4309523
  16. Tomek, I. (1976b). Two modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772. https://doi.org/10.1109/TSMC.1976.4309452
  17. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man, and Cybernetics, 2, 408-421. https://doi.org/10.1109/TSMC.1972.4309137