Classification of Class-Imbalanced Data: Effect of Over-sampling and Under-sampling of Training Data

  • Kim, Ji-Hyun (Department of Information Statistics, College of Natural Sciences, Soongsil University) ;
  • Jeong, Jong-Bin (Department of Information Statistics, College of Natural Sciences, Soongsil University)
  • Published: 2004.11.01

Abstract

Given class-imbalanced data in a two-class classification problem, a common practice is to over-sample and/or under-sample the training data so that the two classes are roughly balanced before fitting a classifier. We investigate the validity of this practice, and also study its effect on boosting of classification trees. Experiments on twelve real datasets show that, when boosting is applied to classification trees, it is best to keep the natural class distribution of the training data.
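The resampling practice examined in the paper can be made concrete with a short sketch. The Python code below is not from the paper: the synthetic dataset, the scikit-learn AdaBoost implementation (which postdates the paper), and the oversample_minority helper are all illustrative assumptions. It compares boosted classification trees fit on the natural class distribution against the same model fit on a randomly over-sampled (balanced) training sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class, class-imbalanced data
# (a stand-in for the paper's twelve real datasets).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until the two classes are balanced."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

X_bal, y_bal = oversample_minority(X_tr, y_tr, np.random.default_rng(0))

# Boosted classification trees (AdaBoost's default base learner is a
# depth-1 decision tree), fit once per training-sample composition.
for name, (Xf, yf) in [("natural", (X_tr, y_tr)), ("over-sampled", (X_bal, y_bal))]:
    clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(Xf, yf)
    print(f"--- {name} training sample ---")
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Under-sampling works the same way in reverse: rows of the majority class are randomly discarded until the two classes match, at the cost of throwing away training information.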

References

  1. Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105-139. https://doi.org/10.1023/A:1007515423169
  2. Blake, C. L. and Merz, C. J. (1998). UCI Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html
  3. Boz, O. (2001). Cost Sensitive Learning Bibliography. http://home.ptd.net/~olcay/costsensitive.html
  4. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
  5. Drummond, C. and Holte, R. (2000). Explicitly representing expected cost: An alternative to ROC representation. Technical Report, School of Information Technology and Engineering, University of Ottawa.
  6. Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139. https://doi.org/10.1006/jcss.1997.1504
  7. Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28, 337-374.
  8. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
  9. Therneau, T. M. and Atkinson, E. J. (1997). An introduction to recursive partitioning using the RPART routines. Technical Report, Mayo Foundation.
  10. Ting, K. M. (2000). A comparative study of cost-sensitive boosting algorithms. Proceedings of the 17th International Conference on Machine Learning, 983-990.
  11. Ting, K. M. and Zheng, Z. (1998). Boosting cost-sensitive trees. Proceedings of the First International Conference on Discovery Science, 244-255.
  12. Weiss, G. M. and Provost, F. (2001). The effect of class distribution on classifier learning. Technical Report, Department of Computer Science, Rutgers University.
  13. Woods, K., Doss, C., Bowyer, K., Solka, J., Priebe, C., and Kegelmeyer, P. (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. International Journal of Pattern Recognition and Artificial Intelligence, 7, 1417-1436.

Cited by

  1. A Study on Improving Classification Performance for Manufacturing Process Data with Multicollinearity and Imbalanced Distribution vol.41, pp.1, 2015, https://doi.org/10.7232/JKIIE.2015.41.1.025
  2. A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data vol.27, pp.3, 2014, https://doi.org/10.5351/KJAS.2014.27.3.357
  3. Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data vol.28, pp.1, 2015, https://doi.org/10.5351/KJAS.2015.28.1.009