On sampling algorithms for imbalanced binary data: performance comparison and some caveats

  • Kim, HanYong (Department of Statistics, Inha University)
  • Lee, Woojoo (Department of Statistics, Inha University)
  • Received: 2017.07.17
  • Reviewed: 2017.09.12
  • Published: 2017.10.31

Abstract

Imbalanced binary classification problems arise in many everyday settings, such as fraud detection in banking operations, spam mail detection, and the prediction of defective products. It is well known that binary classifiers predict poorly when the class proportions of the response variable are highly imbalanced. To address this problem, several sampling methods, including over-sampling, under-sampling, and SMOTE, have been developed. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machines when combined with these sampling methods on imbalanced binary data. Four real data sets are analyzed to check whether prediction performance improves substantially. We also emphasize some precautions to take when the sampling methods are applied.
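
As a rough illustration of the three sampling methods named in the abstract, the following Python sketch rebalances a small toy data set by random over-sampling, random under-sampling, and a simplified SMOTE-style interpolation. The function names, the toy data, and the neighbour count `k=3` are assumptions made for this example only; they are not taken from the paper, and the SMOTE variant here is a minimal approximation of the published algorithm.

```python
import random
from collections import Counter


def random_over_sample(X, y, minority, rng):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_majority = len(y) - len(minority_rows)
    n_extra = max(n_majority - len(minority_rows), 0)
    extra = [list(rng.choice(minority_rows)) for _ in range(n_extra)]
    return X + extra, y + [minority] * n_extra


def random_under_sample(X, y, minority, rng):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    kept = sorted(minority_idx + rng.sample(majority_idx, len(minority_idx)))
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_like(X, y, minority, rng, k=3):
    """Create synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_new = max((len(y) - len(minority_rows)) - len(minority_rows), 0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [minority] * n_new


# Hypothetical toy data: 4 minority rows (label 1) and 16 majority rows (label 0).
rng = random.Random(0)
X = [[float(i), float(i % 5)] for i in range(20)]
y = [1 if i < 4 else 0 for i in range(20)]

X_over, y_over = random_over_sample(X, y, 1, rng)
X_under, y_under = random_under_sample(X, y, 1, rng)
X_sm, y_sm = smote_like(X, y, 1, rng)

print("original:", Counter(y))
print("over    :", Counter(y_over))
print("under   :", Counter(y_under))
print("smote   :", Counter(y_sm))
```

On this toy data, over-sampling and the SMOTE-style routine both end with 16 rows per class, while under-sampling ends with 4 rows per class. One commonly cited precaution, which the abstract itself does not spell out, is to resample only the training portion of the data so that the test or validation rows keep the original class ratio.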

Keywords
