[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5351/KJAS.2017.30.5.681

On sampling algorithms for imbalanced binary data: performance comparison and some caveats

Kim, HanYong (Department of Statistics, Inha University)
Lee, Woojoo (Department of Statistics, Inha University)

Publication Information

The Korean Journal of Applied Statistics / v.30, no.5, 2017, pp. 681-690

Abstract

Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one class dominates. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
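
The three sampling methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy data, the 0/1 label coding (1 is the minority class), and the choice to balance the classes exactly are all assumptions made for illustration. The SMOTE function follows the basic idea of interpolating between a minority point and one of its k nearest minority neighbours.

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Split rows into (minority, majority); assumes labels 1 = minority, 0 = majority."""
    minority = [x for x, label in zip(X, y) if label == 1]
    majority = [x for x, label in zip(X, y) if label == 0]
    return minority, majority


def random_over_sample(X, y, rng):
    """Duplicate minority rows at random until both classes have the same size."""
    minority, majority = split_by_class(X, y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra, [0] * len(majority) + [1] * (len(minority) + len(extra))


def random_under_sample(X, y, rng):
    """Keep a random subset of majority rows as large as the minority class."""
    minority, majority = split_by_class(X, y)
    kept = rng.sample(majority, len(minority))
    return kept + minority, [0] * len(kept) + [1] * len(minority)


def smote(X, y, rng, k=5):
    """Add synthetic minority rows by interpolating towards a nearest minority neighbour."""
    minority, majority = split_by_class(X, y)
    synthetic = []
    for _ in range(len(majority) - len(minority)):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([a + gap * (b - a) for a, b in zip(base, target)])
    X_new = majority + minority + synthetic
    y_new = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
    return X_new, y_new


# Hypothetical toy data: 90 majority rows and 10 minority rows in two dimensions.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_over, y_over = random_over_sample(X, y, rng)
X_under, y_under = random_under_sample(X, y, rng)
X_smote, y_smote = smote(X, y, rng)

for name, labels in [("over", y_over), ("under", y_under), ("SMOTE", y_smote)]:
    print(name, dict(Counter(labels)))
```

In practice such resampling is usually applied to the training portion only, so that evaluation data keep the original class proportions. Whether this is among the precautions the authors discuss is not stated in the abstract.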

Keywords

imbalanced binary data; sampling; classifier; prediction

