Browse > Article
http://dx.doi.org/10.5351/KJAS.2017.30.5.681

On sampling algorithms for imbalanced binary data: performance comparison and some caveats  

Kim, HanYong (Department of Statistics, Inha University)
Lee, Woojoo (Department of Statistics, Inha University)
Publication Information
The Korean Journal of Applied Statistics / v.30, no.5, 2017 , pp. 681-690 More about this Journal
Abstract
Various imbalanced binary classification problems exist such as fraud detection in banking operations, detecting spam mail and predicting defective products. Several sampling methods such as over sampling, under sampling, SMOTE have been developed to overcome the poor prediction performance of binary classifiers when the proportion of one group is dominant. In order to overcome this problem, several sampling methods such as over-sampling, under-sampling, SMOTE have been developed. In this study, we investigate prediction performance of logistic regression, Lasso, random forest, boosting and support vector machine in combination with the sampling methods for binary imbalanced data. Four real data sets are analyzed to see if there is a substantial improvement in prediction performance. We also emphasize some precautions when the sampling methods are implemented.
Keywords
imbalanced binary data; sampling; classifier; prediction;
Citations & Related Records
연도 인용수 순위
  • Reference
1 Altini, M. (2015). Dealing with imbalanced data: undersampling, oversampling and proper cross-validation. http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation.
2 Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence research, 16, 321-357.   DOI
3 Dal Pozzolo, A., Caelen, O., Waterschoot, S., and Bontempi, G. (2013). Racing for unbalanced methods selection. In International Conference on Intelligent Data Engineering and Automated Learning, (pp.24-31), Springer, Berlin, Heidelberg.
4 Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33, 1-22.
5 Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. (2012). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42, 463-484.   DOI
6 He, H. and Garcia, E. A. (2009). Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 21, 1263-1284.   DOI
7 He, H. and Ma, Y (2013). Imbalanced Learning: Foundations, Algorithms, and Applications, Wiley-IEEE Press, New Jersey.
8 Hulse, J. V., Khoshgoftaar, T. M., and Napolitano, A. (2007). Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning, 935-942.
9 Kuhn, M. (2016). Building predictive models in R using the caret package, Journal of Statistical Software, 28(5).
10 Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest, R News, 2, 18-22.
11 Longadge, R. and Dongre, S. (2013). Class imbalance problem in data mining review, arXiv preprint arXiv:1305.1707
12 Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2017). e1071: Misc Functions of the Department of Statistics, R package version 1.6-8.
13 Ren, P., Yao, S., Li, J., Valdes-Sosa, P. A., and Kendrick, K. M. (2015). Improved prediction of preterm delivery using empirical mode decomposition analysis of uterine electromyography signals, PLOS ONE, 10, e0132116   DOI
14 Ridgeway, G. (2017). gbm: generalized boosted regression models, R package version 2.1.3.
15 Xie, J. and Qiu, Z. (2007). The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition, 40, 557-562.   DOI