http://dx.doi.org/10.5351/KJAS.2015.28.3.495

Classification Analysis for Unbalanced Data  

Kim, Dongah (Department of Statistics, Ewha Womans University)
Kang, Suyeon (Department of Statistics, Ewha Womans University)
Song, Jongwoo (Department of Statistics, Ewha Womans University)
Publication Information
The Korean Journal of Applied Statistics / v.28, no.3, 2015, pp. 495-509
Abstract
We study classification problems in which the two class proportions differ substantially, known as unbalanced classification problems. It is usually harder to classify observations accurately in unbalanced data than in balanced data. When standard classification methods are applied to unbalanced data, most observations tend to be assigned to the larger group, because doing so minimizes the overall misclassification loss. However, misclassifying the smaller group as the larger one is often the costlier error in real applications. We compare several classification methods for unbalanced data using sampling techniques (up- and down-sampling). We also examine the total loss of the different classification methods when an asymmetric loss is applied to simulated and real data. We use the misclassification rate, G-mean, ROC, and AUC (area under the curve) for the performance comparison.
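As a rough illustration of the ideas in the abstract (the paper itself works in R, so this Python sketch and its function names are assumptions, not the authors' code), up-sampling and down-sampling simply rebalance the class counts before model fitting, and the G-mean scores a classifier by the geometric mean of its per-class accuracies, so that ignoring the minority class is penalized:

```python
import random

def down_sample(data, label_key="y"):
    """Randomly discard majority-class cases until all classes are equally sized."""
    by_class = {}
    for row in data:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(random.sample(rows, n_min))  # sample without replacement
    return balanced

def up_sample(data, label_key="y"):
    """Resample minority-class cases with replacement up to the majority size."""
    by_class = {}
    for row in data:
        by_class.setdefault(row[label_key], []).append(row)
    n_max = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        balanced.extend(random.choices(rows, k=n_max - len(rows)))  # with replacement
    return balanced

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return (sens * spec) ** 0.5
```

For example, a 90/10 split down-samples to 10 cases per class and up-samples to 90 per class; a classifier that predicts only the majority class gets a G-mean of 0 even though its misclassification rate is just 10%, which is why the paper reports G-mean and AUC alongside the error rate.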
Keywords
up-sampling; down-sampling; asymmetric loss; misclassification rate; G-mean; ROC; AUC; logistic regression; SVM; random forest