
Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data

  • Received : 2014.09.18
  • Accepted : 2015.01.13
  • Published : 2015.02.28

Abstract

The support vector machine (SVM) has been successfully applied to classification problems in many areas due to its flexibility and high level of classification accuracy. However, when analyzing imbalanced data with uneven class sizes, the classification accuracy of the SVM can drop significantly for the minority class because the SVM classifier is undesirably biased toward the majority class. The weighted $L_2$-norm SVM, which assigns class-specific misclassification costs, was developed for the analysis of imbalanced data; however, it cannot identify irrelevant input variables because of the characteristics of its ridge-type penalty. Therefore, we propose the weighted $L_1$-norm SVM, which uses a lasso-type penalty to select important input variables and class-specific weights to differentiate the misclassification costs of training observations between classes. We demonstrate the satisfactory performance of the proposed method through simulation studies and a real data analysis.
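The abstract describes the proposed criterion only in words. A plausible form of the underlying optimization problem, obtained by combining the class-weighted hinge loss of Veropoulos et al. (1999) and Lin et al. (2002) with the lasso penalty of the 1-norm SVM of Zhu et al. (2003), is sketched below; the symbols $w^{+}$, $w^{-}$ and $\lambda$ are our notation and may not match the paper's.

$$\min_{\beta_0,\,\boldsymbol\beta}\;\; w^{+}\!\!\sum_{i:\,y_i=+1}\!\bigl[1-y_i(\beta_0+\mathbf{x}_i^{\top}\boldsymbol\beta)\bigr]_{+} \;+\; w^{-}\!\!\sum_{i:\,y_i=-1}\!\bigl[1-y_i(\beta_0+\mathbf{x}_i^{\top}\boldsymbol\beta)\bigr]_{+} \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

Here $[u]_{+}=\max(u,0)$ is the hinge loss, $w^{+}>w^{-}$ places a larger misclassification cost on the minority (positive) class, and the lasso penalty shrinks the coefficients of irrelevant inputs exactly to zero; replacing $\sum_j\lvert\beta_j\rvert$ with $\sum_j\beta_j^{2}$ gives back a weighted $L_2$-norm SVM.

Because both the hinge loss and the lasso penalty are piecewise linear, the criterion can be solved as a linear program. The R sketch below uses the lpSolve package cited in reference 4; it is a minimal illustration under the assumptions above (binary $y_i\in\{-1,+1\}$ with the minority class coded $+1$, and user-chosen $\lambda$, $w^{+}$, $w^{-}$), not the authors' implementation.

    library(lpSolve)

    weighted.l1.svm <- function(x, y, lambda = 1, w.plus = 5, w.minus = 1) {
      n <- nrow(x); p <- ncol(x)
      # Nonnegative decision variables (lpSolve assumes all variables >= 0):
      #   b0.pos, b0.neg, beta.pos (p), beta.neg (p), xi (n hinge-loss slacks)
      cost <- ifelse(y == 1, w.plus, w.minus)    # class-specific misclassification costs
      obj  <- c(0, 0, rep(lambda, 2 * p), cost)  # lambda*||beta||_1 + sum_i cost_i*xi_i
      # Margin constraints: y_i*(b0 + x_i'beta) + xi_i >= 1 for every observation
      yx <- y * x
      A  <- cbind(y, -y, yx, -yx, diag(n))
      fit <- lp(direction = "min", objective.in = obj,
                const.mat = A, const.dir = rep(">=", n), const.rhs = rep(1, n))
      sol <- fit$solution
      list(intercept = sol[1] - sol[2],
           beta = sol[3:(p + 2)] - sol[(p + 3):(2 * p + 2)])  # many betas come out exactly zero
    }

    # Toy imbalanced example: only the first two of ten inputs are informative
    set.seed(1)
    x <- matrix(rnorm(200 * 10), 200, 10)
    y <- ifelse(x[, 1] - x[, 2] + rnorm(200, sd = 0.5) > 1.8, 1, -1)
    fit  <- weighted.l1.svm(x, y, lambda = 2, w.plus = sum(y == -1) / sum(y == 1))
    pred <- sign(fit$intercept + x %*% fit$beta)

Setting $w^{+}$ to the majority-to-minority size ratio, as in the example, is one common heuristic; in practice the tuning constants would be chosen by cross-validation on a criterion that accounts for both classes.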

Keywords

References

  1. Akbani, R., Kwek, S. and Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets, In Proceedings of the European Conference on Machine Learning, 3201, 39-50.
  2. Bang, S. and Jhun, M. (2014). Weighted support vector machine using k-means clustering, Communications in Statistics-Simulation and Computation, 43, 2307-2324. https://doi.org/10.1080/03610918.2012.762388
  3. Barandela, R., Sanchez, J., Garcia, V. and Rangel, E. (2003). Strategies for learning in class imbalance problems, Pattern Recognition, 36, 849-851. https://doi.org/10.1016/S0031-3203(02)00257-1
  4. Berkelaar, M. and others (2014). lpSolve: Interface to Lp solve v. 5.5 to solve linear/integer programs. R package version 5.6.10. http://CRAN.R-project.org/package=lpSolve.
  5. Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W. (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357.
  6. Cohen, G., Hilario, M., Sax, H., Hugonnet, S. and Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection, Artificial Intelligence in Medicine, 37, 7-18. https://doi.org/10.1016/j.artmed.2005.03.002
  7. Cortes, C. and Vapnik, V. (1995). Support vector networks, Machine Learning, 20, 273-297.
  8. Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets, International Journal of Emerging Technology and Advanced Engineering, 2, 42-47.
  9. Garcia, V., Sanchez, J. S., Mollineda, R. A., Alejo, R. and Sotoca, J. M. (2007). The class imbalance problem in pattern classification and learning, In Proceedings of the 5th Spanish Workshop on Data Mining and Learning, 283-291.
  10. Han, H., Wang, W. Y. and Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, 3644, 878-887.
  11. Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
  12. Japkowicz, N. (2000). The class imbalance problem: Significance and strategies, In Proceedings of the 2000 International Conference on Artificial Intelligence: Special Track on Inductive Learning, 1, 111-117.
  13. Kim, J. and Jeong, J. (2004). Classification of class-imbalanced data: Effect of over-sampling and undersampling of training data, The Korean Journal of Applied Statistics, 17, 445-457. https://doi.org/10.5351/KJAS.2004.17.3.445
  14. Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection, In Proceedings of the Fourteenth International Conference on Machine Learning, 179-186.
  15. Lee, H. and Lee, S. (2014). A comparison of ensemble methods combining resampling techniques for class imbalanced data, The Korean Journal of Applied Statistics, 27, 357-371. https://doi.org/10.5351/KJAS.2014.27.3.357
  16. Lin, Y., Lee, Y. and Wahba, G. (2002). Support vector machines for classification in nonstandard situations, Machine Learning, 46, 191-202. https://doi.org/10.1023/A:1012406528296
  17. Liu, Y., An, A. and Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles, In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 3918, 107-118.
  18. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
  19. Tang, Y., Zhang, Y., Chawla, N. and Krasser, S. (2009). SVMs modeling for highly imbalanced classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39, 281-288. https://doi.org/10.1109/TSMCB.2008.2002909
  20. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58, 267-288.
  21. Turlach, B. and Weingessel, A. (2013). quadprog: Functions to solve quadratic programming problems. R package version 1.5-5. http://CRAN.R-project.org/package=quadprog.
  22. Vapnik, V. N. (1998). Statistical Learning Theory, Wiley, New York.
  23. Veropoulos, K., Campbell, C. and Cristianini, N. (1999). Controlling the sensitivity of support vector machines, In Proceedings of the International Joint Conference on AI, 55-60.
  24. Wang, B. X. and Japkowicz, N. (2009). Boosting support vector machines for imbalanced data sets, Knowledge and Information Systems, 25, 1-20.
  25. Wang, L., Zhu, J. and Zou, H. (2006). The doubly regularized support vector machine, Statistica Sinica, 16, 589-615.
  26. Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2003). 1-norm support vector machines, Neural Information Processing Systems, 16, 49-56.

Cited by

  1. Hierarchically penalized support vector machine for the classification of imbalanced data with grouped variables, vol. 29, no. 5, 2016, https://doi.org/10.5351/KJAS.2016.29.5.961