Hierarchically penalized support vector machine for the classification of imbalanced data with grouped variables

Support vector machine for the classification of imbalanced data containing grouped variables

  • Received : 2016.06.09
  • Accepted : 2016.07.07
  • Published : 2016.08.31

Abstract

The hierarchically penalized support vector machine (H-SVM) was developed to perform classification and input variable selection simultaneously when the input variables are naturally grouped or generated by factors. However, the H-SVM may suffer from estimation inefficiency because it applies the same amount of shrinkage to every variable without assessing its relative importance. In addition, when analyzing imbalanced data with uneven class sizes, the classification accuracy of the H-SVM may drop significantly for the minority class because its classifiers are undesirably biased toward the majority class. To remedy these problems, we propose the weighted adaptive H-SVM (WAH-SVM), which uses adaptive tuning parameters to improve variable selection and class-specific weights to penalize misclassification differently between classes. Numerical results are presented to demonstrate the competitive performance of the proposed WAH-SVM relative to existing SVM methods.

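The H-SVM is a method that can select groups and the variables within groups simultaneously when estimating a classification function with grouped input variables. However, because the H-SVM shrinks all variables by the same amount regardless of their importance, estimation efficiency may decrease. In addition, when classifying imbalanced data in which the class sizes differ, the classification function is estimated with bias, so predictive performance for the minority class may deteriorate. To remedy these problems, this paper proposes the WAH-SVM, which uses adaptive tuning parameters to improve variable selection and assigns different misclassification costs to each class. We also compare the performance of the proposed model with that of existing methods through simulation studies and a real data analysis, and confirm the usefulness and applicability of the proposed model.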
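The two ingredients named in the abstract, class-specific misclassification weights and adaptive penalty weights, can be illustrated with a minimal Python sketch. This is not the WAH-SVM objective itself, which the abstract does not state: the hierarchical group penalty is replaced here by a plain adaptively weighted L1 penalty, and the function names, the inverse-frequency weighting scheme, and the form 1/|beta_init|^gamma are all assumptions chosen for illustration.

```python
import numpy as np


def class_weights(y):
    """Observation weights that up-weight the minority class.

    Assumed scheme: each point is weighted by the proportion of the
    *other* class, so errors on the rarer class cost more.
    Labels are expected to be +1 / -1.
    """
    y = np.asarray(y)
    n = y.size
    n_pos = np.sum(y == 1)
    n_neg = n - n_pos
    return np.where(y == 1, n_neg / n, n_pos / n)


def adaptive_penalty_weights(beta_init, gamma=1.0, eps=1e-8):
    """Penalty weights 1 / |beta_init|^gamma (assumed adaptive form).

    Coefficients that look important in an initial fit receive a
    smaller penalty; eps guards against division by zero.
    """
    beta_init = np.asarray(beta_init, dtype=float)
    return 1.0 / (np.abs(beta_init) + eps) ** gamma


def weighted_adaptive_objective(beta0, beta, X, y, obs_w, pen_w, lam):
    """Class-weighted hinge loss plus adaptively weighted L1 penalty."""
    margins = y * (beta0 + X @ beta)
    hinge = np.maximum(0.0, 1.0 - margins)
    return np.sum(obs_w * hinge) + lam * np.sum(pen_w * np.abs(beta))


# Toy data: one minority (+1) point and three majority (-1) points.
X = np.array([[2.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, -1, -1, -1])

obs_w = class_weights(y)                       # minority point weighted 0.75
pen_w = adaptive_penalty_weights([1.0, 0.5])   # roughly [1, 2]
value = weighted_adaptive_objective(
    0.0, np.array([1.0, 0.0]), X, y, obs_w, pen_w, lam=0.1
)
print(value)
```

In this toy example only the third point violates the margin, and because it belongs to the majority class its hinge loss is scaled by the smaller weight. A full method would minimize such an objective over the coefficients, for example with a linear programming solver; that step is omitted here.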
