A comparative study of feature screening methods for ultrahigh dimensional multiclass classification

  • Received : 2017.08.29
  • Accepted : 2017.10.13
  • Published : 2017.10.31

Abstract

We compare various variable screening methods for multiclass classification problems when the data are ultrahigh-dimensional. Two approaches are considered: (1) pairwise extension of binary classification methods via one-versus-one or one-versus-rest comparisons and (2) direct classification of multiclass responses. We conducted extensive simulation studies under four conditions: heavy-tailed explanatory variables, correlated signal and noise variables, variables that are correlated jointly but uncorrelated marginally, and unbalanced response variables. We then analyzed real data to examine the performance of the methods. The results show that model-free methods perform better in multiclass classification problems, as they do in binary ones.

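This paper presents a comparative study of variable screening methods for multiclass classification of ultrahigh-dimensional data. Variable screening methods for multiclass classification fall into two groups: those that extend binary classification methods through one-versus-one or one-versus-rest comparisons, and those that can be applied directly to a multiclass response. To assess screening performance, we ran simulations under several scenarios (heavy-tailed explanatory variables, signal and noise variables that are correlated with each other, variables that are associated in the joint distribution but not in the marginal distributions, and an unbalanced multiclass response) and also applied the methods to real data. The results confirm that methods requiring no model assumptions perform stably.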
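
The one-versus-rest extension described in the abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which binary screening statistics were compared, so a two-sample Kolmogorov-Smirnov statistic is used here as a stand-in, and the function names (`ks_statistic`, `one_vs_rest_scores`, `screen`) and the toy data are hypothetical.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def one_vs_rest_scores(X, y):
    """Score each feature by its largest binary statistic over all class-vs-rest splits."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        best = 0.0
        for c in classes:
            inside = [v for v, label in zip(column, y) if label == c]
            outside = [v for v, label in zip(column, y) if label != c]
            best = max(best, ks_statistic(inside, outside))
        scores.append(best)
    return scores


def screen(X, y, d):
    """Return the indices of the d highest-scoring features, best first."""
    scores = one_vs_rest_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:d]


# Toy data (assumed for illustration): 3 balanced classes, 90 observations,
# 50 features. Only feature 0 carries signal; its mean shifts by 4 per class.
rng = random.Random(0)
y = [k for k in range(3) for _ in range(30)]
X = [[rng.gauss(4.0 * label if j == 0 else 0.0, 1.0) for j in range(50)]
     for label in y]

selected = screen(X, y, d=3)
print(selected)
```

A one-versus-one variant would replace the class-vs-rest split with a loop over all pairs of classes and aggregate the pairwise statistics in the same way. Taking the maximum is one possible aggregation rule; a sum or average over splits is an equally plausible choice.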

Acknowledgement

Supported by: National Research Foundation of Korea
