
Ensemble variable selection using genetic algorithm

  • Received : 2022.03.08
  • Accepted : 2022.04.29
  • Published : 2022.11.30

Abstract

Variable selection is one of the most crucial tasks in supervised learning, such as regression and classification. Best subset selection is straightforward and optimal but not practically applicable unless the number of predictors is small. In this article, we propose solving best subset selection directly via the genetic algorithm (GA), a popular stochastic optimization algorithm based on the principle of Darwinian evolution. To further improve variable selection performance, we propose running multiple GAs and synthesizing their results, which we call the ensemble GA (EGA). The EGA significantly improves variable selection performance. Moreover, because the proposed method is essentially best subset selection, it applies to a variety of models with different selection criteria. We compare the proposed EGA to existing variable selection methods under various models, including linear regression, Poisson regression, and Cox regression for survival data. Both simulation and real data analyses demonstrate the promising performance of the proposed method.
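To make the idea concrete, below is a minimal sketch of the EGA scheme the abstract describes: each GA run evolves binary inclusion vectors over the predictors, and repeated runs are synthesized by selection frequency. This is an illustrative assumption of one way to realize the method, not the authors' implementation; the BIC/OLS fitness, the GA parameter values, the `ga_best_subset` and `ega` function names, and the 0.5 frequency threshold are all hypothetical choices.

```python
# Hypothetical sketch of ensemble GA (EGA) for best subset selection.
# Chromosome: boolean vector (True = predictor included); fitness: BIC of an
# OLS fit. Multiple independent GA runs are aggregated by selection frequency.
import numpy as np

rng = np.random.default_rng(0)

def bic(X, y, mask):
    """BIC of an OLS fit using the predictors flagged in `mask`."""
    n = len(y)
    k = int(mask.sum())
    if k == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        Xs = np.column_stack([np.ones(n), X[:, mask]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + (k + 1) * np.log(n)

def ga_best_subset(X, y, pop=50, gens=100, mut=0.05):
    """One GA run: returns the best inclusion vector found."""
    p = X.shape[1]
    popu = rng.integers(0, 2, size=(pop, p)).astype(bool)
    for _ in range(gens):
        fit = np.array([bic(X, y, c) for c in popu])
        order = np.argsort(fit)               # smaller BIC is better
        parents = popu[order[: pop // 2]]     # truncation selection
        # uniform crossover between randomly paired parents
        idx = rng.integers(0, len(parents), size=(pop, 2))
        cross = rng.random((pop, p)) < 0.5
        children = np.where(cross, parents[idx[:, 0]], parents[idx[:, 1]])
        children ^= rng.random((pop, p)) < mut  # bit-flip mutation
        popu = children
        popu[0] = parents[0]                  # elitism: keep current best
    fit = np.array([bic(X, y, c) for c in popu])
    return popu[np.argmin(fit)]

def ega(X, y, runs=20, threshold=0.5):
    """Ensemble GA: synthesize independent runs by selection frequency."""
    freq = np.mean([ga_best_subset(X, y) for _ in range(runs)], axis=0)
    return freq >= threshold, freq

# Toy example: 3 informative predictors out of 10.
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.standard_normal(n)
selected, freq = ega(X, y)
print("selection frequencies:", np.round(freq, 2))
print("selected predictors:", np.flatnonzero(selected))
```

Because the fitness function is just a model selection criterion evaluated on a candidate subset, the same skeleton extends to other models (e.g., Poisson or Cox regression) by swapping the OLS/BIC fitness for the corresponding criterion, which is the model-agnostic property the abstract highlights.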

Acknowledgement

This work was funded by National Research Foundation of Korea (NRF) grants (2018R1D1A1B07043034, 2019R1A4A1028134) and a Korea University grant (K2000461).
