
Ensemble variable selection using genetic algorithm

  • Received : 2022.03.08
  • Accepted : 2022.04.29
  • Published : 2022.11.30

Abstract

Variable selection is one of the most crucial tasks in supervised learning, such as regression and classification. Best subset selection is straightforward and optimal but not practically applicable unless the number of predictors is small. In this article, we propose solving best subset selection directly via the genetic algorithm (GA), a popular stochastic optimization algorithm based on the principle of Darwinian evolution. To further improve variable selection performance, we propose running multiple GAs and synthesizing their results, which we call the ensemble GA (EGA). The EGA significantly improves variable selection performance. Moreover, because the proposed method is essentially best subset selection, it applies to a variety of models with different selection criteria. We compare the proposed EGA to existing variable selection methods under various models, including linear regression, Poisson regression, and Cox regression for survival data. Both simulation and real data analyses demonstrate the promising performance of the proposed method.
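To make the idea concrete, below is a minimal sketch of the EGA scheme the abstract describes: each GA run evolves binary inclusion vectors over the predictors, and repeated runs are synthesized by selection frequency. This is an illustrative assumption of one way to realize the method, not the authors' implementation; the BIC/OLS fitness, the GA parameter values, the `ga_best_subset` and `ega` function names, and the 0.5 frequency threshold are all hypothetical choices.

```python
# Hypothetical sketch of ensemble GA (EGA) for best subset selection.
# Chromosome: boolean vector (True = predictor included); fitness: BIC of an
# OLS fit. Multiple independent GA runs are aggregated by selection frequency.
import numpy as np

rng = np.random.default_rng(0)

def bic(X, y, mask):
    """BIC of an OLS fit using the predictors flagged in `mask`."""
    n = len(y)
    k = int(mask.sum())
    if k == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        Xs = np.column_stack([np.ones(n), X[:, mask]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + (k + 1) * np.log(n)

def ga_best_subset(X, y, pop=50, gens=100, mut=0.05):
    """One GA run: returns the best inclusion vector found."""
    p = X.shape[1]
    popu = rng.integers(0, 2, size=(pop, p)).astype(bool)
    for _ in range(gens):
        fit = np.array([bic(X, y, c) for c in popu])
        order = np.argsort(fit)               # smaller BIC is better
        parents = popu[order[: pop // 2]]     # truncation selection
        # uniform crossover between randomly paired parents
        idx = rng.integers(0, len(parents), size=(pop, 2))
        cross = rng.random((pop, p)) < 0.5
        children = np.where(cross, parents[idx[:, 0]], parents[idx[:, 1]])
        children ^= rng.random((pop, p)) < mut  # bit-flip mutation
        popu = children
        popu[0] = parents[0]                  # elitism: keep current best
    fit = np.array([bic(X, y, c) for c in popu])
    return popu[np.argmin(fit)]

def ega(X, y, runs=20, threshold=0.5):
    """Ensemble GA: synthesize independent runs by selection frequency."""
    freq = np.mean([ga_best_subset(X, y) for _ in range(runs)], axis=0)
    return freq >= threshold, freq

# Toy example: 3 informative predictors out of 10.
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.standard_normal(n)
selected, freq = ega(X, y)
print("selection frequencies:", np.round(freq, 2))
print("selected predictors:", np.flatnonzero(selected))
```

Because the fitness function is just a model selection criterion evaluated on a candidate subset, the same skeleton extends to other models (e.g., Poisson or Cox regression) by swapping the OLS/BIC fitness for the corresponding criterion, which is the model-agnostic property the abstract highlights.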

Acknowledgement

This work was funded by National Research Foundation of Korea (NRF) grants (2018R1D1A1B07043034, 2019R1A4A1028134) and a Korea University grant (K2000461).
