Effect of outliers on the variable selection by the regularized regression

  • Jeong, Junho (Department of Statistics, Pusan National University) ;
  • Kim, Choongrak (Department of Statistics, Pusan National University)
  • Received : 2017.12.20
  • Accepted : 2018.03.09
  • Published : 2018.03.31

Abstract

Many studies exist on the influence of one or a few observations on estimators in a variety of statistical models under the "large n, small p" setup; however, diagnostic issues in regression models have rarely been studied in a high-dimensional setup. In high-dimensional data, the influence of observations is more serious because the sample size n is much smaller than the number of variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 58, 267-288, 1996), and the influence of observations on the variables selected by the LASSO in the high-dimensional setup. We also derive an analytic expression for the influence of the k-th observation on LASSO estimates in simple linear regression. Numerical studies based on artificial and real data are presented for illustration. The numerical results show that the influence of observations on the LASSO estimates and on the variables selected by the LASSO is more severe in the high-dimensional setup than in the usual "large n, small p" setup.
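As a rough illustration of what case influence on a LASSO estimate means in simple linear regression, the sketch below refits the model with each observation deleted and records the change in the slope. It is not the analytic expression derived in the paper, which is not reproduced here. It assumes a no-intercept model with the objective (1/2) Σ (y_i − b x_i)² + λ|b|, for which the slope has the closed soft-thresholding form S(Σ x_i y_i, λ) / Σ x_i². The function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return float(np.sign(z) * max(abs(z) - lam, 0.0))

def lasso_slope(x, y, lam):
    """Closed-form LASSO slope for a no-intercept simple linear regression
    (assumed objective: 0.5 * sum((y - b*x)**2) + lam * |b|)."""
    return soft_threshold(float(x @ y), lam) / float(x @ x)

def case_deletion_influence(x, y, lam):
    """Change in the LASSO slope when each observation is deleted in turn:
    entry k is b_hat(full data) - b_hat(data without observation k)."""
    full = lasso_slope(x, y, lam)
    n = len(x)
    influence = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k  # boolean mask dropping observation k
        influence[k] = full - lasso_slope(x[keep], y[keep], lam)
    return influence

# Hypothetical toy data: the last response is an outlier.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-2.1, -0.9, 0.1, 1.0, 8.0])
influence = case_deletion_influence(x, y, lam=1.0)
most_influential = int(np.argmax(np.abs(influence)))
```

With a large enough λ the slope is shrunk exactly to zero, so a single observation can also decide whether the variable is selected at all. This is the kind of effect on variable selection that the abstract refers to.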

References

  1. Bae W, Noh S, and Kim C (2017). Case influence diagnostics for the significance of the linear regression model, Communications for Statistical Applications and Methods, 24, 155-162.
  2. Box GEP and Cox DR (1964). An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26, 211-252.
  3. Cook RD (1977). Detection of influential observation in linear regression, Technometrics, 19, 15-18.
  4. Hoerl AE and Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12, 55-67. https://doi.org/10.1080/00401706.1970.10488634
  5. Jang DH and Anderson-Cook CM (2017). Influence plots for LASSO, Quality and Reliability Engineering International, 33, 1317-1326. https://doi.org/10.1002/qre.2106
  6. Kim C, Lee J, Yang H, and Bae W (2015). Case influence diagnostics in the lasso regression, Journal of the Korean Statistical Society, 44, 271-279. https://doi.org/10.1016/j.jkss.2014.09.003
  7. Kim J and Lee S (2017). A convenient approach for penalty parameter selection in robust lasso regression, Communications for Statistical Applications and Methods, 24, 651-662. https://doi.org/10.29220/CSAM.2017.24.6.651
  8. Lu T, Pan Y, Kao SY, Kohane I, and Chan J (2004). Gene regulation and DNA damage in the ageing human brain, Nature, 429, 883-891. https://doi.org/10.1038/nature02661
  9. Tibshirani R (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58, 267-288.
  10. Zhao J, Leng C, Li L, and Wang H (2013). High-dimensional influence measure, The Annals of Statistics, 41, 2639-2667. https://doi.org/10.1214/13-AOS1165