http://dx.doi.org/10.29220/CSAM.2019.26.6.575

Unified methods for variable selection and outlier detection in a linear regression  

Seo, Han Son (Department of Applied Statistics, Konkuk University)
Publication Information
Communications for Statistical Applications and Methods, v.26, no.6, 2019, pp. 575-582
Abstract
This paper considers the problem of selecting variables in the presence of outliers. Variable selection and outlier detection cannot be treated separately, because each observation affects the fitted regression equation differently and has a different influence on each variable. We propose a method that performs variable selection and outlier detection simultaneously in a linear regression model. The procedure detects outliers sequentially and uses all-possible-subsets regression for model selection. A simplified version is also proposed to reduce the computational burden. The procedures are compared with other variable selection methods on real data sets known to contain outliers. In these examples, the proposed procedures are effective and outperform robust algorithms in selecting the best model.
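The abstract does not give the details of the procedure, so the following is only a rough sketch of the general idea (sequential outlier detection inside an all-possible-subsets search), not the author's algorithm. Everything specific in it is an assumption made for illustration: the function names `ols_fit`, `sequential_outliers`, and `select_model` are hypothetical, outliers are flagged by an externally studentized residual cutoff of 3, at most a quarter of the cases may be deleted, and candidate models are ranked by a per-observation BIC computed on the retained cases.

```python
import itertools
import numpy as np


def ols_fit(X, y):
    """Least-squares fit: return residuals and leverages (hat-matrix diagonal)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # diag(X X^+) without forming the full n-by-n hat matrix
    hat_diag = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    return resid, hat_diag


def sequential_outliers(X, y, cutoff=3.0, max_frac=0.25):
    """Delete, one case at a time, the observation with the largest absolute
    externally studentized residual while it exceeds `cutoff` (assumed rule)."""
    n, p = X.shape
    keep = np.arange(n)
    removed = []
    while len(removed) < int(max_frac * n) and len(keep) > p + 2:
        e, h = ols_fit(X[keep], y[keep])
        df = len(keep) - p
        # squared internally studentized residuals
        r2 = e**2 / (e @ e / df * np.maximum(1.0 - h, 1e-12))
        # convert to externally studentized residuals
        t = np.sign(e) * np.sqrt(r2 * (df - 1) / np.maximum(df - r2, 1e-12))
        worst = int(np.argmax(np.abs(t)))
        if abs(t[worst]) <= cutoff:
            break
        removed.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    return keep, removed


def select_model(X, y, cutoff=3.0):
    """Search every non-empty subset of predictors; within each subset, screen
    outliers sequentially, then score the cleaned fit (assumed criterion)."""
    n, k = X.shape
    best = None
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
            keep, removed = sequential_outliers(Xs, y, cutoff)
            e, _ = ols_fit(Xs[keep], y[keep])
            m, p = len(keep), Xs.shape[1]
            # per-observation BIC, so fits on different numbers of cases
            # are at least roughly comparable
            score = np.log(e @ e / m) + p * np.log(m) / m
            if best is None or score < best[0]:
                best = (score, cols, sorted(removed))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 3))
    y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
    y[:3] += 15.0  # plant three outliers in the response
    score, cols, removed = select_model(X, y)
    print("selected predictors:", cols)
    print("flagged cases:", removed)
```

The search fits 2^k - 1 subsets, each with its own outlier screening, which illustrates why a cheaper simplified variant is attractive when the number of candidate predictors grows. Note that scoring models fitted to different sets of retained cases is a crude simplification of this sketch; it tends to favor models that delete more observations, which is why the number of deletions is capped here.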
Keywords
outliers; regression diagnostics; robustness; variable selection