Variable selection and prediction performance of penalized two-part regression with community-based crime data application

Seong-Tae Kim; Man Sik Park

doi:10.29220/CSAM.2024.31.4.441

Communications for Statistical Applications and Methods

Volume 31, Issue 4 / Pages 441-457 / 2024 / 2287-7843 (pISSN) / 2383-4757 (eISSN)

The Korean Statistical Society

Seong-Tae Kim (Department of Mathematics & Statistics, NC A&T State University);
Man Sik Park (Department of Statistics, Sungshin Women's University)

Received: 2024.01.07
Reviewed: 2024.05.14
Published: 2024.07.31

https://doi.org/10.29220/CSAM.2024.31.4.441

Abstract

Semicontinuous data are characterized by a mixture of a point probability mass at zero and a continuous distribution of positive values. Such data are often modeled with a two-part model, in which the first part models the probability of a dichotomous outcome (zero or positive) and the second part models the distribution of the positive values. Despite the popularity of the two-part model, variable selection in this setting has not been fully addressed, especially for high-dimensional data. The objective of this study is to investigate the variable selection and prediction performance of penalized regression methods in two-part models. The performance of the selected techniques in the two-part model is evaluated via simulation studies. Our findings show that LASSO and ENET tend to select more predictors than SCAD and MCP. Consequently, MCP and SCAD outperform LASSO and ENET in β-specificity, whereas LASSO and ENET perform better than MCP and SCAD with respect to mean squared error. We find similar results when applying the penalized regression methods to the prediction of crime incidents using community-based data.
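
For readers unfamiliar with the four penalties compared in the abstract, the following is a minimal sketch of their textbook forms for a single coefficient. It is not taken from the paper: the function names are hypothetical, the elastic net uses one common parametrization with a mixing weight alpha, and the default tuning constants (a = 3.7 for SCAD, gamma = 3 for MCP) are conventional choices assumed here for illustration. The sketch shows that the LASSO and ENET penalties keep growing with the size of the coefficient, while the SCAD and MCP penalties level off at a constant.

```python
def lasso_penalty(beta, lam):
    """LASSO (L1) penalty: lam * |beta|."""
    return lam * abs(beta)


def enet_penalty(beta, lam, alpha=0.5):
    """Elastic net penalty: lam * (alpha * |beta| + (1 - alpha) / 2 * beta^2).

    alpha = 1 reduces to the LASSO; alpha = 0 gives a ridge penalty.
    """
    return lam * (alpha * abs(beta) + 0.5 * (1.0 - alpha) * beta ** 2)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (requires a > 2); a = 3.7 is an assumed default."""
    t = abs(beta)
    if t <= lam:
        # Small coefficients are penalized exactly like the LASSO.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region between the linear and flat parts.
        return (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    # Large coefficients receive a constant penalty.
    return lam ** 2 * (a + 1.0) / 2.0


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty (requires gamma > 1); gamma = 3 is an assumed default."""
    t = abs(beta)
    if t <= gamma * lam:
        # The penalization rate decreases linearly from lam to zero.
        return lam * t - t ** 2 / (2.0 * gamma)
    # Beyond gamma * lam the penalty is constant.
    return gamma * lam ** 2 / 2.0


if __name__ == "__main__":
    # Compare the four penalties at a small, moderate, and large coefficient.
    for b in (0.5, 2.0, 10.0):
        print(b, lasso_penalty(b, 1.0), enet_penalty(b, 1.0),
              scad_penalty(b, 1.0), mcp_penalty(b, 1.0))
```

In a two-part model, a penalty of this kind would be added to the objective of each part separately (the zero-versus-positive part and the positive-values part); how the paper tunes and combines them is not stated in the abstract.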

Acknowledgement

Kim is partially supported by NSF Grants 1719498 and 2100729.

