Variable selection and prediction performance of penalized two-part regression with community-based crime data application

  • Seong-Tae Kim (Department of Mathematics & Statistics, NC A&T State University)
  • Man Sik Park (Department of Statistics, Sungshin Women's University)
  • Received : 2024.01.07
  • Accepted : 2024.05.14
  • Published : 2024.07.31

Abstract

Semicontinuous data are characterized by a mixture of a point probability mass at zero and a continuous distribution of positive values. Such data are often modeled with a two-part model, in which the first part models the probability of a dichotomous outcome (zero or positive) and the second part models the distribution of the positive values. Despite the two-part model's popularity, variable selection in this model has not been fully addressed, especially for high-dimensional data. The objective of this study is to investigate the variable selection and prediction performance of penalized regression methods in two-part models. The performance of the selected techniques in the two-part model is evaluated via simulation studies. Our findings show that LASSO and ENET tend to select more predictors than SCAD and MCP. Consequently, MCP and SCAD outperform LASSO and ENET in terms of β-specificity, whereas LASSO and ENET perform better than MCP and SCAD with respect to mean squared error. We find similar results when applying the penalized regression methods to the prediction of crime incidents using community-based data.
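
To make the two-part structure concrete, the following is a minimal sketch of a LASSO-penalized two-part fit on simulated data. It is not the authors' implementation. It assumes an L1-penalized logistic regression for the zero/positive part and an L1-penalized linear regression on the log of the positive values. ENET, SCAD, and MCP are not implemented here. All function names, tuning values, and simulated coefficients are hypothetical choices made for illustration.

```python
import math
import random


def soft_threshold(z, lam):
    """Proximal map of the L1 penalty: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(x, coef):
    return sum(xi * ci for xi, ci in zip(x, coef))


def lasso_logistic(X, z, lam, step=0.5, n_iter=500):
    """Part 1: proximal gradient for mean logistic loss + lam * ||a||_1."""
    n, p = len(X), len(X[0])
    a0, a = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(a0 + dot(x, a)) - zi for x, zi in zip(X, z)]
        a0 -= step * sum(resid) / n  # the intercept is not penalized
        for j in range(p):
            grad = sum(r * x[j] for r, x in zip(resid, X)) / n
            a[j] = soft_threshold(a[j] - step * grad, step * lam)
    return a0, a


def lasso_linear(X, y, lam, n_iter=200):
    """Part 2: coordinate descent for (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of predictor j with the residual that excludes it.
            rho = sum(x[j] * (yi - b0 - dot(x, b) + x[j] * b[j])
                      for x, yi in zip(X, y)) / n
            scale = sum(x[j] ** 2 for x in X) / n
            b[j] = soft_threshold(rho, lam) / scale
        b0 = sum(yi - dot(x, b) for x, yi in zip(X, y)) / n  # unpenalized intercept
    return b0, b


# Simulated semicontinuous data; every "true" value here is an assumption.
random.seed(0)
n, p = 300, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = []
for x in X:
    is_positive = random.random() < sigmoid(0.5 + 2.0 * x[0])
    y.append(math.exp(1.0 + 0.8 * x[1] + random.gauss(0, 0.3)) if is_positive else 0.0)

# Part 1: zero versus positive, fitted on all observations.
z = [1.0 if yi > 0 else 0.0 for yi in y]
a0, a = lasso_logistic(X, z, lam=0.05)

# Part 2: log of the positive values, fitted on the positive subset only.
X_pos = [x for x, yi in zip(X, y) if yi > 0]
log_y_pos = [math.log(yi) for yi in y if yi > 0]
b0, b = lasso_linear(X_pos, log_y_pos, lam=0.1)


def predict(x):
    """Naive prediction: P(Y > 0 | x) times exp of the fitted log-scale mean."""
    return sigmoid(a0 + dot(x, a)) * math.exp(b0 + dot(x, b))


print("part 1 coefficients:", [round(v, 3) for v in a])
print("part 2 coefficients:", [round(v, 3) for v in b])
```

In this sketch each part has its own penalty level and selects its own predictors, so a variable can enter one part and be dropped from the other. The `predict` helper back-transforms from the log scale without a retransformation correction, so it should be read as a rough illustration and not as a calibrated mean prediction.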

Acknowledgement

Kim is partially supported by NSF Grants 1719498 and 2100729.
