Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim (Statistics Discipline, Division of Science and Mathematics, University of Minnesota at Morris)
  • Kee-Jae Lee (Department of Information Statistics, Korea National Open University)
  • Seung-Joo Lee (Department of Data Science, Cheongju University)
  • Received : 2023.08.26
  • Accepted : 2023.09.18
  • Published : 2023.11.30

Abstract

Conventional imputation techniques for categorical data, such as mode imputation, often suffer from overestimation. When a variable has too many categories, imputation by multinomial logistic regression may be computationally infeasible. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify the variables that are important for the target categorical variable. In the second stage, we use these selected variables as predictors: logistic regression imputes missing values in binary variables, polytomous regression imputes missing values in categorical variables, and predictive mean matching imputes missing values in quantitative variables. Using both simulated and real data that are asymmetric and non-normal, we demonstrate that the two-stage imputation method outperforms imputation methods without variable selection, as evidenced by accuracy measures. In an analysis of real survey data, we also show that the proposed two-stage method is more accurate than the current imputation approach.
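
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Python with only the standard library, replaces the random-forest importance used by Boruta with empirical mutual information, simplifies Boruta's statistical test to a majority vote over shuffled "shadow" copies, and replaces polytomous regression with a random draw from the observed target values of complete cases that share the same selected predictor values. The function names, the synthetic variables `x1`, `noise`, and `y`, and the 20% missingness rate are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def select_variables(rows, target, candidates, n_rounds=20, rng=None):
    """Stage 1 (Boruta-like screening, simplified): keep a candidate only if its
    score beats the best shuffled 'shadow' copy in a majority of rounds."""
    rng = rng or random.Random(0)
    y = [r[target] for r in rows]
    cols = {v: [r[v] for r in rows] for v in candidates}
    real = {v: mutual_information(cols[v], y) for v in candidates}
    hits = Counter()
    for _ in range(n_rounds):
        best_shadow = 0.0
        for v in candidates:
            shadow = cols[v][:]
            rng.shuffle(shadow)  # shuffling breaks any link with the target
            best_shadow = max(best_shadow, mutual_information(shadow, y))
        for v in candidates:
            if real[v] > best_shadow:
                hits[v] += 1
    return [v for v in candidates if hits[v] > n_rounds / 2]


def impute_target(rows, target, predictors, rng=None):
    """Stage 2 (simplified stand-in for polytomous regression): draw each missing
    value from observed target values of complete rows with the same predictors."""
    rng = rng or random.Random(0)
    complete = [r for r in rows if r[target] is not None]
    pools = defaultdict(list)
    for r in complete:
        pools[tuple(r[p] for p in predictors)].append(r[target])
    fallback = [r[target] for r in complete]  # used when no matching cell exists
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None:
            pool = pools.get(tuple(r[p] for p in predictors), fallback)
            r[target] = rng.choice(pool)
        out.append(r)
    return out


# Synthetic data (hypothetical): y depends strongly on x1 and not on noise.
rng = random.Random(42)
levels = {"a": "low", "b": "mid", "c": "high"}
rows = []
for _ in range(300):
    x1 = rng.choice("abc")
    y = levels[x1] if rng.random() < 0.9 else rng.choice(["low", "mid", "high"])
    rows.append({"x1": x1, "noise": rng.choice("uvw"), "y": y})

# Hide about 20% of the target values completely at random.
truth = [r["y"] for r in rows]
for r in rows:
    if rng.random() < 0.2:
        r["y"] = None

complete_cases = [r for r in rows if r["y"] is not None]
selected = select_variables(complete_cases, "y", ["x1", "noise"], rng=rng)
imputed = impute_target(rows, "y", selected, rng=rng)

# Accuracy over the cells that were missing, in the spirit of the paper's measure.
missing_idx = [i for i, r in enumerate(rows) if r["y"] is None]
accuracy = sum(imputed[i]["y"] == truth[i] for i in missing_idx) / len(missing_idx)
print("selected:", selected, "accuracy on imputed cells:", round(accuracy, 3))
```

Restricting the second stage to the selected predictors keeps the conditioning cells well populated, which is one plausible reading of why variable selection could help when the candidate set is large; the sketch does not reproduce the paper's simulation design or results.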

Acknowledgement

We thank the two referees, the Associate Editor, and the Editor for their constructive and helpful suggestions, which led to substantial improvements in the revised version.
