A comparison of imputation methods using machine learning models

  • Heajung Suh (Department of Statistics, Ewha Womans University) ;
  • Jongwoo Song (Department of Statistics, Ewha Womans University)
  • Received: 2022.10.31
  • Reviewed: 2023.01.17
  • Published: 2023.05.31

Abstract

Handling missing values is essential to constructing a good prediction model. The easiest approach is to analyze only the complete cases, but this discards information in the data and can lead to invalid conclusions. Imputation is a technique that replaces missing data with alternative values obtained from information in the dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. More recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares these imputation techniques on datasets with mixed data types under various conditions, such as data size, missing ratio, and missing mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measurement applicable to both numerical and categorical variables. We believe this metric can help identify the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
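As a rough illustration of the contrast between complete-case analysis and imputation, the following minimal Python sketch applies both to a small, made-up numeric table. It is not the implementation used in the paper: the data, the function names `complete_cases` and `knn_impute`, and the choice of distance (root mean squared difference over the columns observed in both rows) are assumptions made for this example only. It does not cover categorical variables or the proposed IPM.

```python
import math


def complete_cases(rows):
    """Keep only the rows that have no missing (None) entries."""
    return [r for r in rows if all(v is not None for v in r)]


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration. The distance between two rows is
    the root mean squared difference over the columns observed in both.
    """
    imputed = [list(r) for r in rows]  # copy so the input stays unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor row must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy data (made up): two clusters of rows, with one missing value in each.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],
    [5.0, 6.0, None],
    [5.2, 6.1, 7.0],
    [0.9, 2.1, 2.9],
    [4.9, 5.9, 6.8],
]

kept = complete_cases(data)     # drops the two incomplete rows
filled = knn_impute(data, k=2)  # keeps all six rows, fills the gaps

print(len(kept), len(filled))
print(filled[1][1], filled[2][2])
```

Complete-case analysis discards two of the six rows here, while the nearest-neighbor fill keeps every row and borrows values from the most similar observed rows.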

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MIST) (No.2020S1A5C2A04092451).
