A comparison of imputation methods using machine learning models

Heajung Suh; Jongwoo Song

doi:10.29220/CSAM.2023.30.3.331

Communications for Statistical Applications and Methods

Volume 30, Issue 3 / Pages 331-341 / 2023 / 2287-7843 (pISSN) / 2383-4757 (eISSN)

The Korean Statistical Society

Heajung Suh (Department of Statistics, Ewha Womans University)
Jongwoo Song (Department of Statistics, Ewha Womans University)

Received: 2022.10.31
Reviewed: 2023.01.17
Published: 2023.05.31


Abstract

Handling missing values is essential to constructing a good prediction model. The easiest approach is to use only the complete cases, but this can discard information in the data and lead to invalid conclusions. Imputation is a technique that replaces missing data with alternative values obtained from information in the dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares these imputation techniques on datasets with mixed data types under various conditions, including data size, missing ratio, and missing mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measurement applicable to both numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
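The kind of comparison described in the abstract can be illustrated with a minimal Python sketch: hide a fraction of known values, impute them, and score the imputed cells against the truth. Everything here is an assumption made for illustration. The function names (`mask_mcar`, `impute_baseline`, `score`) are hypothetical, the toy dataset is invented, mean/mode imputation stands in for the methods the paper compares, and the two per-column scores are generic error measures, not the IPM proposed by the authors (which this abstract does not define).

```python
import math
import random
from collections import Counter
from statistics import mean


def mask_mcar(rows, ratio, seed=0):
    """Copy rows, setting each cell to None with probability `ratio` (MCAR)."""
    rng = random.Random(seed)
    return [{k: (None if rng.random() < ratio else v) for k, v in row.items()}
            for row in rows]


def impute_baseline(rows):
    """Fill numeric columns with the observed mean, categorical ones with the mode."""
    filled = [dict(row) for row in rows]
    for col in rows[0]:
        observed = [r[col] for r in rows if r[col] is not None]
        if not observed:
            continue  # nothing to learn from; leave the column as-is
        if isinstance(observed[0], (int, float)):
            fill = mean(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def score(truth, masked, imputed):
    """Per-column error on the masked cells only.

    Numeric columns: RMSE divided by the column's true range.
    Categorical columns: proportion of imputed labels that are wrong.
    """
    result = {}
    for col in truth[0]:
        idx = [i for i, r in enumerate(masked) if r[col] is None]
        if not idx:
            continue
        if isinstance(truth[0][col], (int, float)):
            values = [r[col] for r in truth]
            spread = (max(values) - min(values)) or 1.0
            sq = [(truth[i][col] - imputed[i][col]) ** 2 for i in idx]
            result[col] = math.sqrt(sum(sq) / len(sq)) / spread
        else:
            wrong = sum(truth[i][col] != imputed[i][col] for i in idx)
            result[col] = wrong / len(idx)
    return result


# Toy mixed-type dataset (invented for this example).
truth = [{"age": 20 + i, "income": 1000.0 + 50 * i, "group": "a" if i % 3 else "b"}
         for i in range(40)]
masked = mask_mcar(truth, ratio=0.3, seed=42)
imputed = impute_baseline(masked)
errors = score(truth, masked, imputed)
print(errors)
```

Swapping `impute_baseline` for another imputer and varying `ratio`, the number of rows, or the masking rule gives the grid of conditions the abstract mentions. Reporting numeric and categorical errors separately, as this sketch does, is exactly the situation a unified measure is meant to simplify.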

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MIST) (No.2020S1A5C2A04092451).
