
Development of a Machine Learning Model for Imputing Time Series Data with Massive Missing Values


  • Bangwon Ko (Department of Statistics and Actuarial Science, Soongsil University) ;
  • Yong Hee Han (Department of Entrepreneurship and Small Business, Soongsil University)
  • Received : 2024.06.10
  • Accepted : 2024.06.24
  • Published : 2024.06.29

Abstract

In this study, we compared and analyzed several missing-data handling methods in order to build a machine learning model that can effectively analyze and predict time series data with a high proportion of missing values. For this purpose, Probabilistic Sequential Matrix Factorization (PSMF), MissForest, and Imputation By Feature Importance (IBFI) were applied as imputation methods, and prediction performance was then evaluated using LightGBM, XGBoost, and Explainable Boosting Machine (EBM) models. The results showed that, among the imputation methods, MissForest and IBFI performed best because they capture nonlinear data patterns, and that the XGBoost and EBM models outperformed LightGBM. This study highlights the importance of combining nonlinear imputation methods with machine learning models when analyzing and predicting time series data with a high proportion of missing values, and it provides a practical methodology for doing so.
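The sketch below is not the authors' implementation; it only illustrates the kind of two-stage pipeline the abstract describes. It imputes heavily masked features with a MissForest-style iterative random-forest imputer (scikit-learn's IterativeImputer used as a stand-in for the MissForest package) and then fits one of the boosting models (XGBoost) on the imputed data. The synthetic series, column names, 40% masking rate, and hyperparameters are all illustrative assumptions, not values from the study.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

# Synthetic hourly series with nonlinear structure (a stand-in for the real data).
n = 2000
t = np.arange(n)
X = pd.DataFrame({
    "temp": 15 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, n),
    "humidity": 60 + 20 * np.cos(2 * np.pi * t / 24) + rng.normal(0, 2, n),
    "wind": rng.gamma(2.0, 1.5, n),
})
y = 0.5 * X["temp"] ** 2 - 0.3 * X["humidity"] + 5 * np.log1p(X["wind"]) + rng.normal(0, 1, n)

# Remove 40% of the feature values at random to mimic a high missing-value ratio.
missing_mask = rng.random(X.shape) < 0.4
X_missing = X.mask(missing_mask)

# MissForest-style imputation: iterative imputation with random-forest estimators.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = pd.DataFrame(imputer.fit_transform(X_missing), columns=X.columns)

# Fit and evaluate one of the boosting models on the imputed features.
X_tr, X_te, y_tr, y_te = train_test_split(X_imputed, y, test_size=0.2, shuffle=False)
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"Test RMSE after MissForest-style imputation: {rmse:.3f}")
```

The same imputed features can be passed to LightGBM or an EBM (interpret's ExplainableBoostingRegressor) to reproduce the kind of model comparison the abstract reports.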


Keywords

Acknowledgement

This work was supported by the Soongsil University Research Fund (Convergence Research) of 2020.

References

  1. Little, Roderick J. A., Donald B. Rubin. "Statistical Analysis with Missing Data." Vol. 793, John Wiley & Sons, 2019.
  2. Van Buuren, Stef, Karin Groothuis-Oudshoorn. "mice: Multivariate imputation by chained equations in R." Journal of Statistical Software, 45, 1-67, 2011.
  3. Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "LightGBM: A highly efficient gradient boosting decision tree." Advances in Neural Information Processing Systems, 3146-3154, 2017.
  4. Chen, Tianqi, Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794, 2016.
  5. Lou, Yin, Rich Caruana, Johannes Gehrke. "Intelligible models for classification and regression." Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150-158, 2012. 
  6. Akyildiz, Omer Deniz, Gerrit van den Burg, Theodoros Damoulas, Mark Steel. "Probabilistic sequential matrix factorization." arXiv preprint, arXiv:1910.03906, 2019.
  7. Stekhoven, Daniel J., Peter Buhlmann. "MissForest: non-parametric missing value imputation for mixed-type data." Bioinformatics, 28-1, 112-118, 2012. 
  8. Mir, Adil Aslam, Kimberlee Jane Kearfott, Fatih Vehbi Celebi, Muhammad Rafique. "Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data." PLoS ONE, 17-1, e0262131, 2022.
  9. Air Korea, https://www.airkorea.or.kr/web/last_amb_hour_data?pMENU_NO=123 
  10. PhysioNet, https://archive.physionet.org/mimic2