
Development of a Machine Learning Model for Imputing Time Series Data with Massive Missing Values


  • Bangwon Ko (Department of Statistics and Actuarial Science, Soongsil University) ;
  • Yong Hee Han (Department of Entrepreneurship and Small Business, Soongsil University)
  • Received : 2024.06.10
  • Accepted : 2024.06.24
  • Published : 2024.06.29

Abstract

In this study, we compared and analyzed several missing-data handling methods in order to build a machine learning model that can effectively analyze and predict time series data with a high proportion of missing values. For this purpose, Probabilistic Sequential Matrix Factorization (PSMF), MissForest, and Imputation By Feature Importance (IBFI) were applied as imputation methods, and prediction performance was then evaluated using LightGBM, XGBoost, and Explainable Boosting Machine (EBM) models. The results showed that, among the imputation methods, MissForest and IBFI performed best because they capture nonlinear data patterns, and that the XGBoost and EBM models outperformed LightGBM. This study highlights the importance of combining nonlinear imputation methods with machine learning models when analyzing and predicting time series data with a high proportion of missing values, and it provides a practical methodology for doing so.
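The sketch below is not the authors' implementation; it only illustrates the kind of two-stage pipeline the abstract describes. It imputes heavily masked features with a MissForest-style iterative random-forest imputer (scikit-learn's IterativeImputer used as a stand-in for the MissForest package) and then fits one of the boosting models (XGBoost) on the imputed data. The synthetic series, column names, 40% masking rate, and hyperparameters are all illustrative assumptions, not values from the study.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

# Synthetic hourly series with nonlinear structure (a stand-in for the real data).
n = 2000
t = np.arange(n)
X = pd.DataFrame({
    "temp": 15 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, n),
    "humidity": 60 + 20 * np.cos(2 * np.pi * t / 24) + rng.normal(0, 2, n),
    "wind": rng.gamma(2.0, 1.5, n),
})
y = 0.5 * X["temp"] ** 2 - 0.3 * X["humidity"] + 5 * np.log1p(X["wind"]) + rng.normal(0, 1, n)

# Remove 40% of the feature values at random to mimic a high missing-value ratio.
missing_mask = rng.random(X.shape) < 0.4
X_missing = X.mask(missing_mask)

# MissForest-style imputation: iterative imputation with random-forest estimators.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = pd.DataFrame(imputer.fit_transform(X_missing), columns=X.columns)

# Fit and evaluate one of the boosting models on the imputed features.
X_tr, X_te, y_tr, y_te = train_test_split(X_imputed, y, test_size=0.2, shuffle=False)
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"Test RMSE after MissForest-style imputation: {rmse:.3f}")
```

The same imputed features can be passed to LightGBM or an EBM (interpret's ExplainableBoostingRegressor) to reproduce the kind of model comparison the abstract reports.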


Keywords

Acknowledgement

This work was supported by the Soongsil University Research Fund (Convergence Research) of 2020.

References

  1. Little, Roderick J. A., Donald B. Rubin. "Statistical Analysis with Missing Data." Vol. 793, John Wiley & Sons, 2019.
  2. Van Buuren, Stef, Karin Groothuis-Oudshoorn. "mice: Multivariate imputation by chained equations in R." Journal of Statistical Software, 45, 1-67, 2011.
  3. Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "LightGBM: A highly efficient gradient boosting decision tree." Advances in Neural Information Processing Systems, 3146-3154, 2017.
  4. Chen, Tianqi, Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794, 2016.
  5. Lou, Yin, Rich Caruana, Johannes Gehrke. "Intelligible models for classification and regression." Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150-158, 2012. 
  6. Akyildiz, Omer Deniz, Gerrit van den Burg, Theodoros Damoulas, Mark Steel. "Probabilistic sequential matrix factorization." arXiv preprint, arXiv:1910.03906, 2019.
  7. Stekhoven, Daniel J., Peter Buhlmann. "MissForest: non-parametric missing value imputation for mixed-type data." Bioinformatics, 28-1, 112-118, 2012. 
  8. Mir, Adil Aslam, Kimberlee Jane Kearfott, Fatih Vehbi Celebi, Muhammad Rafique. "Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data." PLoS ONE, 17-1, e0262131, 2022.
  9. Air Korea, https://www.airkorea.or.kr/web/last_amb_hour_data?pMENU_NO=123 
  10. PhysioNet, https://archive.physionet.org/mimic2