Prediction of spatio-temporal AQI data

  • KyeongEun Kim (Department of Statistics, Seoul National University)
  • MiRu Ma (Department of Statistics, Sungkyunkwan University)
  • KyeongWon Lee (Department of Statistics, Seoul National University)
  • Received : 2022.07.21
  • Accepted : 2023.01.15
  • Published : 2023.03.31

Abstract

With the rapid growth of the economy and of fossil fuel consumption, the concentration of air pollutants has increased significantly, and air pollution is no longer confined to small areas. We conduct a statistical analysis of real air quality data covering the whole of South Korea, using R and Python. Covariates include SO2, CO, O3, NO2, PM10, precipitation, wind speed, wind direction, vapor pressure, local pressure, sea level pressure, temperature, humidity, and others. The main goal of this paper is to predict spatio-temporal air quality index (AQI) data. Observations in large spatio-temporal datasets such as AQI data are correlated both spatially and temporally, and prediction or forecasting that accounts for this dependence structure is often computationally infeasible. Because the likelihood function of a spatio-temporal model can be complicated, specialized modeling approaches are useful for obtaining statistically reliable predictions. In this paper, we consider three methods for these large spatio-temporal AQI data. First, we propose a random effects model with spatio-temporal basis functions, a classical statistical approach. Next, we apply a neural network model, a deep learning method based on artificial neural networks. Finally, we introduce a random forest model, a machine learning method that is closer to computational science. We then compare the forecasting performance of the three methods in terms of predictive diagnostics. In the analysis, all three methods predicted normal levels of PM2.5 well, but their performance appears to be poor at extreme values.
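The comparison by predictive diagnostics can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the choice of RMSE and MAE as diagnostics, the threshold of 75 separating "normal" from "extreme" PM2.5 values, and all numbers are assumptions made purely for illustration. The sketch computes each diagnostic separately for the two regimes, which is one way to expose a model that does well at normal levels but poorly at extremes.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error of two equal-length, non-empty sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def diagnostics_by_regime(observed, predicted, threshold):
    """Compute RMSE and MAE separately for observations at or below the
    threshold ("normal") and above it ("extreme")."""
    groups = {"normal": ([], []), "extreme": ([], [])}
    for o, p in zip(observed, predicted):
        key = "extreme" if o > threshold else "normal"
        groups[key][0].append(o)
        groups[key][1].append(p)
    return {
        name: {"n": len(obs), "rmse": rmse(obs, pred), "mae": mae(obs, pred)}
        for name, (obs, pred) in groups.items()
        if obs  # skip a regime with no observations
    }


# Made-up values for illustration only; NOT data or results from the paper.
observed = [12.0, 18.0, 25.0, 30.0, 95.0, 120.0]
predictions = {
    "model_a": [13.0, 17.0, 24.0, 32.0, 70.0, 85.0],
    "model_b": [11.0, 20.0, 27.0, 29.0, 60.0, 90.0],
}
threshold = 75.0  # hypothetical cut-off between normal and extreme PM2.5

results = {
    name: diagnostics_by_regime(observed, pred, threshold)
    for name, pred in predictions.items()
}
for name, by_regime in results.items():
    for regime, stats in by_regime.items():
        print(f"{name} {regime}: n={stats['n']} "
              f"rmse={stats['rmse']:.2f} mae={stats['mae']:.2f}")
```

Splitting on the observed value rather than the predicted one keeps the regimes identical across models, so the per-regime errors are directly comparable.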

Keywords

Acknowledgement

This research was supported by the Basic Research Program through the National Research Foundation of Korea (NRF) funded by the MSIT (NRF-2020R1A4A1018207).
