Development of ensemble machine learning model considering the characteristics of input variables and the interpretation of model performance using explainable artificial intelligence

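A study on building an ensemble machine learning model that considers the characteristics of water quality data and on interpreting the model results using explainable artificial intelligence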

  • Park, Jungsu (박정수) (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2022.08.12
  • Accepted : 2022.08.15
  • Published : 2022.08.15

Abstract

The prediction of algal blooms is an important field of study in algal bloom management, and chlorophyll-a concentration (Chl-a) is commonly used to represent the status of an algal bloom. In recent years, advanced machine learning algorithms have been increasingly used for the prediction of algal blooms. In this study, XGBoost (XGB), an ensemble machine learning algorithm, was used to develop a model to predict Chl-a in a reservoir. Daily observations of water quality and climate data were used for training and testing the model. In the first step of the study, the input variables were clustered into two groups (low- and high-value groups) based on the observed values of water temperature (TEMP), total organic carbon concentration (TOC), total nitrogen concentration (TN), and total phosphorus concentration (TP). For each of the four water quality items, two XGB models were developed using only the data in each clustered group (Model 1). The results were compared with the predictions of an XGB model developed using the entire data set before clustering (Model 2). Model performance was evaluated using three indices, including the root mean squared error-observation standard deviation ratio (RSR). Model 1 improved performance for TEMP, TN, and TP, with RSR values of 0.503, 0.477, and 0.493, respectively, compared with an RSR of 0.521 for Model 2. On the other hand, Model 2 performed better than Model 1 for TOC, for which the RSR of Model 1 was 0.532. Explainable artificial intelligence (XAI) is an active field of machine learning research. Shapley value analysis, a novel XAI algorithm, was also used for the quantitative interpretation of the performance of the XGB model developed in this study.
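
The abstract compares models by RSR without stating the formula. As a minimal sketch, assuming the commonly used definition (root mean squared error divided by the standard deviation of the observations), the index could be computed as shown below. The function name `rsr` and the sample values are hypothetical and are not taken from the study; a lower RSR indicates a better fit, 0 means a perfect prediction, and 1 means the model does no better than predicting the observed mean.

```python
import math


def rsr(observed, simulated):
    """RMSE-observation standard deviation ratio (assumed definition).

    Computed as sqrt(sum((obs - sim)^2)) / sqrt(sum((obs - mean_obs)^2)),
    which equals RMSE divided by the standard deviation of the observations
    because the sample-size terms cancel.
    """
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    if len(observed) < 2:
        raise ValueError("at least two observations are required")

    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    squared_deviation = sum((o - mean_obs) ** 2 for o in observed)
    if squared_deviation == 0:
        raise ValueError("observations have zero variance; RSR is undefined")
    return math.sqrt(squared_error / squared_deviation)


# Hypothetical Chl-a values, for illustration only (not data from the study).
observed = [2.0, 4.0, 6.0, 8.0]
perfect_prediction = list(observed)   # reproduces every observation
mean_prediction = [5.0] * 4           # always predicts the observed mean

print(rsr(observed, perfect_prediction))
print(rsr(observed, mean_prediction))
```

Read this way, the reported values (for example, 0.477 for the TN-based Model 1 against 0.521 for Model 2) mean that the prediction error of the clustered model is a smaller fraction of the natural variability of the observed Chl-a.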
