Evaluation of Multi-classification Model Performance for Algal Bloom Prediction Using CatBoost

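Korean title (translated): A Study on Performance Evaluation of an Algal Bloom Prediction Model Using the CatBoost Machine Learning Multi-class Classification Algorithm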

  • Juneoh Kim (김준오), Department of Civil and Environmental Engineering, Hanbat National University
  • Jungsu Park (박정수), Department of Civil and Environmental Engineering, Hanbat National University
  • Received : 2022.11.22
  • Accepted : 2022.12.22
  • Published : 2023.01.30

Abstract

Monitoring and predicting water quality are essential for effective river pollution prevention and water quality management. In this study, a multi-class classification model was developed with CatBoost, a novel ensemble machine learning algorithm, to predict chlorophyll-a (Chl-a) levels in rivers. The model was built from hourly field monitoring data collected from January 1 to December 31, 2015. Chl-a was divided into three classes: class 1 (Chl-a ≤ 10 µg/L), class 2 (10 < Chl-a ≤ 50 µg/L), and class 3 (Chl-a > 50 µg/L), with 27,192, 11,031, and 511 observations used for model training, respectively. The macro averages of precision, recall, and F1-score over the three classes were 0.58, 0.58, and 0.58, respectively, while the weighted averages were 0.89, 0.90, and 0.89. The model performed relatively poorly for class 3, which had far fewer observations than the other two classes. This class imbalance was addressed with the synthetic minority over-sampling technique (SMOTE), which balanced the training data at 26,868 observations per class. After SMOTE was applied, the macro averages of precision, recall, and F1-score were 0.58, 0.70, and 0.59, respectively, and the weighted averages were 0.88, 0.84, and 0.86; model performance thus improved, most clearly in macro-averaged recall.
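The gap between the macro and weighted averages reported in the abstract follows from how the two averages treat class size. The sketch below is a minimal illustration, not the authors' code: it applies the abstract's Chl-a thresholds and computes both averages from scratch on a small, made-up set of observations and predictions. The function names and the toy values are assumptions introduced here for illustration only.

```python
from collections import Counter


def chl_a_class(chl_a_ug_per_l):
    """Map a Chl-a concentration (ug/L) to class 1, 2, or 3 (abstract's thresholds)."""
    if chl_a_ug_per_l <= 10:
        return 1
    if chl_a_ug_per_l <= 50:
        return 2
    return 3


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_and_weighted(scores, y_true):
    """Macro average: every class counts equally.
    Weighted average: each class counts in proportion to its number of true samples."""
    support = Counter(y_true)
    n_samples = len(y_true)
    n_classes = len(scores)
    macro = tuple(
        sum(s[i] for s in scores.values()) / n_classes for i in range(3)
    )
    weighted = tuple(
        sum(scores[c][i] * support[c] for c in scores) / n_samples
        for i in range(3)
    )
    return macro, weighted


# Toy data (hypothetical, NOT from the study): observed Chl-a in ug/L and
# the classes predicted by some imaginary model.
observed = [3, 8, 5, 9, 2, 7, 20, 35, 12, 60]
y_true = [chl_a_class(v) for v in observed]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 1, 2]

scores = per_class_scores(y_true, y_pred, labels=[1, 2, 3])
macro, weighted = macro_and_weighted(scores, y_true)

print("macro    (precision, recall, F1):", tuple(round(v, 2) for v in macro))
print("weighted (precision, recall, F1):", tuple(round(v, 2) for v in weighted))
```

In this toy case the single class 3 observation is missed, which pulls the macro averages down sharply while the weighted averages stay high because class 1 dominates the sample. This mirrors the pattern described in the abstract, where the rare class 3 lowers the macro averages far more than the weighted ones.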

Acknowledgement

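1. This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2022R1F1A1065518) (50%).
2. This work was supported by the Korea Environmental Industry & Technology Institute (KEITI) through the Environmental Facility Disaster Response Technology Development Program, funded by the Korea Ministry of Environment (2022002870001) (50%).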
