The Effect of Input Variables Clustering on the Characteristics of Ensemble Machine Learning Model for Water Quality Prediction

  • Park, Jungsu (Department of Civil and Environmental Engineering, Hanbat National University)
  • Received : 2021.08.17
  • Accepted : 2021.09.23
  • Published : 2021.09.30

Abstract

Water quality prediction is essential for the proper management of water supply systems. Increased suspended sediment concentration (SSC) affects water supply systems in various ways, including higher treatment costs, and there have consequently been various efforts to develop models for predicting SSC. However, SSC is affected by both the natural and the anthropogenic environment, which makes it challenging to predict. Recently, advanced machine learning models have increasingly been used for water quality prediction. This study developed an ensemble machine learning model to predict SSC using the XGBoost (XGB) algorithm. The discharge (Q) and SSC observed at two field monitoring stations were used to develop the model. The input variables were clustered into two groups, with low and high ranges of Q, using the k-means clustering algorithm. Each group of data was then used separately to optimize XGB (Model 1). The model performance was compared with that of an XGB model using the entire data set (Model 2). The models were evaluated by the root mean squared error-observation standard deviation ratio (RSR) and the root mean squared error. For Model 2, the RSR values were 0.51 and 0.57 at the two monitoring stations, respectively; for Model 1, the RSR improved to 0.46 and 0.55, respectively.
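
The clustering and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: it covers only the two-group k-means split on Q and the RSR and RMSE metrics, and omits XGB fitting entirely. The function names (`rmse`, `rsr`, `kmeans_two_groups`), the min/max centroid initialization, and the discharge values are assumptions introduced for illustration. RSR is taken here as RMSE divided by the standard deviation of the observations.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def rsr(observed, predicted):
    """RMSE divided by the standard deviation of the observations.

    Written as a ratio of root sums of squares, so the sample size cancels.
    """
    mean_obs = sum(observed) / len(observed)
    err = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)))
    spread = math.sqrt(sum((o - mean_obs) ** 2 for o in observed))
    return err / spread


def kmeans_two_groups(q, max_iter=100):
    """Split 1-D discharge values into low (0) and high (1) groups, k = 2.

    Assumption: centroids start at min(q) and max(q) so the result is
    deterministic; library implementations use other initializations.
    """
    low_c, high_c = min(q), max(q)
    if low_c == high_c:  # all values identical: a single group
        return [0] * len(q), (low_c, high_c)
    for _ in range(max_iter):
        low = [v for v in q if abs(v - low_c) <= abs(v - high_c)]
        high = [v for v in q if abs(v - low_c) > abs(v - high_c)]
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_c, high_c):
            break  # centroids stopped moving
        low_c, high_c = new_low, new_high
    labels = [0 if abs(v - low_c) <= abs(v - high_c) else 1 for v in q]
    return labels, (low_c, high_c)


# Synthetic discharge values, for illustration only (not data from the study).
q_values = [1.0, 2.0, 3.0, 50.0, 60.0, 70.0]
labels, centroids = kmeans_two_groups(q_values)
print(labels, centroids)
```

In a workflow like the one the abstract describes, the rows labelled 0 and 1 would each be passed to a separate regressor (the Model 1 setup), and `rsr` would be computed on held-out observations to compare against a single model fitted to all rows (the Model 2 setup).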

Acknowledgement

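This research was supported in 2021 by a grant (21UGCP-B157942-02) from the Korea Agency for Infrastructure Technology Advancement, funded by the government (Ministry of Land, Infrastructure and Transport).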
