The Effect of Input Variables Clustering on the Characteristics of Ensemble Machine Learning Model for Water Quality Prediction

Park, Jungsu;

doi:10.15681/KSWE.2021.37.5.335

Journal of Korean Society on Water Environment (한국물환경학회지)

Volume 37 Issue 5 / Pages 335-343 / 2021 / 2289-0971 (pISSN) / 2289-098X (eISSN)

Korean Society on Water Environment (한국물환경학회)


입력자료 군집화에 따른 앙상블 머신러닝 모형의 수질예측 특성 연구

Park, Jungsu (Department of Civil and Environmental Engineering, Hanbat National University)

박정수 (국립한밭대학교 건설환경공학과)

Received : 2021.08.17
Accepted : 2021.09.23
Published : 2021.09.30


Abstract

Water quality prediction is essential for the proper management of water supply systems. Increased suspended sediment concentration (SSC) affects water supply systems in various ways, such as increased treatment cost, and there have consequently been various efforts to develop models for predicting SSC. However, SSC is affected by both the natural and anthropogenic environment, which makes it challenging to predict. Recently, advanced machine learning models have increasingly been used for water quality prediction. This study developed an ensemble machine learning model to predict SSC using the XGBoost (XGB) algorithm. The observed discharge (Q) and SSC at two field monitoring stations were used to develop the model. The input variables were clustered into two groups with low and high ranges of Q using the k-means clustering algorithm, and each group of data was then used separately to optimize XGB (Model 1). The model performance was compared with that of the XGB model trained on the entire data set (Model 2). The models were evaluated by the root mean squared error-observation standard deviation ratio (RSR) and the root mean squared error. The RSR values for Model 2 were 0.51 and 0.57 at the two monitoring stations, respectively, while model performance improved to RSR values of 0.46 and 0.55, respectively, for Model 1.
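
The workflow summarized in the abstract (split the data into low-Q and high-Q groups with k-means, fit one model per group, and compare RSR against a single model fitted to all data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the study's code or data: the Q and SSC values are synthetic, the function names are hypothetical, the k-means routine is a plain two-centroid Lloyd iteration on Q alone, and a least-squares line stands in for the XGB regressor so that the example runs with the standard library only. RSR is computed as RMSE divided by the standard deviation of the observations, following the usual definition in Moriasi et al. (2007).

```python
import math

def rsr(observed, predicted):
    """RMSE-observation standard deviation ratio (lower is better, 0 is perfect)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return math.sqrt(sse) / math.sqrt(sst)

def kmeans_two_groups(values, iterations=100):
    """Two-centroid Lloyd iteration on scalar data; label 0 = low group, 1 = high group."""
    low, high = min(values), max(values)  # deterministic initial centroids
    labels = []
    for _ in range(iterations):
        new_labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not group1:
            break  # all values identical; nothing to split
        low = sum(group0) / len(group0)
        high = sum(group1) / len(group1)
    return labels

def fit_line(x, y):
    """Least-squares line; a stand-in for the XGB regressor used in the study."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x

# Synthetic discharge (Q) and SSC values, invented for this sketch only.
q = [1, 2, 3, 4, 5, 50, 60, 70, 80, 90]
ssc = [2, 3, 5, 6, 8, 400, 520, 610, 750, 880]

labels = kmeans_two_groups(q)

# "Model 2" analogue: one model fitted to the entire data set.
slope_all, intercept_all = fit_line(q, ssc)
pred_single = [slope_all * qi + intercept_all for qi in q]

# "Model 1" analogue: one model per Q cluster, each sample routed by its label.
params = {}
for group in (0, 1):
    xs = [qi for qi, lab in zip(q, labels) if lab == group]
    ys = [si for si, lab in zip(ssc, labels) if lab == group]
    params[group] = fit_line(xs, ys)
pred_clustered = [params[lab][0] * qi + params[lab][1] for qi, lab in zip(q, labels)]

rsr_single = rsr(ssc, pred_single)
rsr_clustered = rsr(ssc, pred_clustered)
print(f"RSR, single model: {rsr_single:.3f}")
print(f"RSR, per-cluster models: {rsr_clustered:.3f}")
```

Note that this sketch scores both variants on the same data they were fitted to, so the per-cluster fit cannot score worse here; a real comparison would evaluate on held-out observations, and the values printed above say nothing about the RSR figures reported in the abstract.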

Keywords

Acknowledgement

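This research was supported by a grant from the Korea Agency for Infrastructure Technology Advancement, funded by the Korean government (Ministry of Land, Infrastructure and Transport) in 2021 (21UGCP-B157942-02).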

