http://dx.doi.org/10.15681/KSWE.2021.37.5.335

The Effect of Input Variables Clustering on the Characteristics of Ensemble Machine Learning Model for Water Quality Prediction  

Park, Jungsu (Department of Civil and Environmental Eng, Hanbat National University)
Abstract
Water quality prediction is essential for the proper management of water supply systems. Increased suspended sediment concentration (SSC) affects water supply systems in various ways, such as increased treatment cost; consequently, there have been various efforts to develop models for predicting SSC. However, SSC is affected by both the natural and anthropogenic environment, which makes it challenging to predict. Recently, advanced machine learning models have increasingly been used for water quality prediction. This study developed an ensemble machine learning model to predict SSC using the XGBoost (XGB) algorithm. The observed discharge (Q) and SSC at two field monitoring stations were used to develop the model. The input variables were clustered into two groups with low and high ranges of Q using the k-means clustering algorithm. Each group of data was then used separately to optimize XGB (Model 1). The model performance was compared with that of the XGB model trained on the entire data set (Model 2). The models were evaluated using the root mean squared error-observation standard deviation ratio (RSR) and the root mean squared error. For Model 2, the RSR values were 0.51 and 0.57 at the two monitoring stations, respectively, while for Model 1 the performance improved to RSR values of 0.46 and 0.55, respectively.
Keywords
Clustering; Ensemble machine learning; Gradient boosting decision tree; Water quality prediction; Water supply system; XGBoost;
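The workflow summarized in the abstract (split the data into low-Q and high-Q groups with k-means, fit one model per group, then compare against a single model using RSR and RMSE) can be illustrated with a minimal sketch. It is not the study's implementation. It assumes a 1-D k-means with k = 2 initialized at the minimum and maximum of Q, and it uses an ordinary least-squares line as a simple stand-in for the XGBoost regressor so that it runs with the standard library only. All function names and the toy numbers are hypothetical, and the fit is evaluated in-sample with no train/test split.

```python
import math
from statistics import mean, pstdev


def split_low_high(q, iters=100):
    """1-D k-means with k=2 (Lloyd's algorithm); label 0 = low Q, 1 = high Q.

    Assumption: centers start at min(q) and max(q), which is not
    necessarily the initialization used in the study.
    """
    centers = [min(q), max(q)]
    labels = [0] * len(q)
    for _ in range(iters):
        # Assign each value to the nearest center (ties go to the low group).
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in q]
        new_centers = []
        for g in (0, 1):
            members = [v for v, lab in zip(q, labels) if lab == g]
            new_centers.append(mean(members) if members else centers[g])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def fit_line(x, y):
    """Least-squares line; a stand-in for the XGBoost regressor (assumption)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda v: intercept + slope * v


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(obs, pred)))


def rsr(obs, pred):
    """RMSE divided by the standard deviation of the observations."""
    return rmse(obs, pred) / pstdev(obs)


def predict_clustered(q, ssc):
    """One model per Q cluster (the 'Model 1' idea), evaluated in-sample."""
    labels = split_low_high(q)
    models = {}
    for g in set(labels):
        xs = [x for x, lab in zip(q, labels) if lab == g]
        ys = [y for y, lab in zip(ssc, labels) if lab == g]
        models[g] = fit_line(xs, ys)
    return [models[lab](x) for x, lab in zip(q, labels)]


# Synthetic toy data for illustration only; not from the monitoring stations.
q = [1, 2, 3, 4, 50, 60, 70, 80]
ssc = [2, 4, 6, 8, 500, 700, 900, 1100]

single_model = fit_line(q, ssc)              # the 'Model 2' idea: all data at once
pred_single = [single_model(x) for x in q]
pred_clustered = predict_clustered(q, ssc)

print("RSR, single model:   ", rsr(ssc, pred_single))
print("RSR, clustered model:", rsr(ssc, pred_clustered))
```

In practice the stand-in `fit_line` would be replaced by a tuned gradient boosting regressor for each group, and both models would be scored on held-out data rather than on the data used for fitting.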