http://dx.doi.org/10.9719/EEG.2019.52.2.199

Study on the Effect of Training Data Sampling Strategy on the Accuracy of the Landslide Susceptibility Analysis Using Random Forest Method  

Kang, Kyoung-Hee (Dept. of Geoinformation Engineering, Sejong University)
Park, Hyuck-Jin (Dept. of Geoinformation Engineering, Sejong University)
Publication Information
Economic and Environmental Geology / v.52, no.2, 2019, pp. 199-212
Abstract
In machine learning, the sampling strategy used for the training data affects the performance of the prediction model, including its generalization ability as well as its prediction accuracy. In landslide susceptibility analysis in particular, data sampling is an essential step in building the training data because non-landslide points far outnumber landslide points. However, previous studies did not consider different sampling methods for the training data; they simply selected the training data at random. Therefore, in this study the authors proposed several different sampling methods and assessed the effect of the training data sampling strategy on landslide susceptibility analysis. To this end, a total of six scenarios were set up based on the sampling strategies for landslide points and non-landslide points. A Random Forest model was then trained for each of the six scenarios, and the attribute importance of each input variable was evaluated. Subsequently, landslide susceptibility maps were produced using the input variables and their attribute importances. The AUC values of the landslide susceptibility maps obtained from the six sampling strategies showed high prediction rates, ranging from 70% to 80%. This indicates that the Random Forest technique shows appropriate predictive performance and that the attribute importance obtained from Random Forest can be used as the weight of the landslide conditioning factors in the susceptibility analysis. In addition, the results obtained with specific sampling strategies for the training data show higher prediction accuracy than those obtained with the conventional random sampling method.
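As a rough illustration of two ideas mentioned in the abstract, the sketch below shows (a) the conventional random baseline, in which non-landslide points are randomly undersampled to balance them against the much smaller set of landslide points, and (b) how an AUC value can be computed from susceptibility scores. It is not the authors' implementation: the six scenarios are not specified in the abstract, and the function names `sample_balanced` and `auc`, the `ratio` parameter, and the use of plain Python lists of point identifiers are all assumptions made for this example.

```python
import random


def sample_balanced(landslide, non_landslide, ratio=1.0, seed=0):
    """Randomly undersample non-landslide points (hypothetical helper).

    Returns a list of (point, label) pairs containing every landslide
    point (label 1) and about ratio * len(landslide) randomly chosen
    non-landslide points (label 0).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    n_neg = min(len(non_landslide), int(round(ratio * len(landslide))))
    negatives = rng.sample(non_landslide, n_neg)
    return [(p, 1) for p in landslide] + [(p, 0) for p in negatives]


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen landslide point gets
    a higher score than a randomly chosen non-landslide point, with
    ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one point of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: 5 landslide cells versus 100 non-landslide cells.
landslide_cells = list(range(5))
non_landslide_cells = list(range(100, 200))
training_set = sample_balanced(landslide_cells, non_landslide_cells)
```

With `ratio=1.0` the training set holds equal numbers of landslide and non-landslide points; a larger `ratio` keeps more of the majority class. The pairwise AUC is quadratic in the number of points, which is fine for a sketch but would normally be replaced by a rank-based computation for full raster maps.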
Keywords
landslide susceptibility; machine learning; Random Forest; training data; sampling strategy