Study on the Effect of Training Data Sampling Strategy on the Accuracy of the Landslide Susceptibility Analysis Using Random Forest Method

Effect of Training Data Selection on Result Accuracy in Landslide Susceptibility Assessment Using the Random Forest Technique

  • 강경희 (Department of Geoinformation Engineering, Sejong University) ;
  • 박혁진 (Department of Geoinformation Engineering, Sejong University)
  • Received : 2019.02.01
  • Accepted : 2019.03.10
  • Published : 2019.04.28

Abstract

In machine learning, the sampling strategy used for the training data affects both the prediction accuracy and the generalization ability of the model. In landslide susceptibility analysis in particular, sampling is an essential step in preparing the training data because non-landslide points far outnumber landslide points. Previous studies, however, did not consider alternative sampling methods and simply selected the training data at random. This study therefore proposes several sampling methods and assesses how the sampling strategy for the training data affects landslide susceptibility analysis. Six scenarios were set up based on the sampling strategies for landslide and non-landslide points. A Random Forest model was trained on each of the six scenarios, and the attribute importance of each input variable was evaluated. Landslide susceptibility maps were then produced from the input variables and their attribute importances. The AUC values of the susceptibility maps obtained from the six sampling strategies showed high prediction rates, ranging from 70% to 80%. This indicates that the Random Forest technique provides appropriate predictive performance and that the attribute importances it yields can be used as weights for the landslide conditioning factors in susceptibility analysis. In addition, the results obtained with specific sampling strategies for the training data showed higher prediction accuracy than those obtained with the conventional random sampling method.
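
The abstract states that the attribute importances from Random Forest can serve as weights for the landslide conditioning factors. A minimal Python sketch of that weighting follows, assuming each factor has already been converted to a score in [0, 1] for a given cell. The scoring scheme, the function name, the factor names, and all numbers are illustrative assumptions, not the paper's values.

```python
from typing import Dict


def susceptibility_index(factor_scores: Dict[str, float],
                         importances: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores, using variable importances as weights."""
    missing = set(factor_scores) - set(importances)
    if missing:
        raise KeyError(f"no importance given for factors: {sorted(missing)}")
    return sum(importances[name] * score
               for name, score in factor_scores.items())


# Hypothetical importances (as a trained model might report them) and
# hypothetical normalized factor scores for a single raster cell.
importances = {"slope_angle": 0.5, "altitude": 0.3, "twi": 0.2}
cell_scores = {"slope_angle": 0.8, "altitude": 0.5, "twi": 0.1}

lsi = susceptibility_index(cell_scores, importances)
print(f"susceptibility index: {lsi:.2f}")
```

Computing this index for every cell and classifying the values would give a susceptibility map; how the paper scores each factor before weighting is not specified on this page.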

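In analyses using machine learning techniques, the sampling strategy for the training data strongly affects not only prediction accuracy but also generalization ability. In landslide susceptibility analysis in particular, data imbalance arises because information on non-landslide areas is far more abundant than information on landslide areas, so data sampling is essential when designing the training data for the analysis model. Most previous studies, however, merely selected non-landslide points at random so as to obtain a 1:1 ratio with the landslide data and did not apply any specific selection criteria. To determine how the sampling strategy for the training data affects the predictive performance of the model, this study therefore built six different scenarios according to the sampling criteria for landslide and non-landslide areas and used them to train a Random Forest model. In addition, the variable importance, one of the outputs of Random Forest, was multiplied by each landslide conditioning factor as a weight to compute landslide susceptibility index values. The index values were used to produce landslide susceptibility maps, and the accuracy of the resulting maps was compared. Regardless of the sampling method used for the training data, the landslide susceptibility analyses for both study areas achieved accuracies of 70-80%. This confirmed the applicability of Random Forest as a landslide susceptibility analysis technique and showed that the input-variable importances provided by the Random Forest model can be used as weights for the landslide conditioning factors. Furthermore, a comparison of accuracy across the training scenarios confirmed that designing the training data according to specific criteria can be expected to yield higher prediction accuracy than the conventional random selection method.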
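
The baseline described above, random selection of non-landslide points at a 1:1 ratio with the landslide points, can be sketched as follows. The function name, the (row, col) cell representation, and the optional `criterion` filter are hypothetical; the six scenarios actually used in the study are defined in Table 2 and are not reproduced here.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) of a raster cell; hypothetical representation


def sample_training_cells(
    landslide: Sequence[Cell],
    non_landslide: Sequence[Cell],
    seed: int = 0,
    criterion: Optional[Callable[[Cell], bool]] = None,
) -> List[Tuple[Cell, int]]:
    """Return labelled cells with as many non-landslide (0) as landslide (1) cells."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    # Optionally restrict the non-landslide pool before drawing from it.
    pool = [c for c in non_landslide if criterion is None or criterion(c)]
    if len(pool) < len(landslide):
        raise ValueError("not enough non-landslide cells for a 1:1 sample")
    negatives = rng.sample(pool, k=len(landslide))
    return [(c, 1) for c in landslide] + [(c, 0) for c in negatives]


# Toy 10 x 10 grid with three landslide cells (made-up data).
landslide_cells = [(0, 0), (1, 2), (3, 3)]
other_cells = [(r, c) for r in range(10) for c in range(10)
               if (r, c) not in landslide_cells]

# Purely random 1:1 selection, as in the conventional approach.
random_sample = sample_training_cells(landslide_cells, other_cells, seed=42)

# Same draw restricted to rows 5-9: a stand-in for a criterion-based scenario.
filtered_sample = sample_training_cells(
    landslide_cells, other_cells, seed=42, criterion=lambda cell: cell[0] >= 5
)
```

Keeping the selection rule as a separate predicate means the random baseline and any criterion-based scenario share the same sampling code, so differences in model accuracy can be attributed to the criterion alone.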

Keywords

Fig. 1. Building process of Random Forest.

Fig. 2. Study area: (a) location of Sangju area, (b) shaded relief map of Sangju area with landslide locations, (c) location of Jinbu area, (d) shaded relief map of Jinbu area with landslide locations.

Fig. 3. Thematic maps for Sangju area: (a) altitude, (b) slope angle, (c) aspect, (d) standard curvature, (e) plan curvature, (f) profile curvature, (g) TWI, (h) SPI, (i) geology, (j) distance from faults, (k) forest type, (l) timber age, (m) timber diameter, (n) forest density, (o) soil type.

Fig. 4. Thematic maps for Jinbu area: (a) altitude, (b) slope angle, (c) aspect, (d) standard curvature, (e) plan curvature, (f) profile curvature, (g) TWI, (h) SPI, (i) geology, (j) distance from faults, (k) forest type, (l) timber age, (m) timber diameter, (n) forest density, (o) soil type.

Fig. 5. Methodological flow chart of the research process.

Fig. 6. Landslide susceptibility maps for Sangju area: (a) Case 1, (b) Case 2, (c) Case 3, (d) Case 4, (e) Case 5, (f) Case 6.

Fig. 7. Landslide susceptibility maps for Jinbu area: (a) Case 1, (b) Case 2, (c) Case 3, (d) Case 4, (e) Case 5, (f) Case 6.

Fig. 8. Prediction rate curve for landslide susceptibility maps for Sangju area.

Fig. 9. Prediction rate curve for landslide susceptibility maps for Jinbu area.

Table 1. The landslide conditioning factors used in this study

Table 2. Training scenarios according to the sampling strategy

Table 3. Prediction rate of landslide susceptibility maps

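
The abstract reports the accuracy of the susceptibility maps as AUC values, and Figs. 8 and 9 show prediction rate curves. As a rough illustration, one common way to build such a curve is to rank cells from most to least susceptible and plot the cumulative share of validation landslides against the cumulative share of area. The sketch below follows that convention, which is an assumption here because the exact procedure is not given on this page. All names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def prediction_rate_curve(scores: Sequence[float],
                          is_landslide: Sequence[int]) -> List[Tuple[float, float]]:
    """Cumulative (area fraction, landslide fraction) pairs, most susceptible first."""
    total_cells = len(scores)
    total_slides = sum(is_landslide)
    if total_slides == 0:
        raise ValueError("at least one validation landslide is required")
    # Rank cell indices by susceptibility score, highest first.
    order = sorted(range(total_cells), key=lambda i: scores[i], reverse=True)
    curve = [(0.0, 0.0)]
    found = 0
    for rank, idx in enumerate(order, start=1):
        found += is_landslide[idx]
        curve.append((rank / total_cells, found / total_slides))
    return curve


def area_under_curve(curve: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Made-up susceptibility scores and validation labels (1 = landslide cell).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]

curve = prediction_rate_curve(scores, labels)
auc = area_under_curve(curve)
print(f"prediction rate (AUC): {auc:.2f}")  # approximately 0.70 for this toy data
```

An AUC of 0.5 corresponds to a map that ranks cells no better than chance, so the 70% to 80% range reported for the six scenarios indicates maps that concentrate landslides in the higher-ranked cells.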
