http://dx.doi.org/10.29220/CSAM.2022.29.5.561

Comparison of tree-based ensemble models for regression  

Park, Sangho (Department of Statistics, Sungkyunkwan University)
Kim, Chanmin (Department of Statistics, Sungkyunkwan University)
Publication Information
Communications for Statistical Applications and Methods / v.29, no.5, 2022, pp. 561-589
Abstract
When multiple classification and regression trees are combined, tree-based ensemble models such as random forest (RF) and Bayesian additive regression trees (BART) are produced. In this study, we compare the model structures and performance of various ensemble models in regression settings. RF learns on bootstrapped samples and selects a splitting variable from a random subset of predictors at each node. The BART model is specified as a sum of trees and is fit using a Bayesian backfitting algorithm. Through extensive simulation studies, the strengths and drawbacks of the two methods are investigated in the presence of missing data, high-dimensional data, or highly correlated data. With missing data, BART performs well in general, whereas RF provides adequate coverage. BART outperforms RF on high-dimensional, highly correlated data. However, in all of the scenarios considered, RF has a shorter computation time. The performance of the two methods is also compared using two real data sets that represent the aforementioned situations, and the same conclusions are reached.
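The two RF ingredients named in the abstract, bootstrap resampling of the training data and a random subset of candidate predictors at each split, can be illustrated with a minimal sketch. This is not code from the paper (which uses R packages such as randomForest and BART); it is a hypothetical pure-NumPy toy that bags depth-one regression trees (stumps) instead of full trees, so every function name and default here is an assumption for illustration only.

```python
import numpy as np

def fit_stump(X, y, feat_idx):
    # One-split regression stump: over the candidate features, pick the
    # (feature, threshold) pair minimizing within-node sum of squared errors.
    best, best_sse = None, np.inf
    for j in feat_idx:
        for t in np.unique(X[:, j])[:-1]:     # all but the max value are valid cuts
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse = sse
                best = (j, t, left.mean(), right.mean())
    return best

def rf_stumps(X, y, n_trees=50, mtry=None, rng=None):
    # Toy RF: each stump sees a bootstrap sample of the rows and a random
    # subset of mtry columns (default p/3, a common regression heuristic).
    rng = np.random.default_rng(rng)
    n, p = X.shape
    mtry = mtry or max(1, p // 3)
    stumps = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, n)                 # bootstrap sample (with replacement)
        feats = rng.choice(p, mtry, replace=False)   # random feature subset for this tree
        stumps.append(fit_stump(X[rows], y[rows], feats))
    return stumps

def predict(stumps, X):
    # Ensemble prediction: average the stump predictions, as RF does for regression.
    preds = np.empty((len(stumps), X.shape[0]))
    for i, (j, t, lmean, rmean) in enumerate(stumps):
        preds[i] = np.where(X[:, j] <= t, lmean, rmean)
    return preds.mean(axis=0)

# Toy demo: the response depends only on the first of three predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 2.0, -2.0) + 0.1 * rng.normal(size=200)
yhat = predict(rf_stumps(X, y, n_trees=30, mtry=2, rng=1), X)
print("ensemble MSE:", ((y - yhat) ** 2).mean())
```

BART differs from this scheme in both ingredients: the trees enter additively (each tree fits the residual left by the others via Bayesian backfitting) and regularization comes from priors over tree depth and leaf values rather than from bagging, which is why the two methods behave differently under missingness and high correlation.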
Keywords
Bayesian additive regression trees; random forest; missingness; high-dimensional data; multicollinearity;