http://dx.doi.org/10.7465/jkdi.2016.27.1.255

Tree size determination for classification ensemble  

Choi, Sung Hoon (Department of Applied Statistics, Yonsei University)
Kim, Hyunjoong (Department of Applied Statistics, Yonsei University)
Publication Information
Journal of the Korean Data and Information Science Society, v.27, no.1, 2016, pp. 255-264
Abstract
Classification is predictive modeling for a categorical target variable. Classification ensemble methods, which obtain better accuracy by combining multiple classifiers, have become a powerful machine learning and data mining paradigm. Well-known ensemble methodologies include boosting, bagging, and random forest. In this article, we assume that decision trees are used as the classifiers in the ensemble, and we hypothesize that tree size affects classification accuracy. To study how tree size influences accuracy, we performed experiments on twenty-eight data sets and compared the performance of the ensemble algorithms bagging, double-bagging, boosting, and random forest under different tree sizes.
Keywords
Bagging; boosting; classification; decision tree; double-bagging; ensemble; random forest;
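
The comparison described in the abstract can be sketched in a few lines of code. The snippet below is only an illustration, not the authors' experimental code: it varies tree size through the maximum number of terminal nodes and measures cross-validated accuracy for bagging, boosting (AdaBoost), and random forest using scikit-learn. The data set, the grid of tree sizes, and the 10-fold evaluation are assumptions made to keep the example self-contained; double-bagging (Hothorn and Lausen, 2003), which augments each bootstrap tree with discriminant variables estimated on the out-of-bag sample, has no off-the-shelf scikit-learn implementation and is omitted here.

# Illustrative sketch only: vary tree size (number of terminal nodes) and
# compare bagged, boosted, and random-forest ensembles of decision trees.
# The data set, size grid, and 10-fold CV are assumptions, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for leaves in (2, 4, 8, 16, 32):                    # candidate tree sizes
    base = DecisionTreeClassifier(max_leaf_nodes=leaves)
    ensembles = {
        # scikit-learn < 1.2 uses the keyword base_estimator instead of estimator
        "bagging":       BaggingClassifier(estimator=base, n_estimators=100),
        "boosting":      AdaBoostClassifier(estimator=base, n_estimators=100),
        "random forest": RandomForestClassifier(n_estimators=100,
                                                max_leaf_nodes=leaves),
    }
    for name, model in ensembles.items():
        acc = cross_val_score(model, X, y, cv=10).mean()    # mean CV accuracy
        print(f"{leaves:>3} terminal nodes  {name:<13}  accuracy = {acc:.3f}")

Controlling max_leaf_nodes is only one way to fix tree size; limiting the depth or the minimum node size would serve equally well as the knob being studied.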
Citations & Related Records
Times Cited By KSCI: 3
1 Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository. University of California, Irvine, School of Information and Computer Science, http://archive.ics.uci.edu/ml.
2 Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36, 105-139.
3 Breiman, L. (1996a). Bagging predictors. Machine Learning, 24, 123-140.
4 Breiman, L. (1996b). Out-of-bag estimation, Technical Report, Statistics Department, University of California Berkeley, Berkeley, California 94708, https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf.
5 Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
6 Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and regression trees, Chapman and Hall, New York.
7 Dietterich, T. (2000). Ensemble methods in machine learning, Springer, Berlin.
8 Freund, Y. and Schapire, R. (1996). Game theory, on-line prediction and boosting. Proceedings of the Ninth Annual Conference on Computational Learning Theory, 325-332.
9 Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.
10 Heinz, G., Peterson, L. J., Johnson, R. W. and Kerk, C. J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education, 11, http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html.
11 Hothorn, T. and Lausen, B. (2003). Double-bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition, 36, 1303-1309.
12 Kim, A., Kim, J. and Kim, H. (2012). The guideline for choosing the right-size of tree for boosting algorithm. Journal of the Korean Data and Information Science Society, 23, 949-959.
13 Kim, H. and Loh, W. Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589-604.
14 Kim, H. and Loh, W. Y. (2003). Classification trees with bivariate linear discriminant node models. Journal of Computational and Graphical Statistics, 12, 512-530.
15 Kwak, S. and Kim, H. (2014). Comparison of ensemble pruning methods using Lasso-bagging and WAVE-bagging. Journal of the Korean Data and Information Science Society, 25, 1371-1383.
16 Liaw, A. and Wiener, M. (2002). Classification and regression by random forests. R News, 2, 18-22.
17 Loh, W. Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 3, 1710-1737.
18 Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
19 Shim, J. and Hwang, C. H. (2014). Support vector quantile regression ensemble with bagging. Journal of the Korean Data and Information Science Society, 25, 677-684.
20 Schapire, R. E. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297-336.
21 Statlib. (2010). Datasets archive. Carnegie Mellon University, Department of Statistics, http://lib.stat.cmu.edu.
22 Terhune, J. M. (1994). Geographical variation of harp seal underwater vocalizations. Canadian Journal of Zoology, 72, 892-897.
23 Therneau, T. and Atkinson, E. (1997). An introduction to recursive partitioning using the RPART routines, Mayo Foundation, Rochester, Minnesota. http://eric.univ-lyon2.fr/~ricco/cours/didacticiels/r/longdocrpart.pdf.
24 Zhu, J., Zou, H., Rosset, S. and Hastie, T. (2009). Multi-class AdaBoost. Statistics and its Interface, 2, 349-360.