http://dx.doi.org/10.7465/jkdi.2016.27.1.255

Tree size determination for classification ensemble  

Choi, Sung Hoon (Department of Applied Statistics, Yonsei University)
Kim, Hyunjoong (Department of Applied Statistics, Yonsei University)
Publication Information
Journal of the Korean Data and Information Science Society, v.27, no.1, 2016, pp. 255-264
Abstract
Classification is predictive modeling for a categorical target variable. Classification ensemble methods, which obtain better accuracy by combining multiple classifiers, have become a powerful machine learning and data mining paradigm. Well-known ensemble methodologies include boosting, bagging, and random forest. In this article, we assume that decision trees are used as the classifiers in the ensemble, and we hypothesize that tree size affects classification accuracy. To study how tree size influences accuracy, we performed experiments on twenty-eight data sets and compared the performance of the ensemble algorithms bagging, double-bagging, boosting, and random forest under different tree sizes.
Keywords
Bagging; boosting; classification; decision tree; double-bagging; ensemble; random forest;
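
The comparison described in the abstract can be sketched in a few lines of code. The snippet below is only an illustration, not the authors' experimental code: it varies tree size through the maximum number of terminal nodes and measures cross-validated accuracy for bagging, boosting (AdaBoost), and random forest using scikit-learn. The data set, the grid of tree sizes, and the 10-fold evaluation are assumptions made to keep the example self-contained; double-bagging (Hothorn and Lausen, 2003), which augments each bootstrap tree with discriminant variables estimated on the out-of-bag sample, has no off-the-shelf scikit-learn implementation and is omitted here.

# Illustrative sketch only: vary tree size (number of terminal nodes) and
# compare bagged, boosted, and random-forest ensembles of decision trees.
# The data set, size grid, and 10-fold CV are assumptions, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for leaves in (2, 4, 8, 16, 32):                    # candidate tree sizes
    base = DecisionTreeClassifier(max_leaf_nodes=leaves)
    ensembles = {
        # scikit-learn < 1.2 uses the keyword base_estimator instead of estimator
        "bagging":       BaggingClassifier(estimator=base, n_estimators=100),
        "boosting":      AdaBoostClassifier(estimator=base, n_estimators=100),
        "random forest": RandomForestClassifier(n_estimators=100,
                                                max_leaf_nodes=leaves),
    }
    for name, model in ensembles.items():
        acc = cross_val_score(model, X, y, cv=10).mean()    # mean CV accuracy
        print(f"{leaves:>3} terminal nodes  {name:<13}  accuracy = {acc:.3f}")

Controlling max_leaf_nodes is only one way to fix tree size; limiting the depth or the minimum node size would serve equally well as the knob being studied.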
Citations & Related Records
Times Cited By KSCI: 3
1 Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository. University of California, Irvine, School of Information and Computer Science, http://archive.ics.uci.edu/ml.
2 Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36, 105-139.
3 Breiman, L. (1996a). Bagging predictors. Machine Learning, 24, 123-140.
4 Breiman, L. (1996b). Out-of-bag estimation, Technical Report, Statistics Department, University of California Berkeley, Berkeley, California 94708, https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf.
5 Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
6 Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and regression trees, Chapman and Hall, New York.
7 Dietterich, T. (2000). Ensemble methods in machine learning, Springer, Berlin.
8 Freund, Y. and Schapire, R. (1996). Game theory, on-line prediction and boosting. Proceedings of the Ninth Annual Conference on Computational Learning Theory, 325-332.
9 Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.
10 Heinz, G., Peterson, L. J., Johnson, R. W. and Kerk, C. J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education, 11, http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html.
11 Hothorn, T. and Lausen, B. (2003). Double-bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition, 36, 1303-1309.
12 Kim, A., Kim, J. and Kim, H. (2012). The guideline for choosing the right-size of tree for boosting algorithm. Journal of the Korean Data and Information Science Society, 23, 949-959.
13 Kim, H. and Loh, W. Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589-604.
14 Kim, H. and Loh, W. Y. (2003). Classification trees with bivariate linear discriminant node models. Journal of Computational and Graphical Statistics, 12, 512-530.
15 Kwak, S. and Kim, H. (2014). Comparison of ensemble pruning methods using Lasso-bagging and WAVE-bagging. Journal of the Korean Data and Information Science Society, 25, 1371-1383.
16 Liaw, A. and Wiener, M. (2002). Classification and regression by random forests. R News, 2, 18-22.
17 Loh, W. Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 3, 1710-1737.
18 Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
19 Shim, J. and Hwang, C. H. (2014). Support vector quantile regression ensemble with bagging. Journal of the Korean Data and Information Science Society, 25, 677-684.
20 Schapire, R. E. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37, 297-336.
21 Statlib. (2010). Datasets archive. Carnegie Mellon University, Department of Statistics, http://lib.stat.cmu.edu.
22 Terhune, J. M. (1994). Geographical variation of harp seal underwater vocalizations. Canadian Journal of Zoology, 72, 892-897.
23 Therneau, T. and Atkinson, E. (1997). An introduction to recursive partitioning using the RPART routines, Mayo Foundation, Rochester, Minnesota. http://eric.univ-lyon2.fr/~ricco/cours/didacticiels/r/longdocrpart.pdf.
24 Zhu, J., Zou, H., Rosset, S. and Hastie, T. (2009). Multi-class AdaBoost. Statistics and its Interface, 2, 349-360.