
Simple hypotheses testing for the number of trees in a random forest  

Park, Cheol-Yong (Department of Statistics, Keimyung University)
Publication Information
Journal of the Korean Data and Information Science Society / v.21, no.2, 2010, pp. 371-377
Abstract
In this study, we propose two informal hypothesis tests that may be useful in determining the number of trees in a random forest used for classification. The first test declares a case 'easy' if the hypothesis that the two most popular classes have equal probabilities is rejected. The second test declares a case 'hard' if the hypothesis that the relative difference, or margin of victory, between the probabilities of the two most popular classes is at least some small number, say 0.05, is rejected. We propose to continue generating trees until all (or all but a small fraction) of the training cases have been declared easy or hard. The advantage of combining the second test with the first is that stopping requires far fewer trees than with the first test alone, under which all (or all but a small fraction) of the training cases must be declared easy.
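As a rough illustration (not the paper's own code), both tests can be cast as binomial tests conditional on the votes cast for the two most popular classes: under equality the leading class's vote count is Binomial(n, 0.5), and a margin of victory of at least 0.05 corresponds to a conditional vote proportion of at least (1 + 0.05)/2. The function names, this conditional-binomial reduction, and the 0.05 significance level are assumptions of this sketch:

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space
    so that large vote counts do not overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def is_easy(n1, n2, alpha=0.05):
    """Test 1: conditional on the n = n1 + n2 votes for the two most
    popular classes, n1 ~ Binomial(n, 0.5) under H0: p1 = p2.
    Rejecting (two-sided) means the winner is already clear: 'easy'."""
    n, k = n1 + n2, max(n1, n2)
    p_value = min(1.0, 2 * sum(binom_pmf(i, n, 0.5) for i in range(k, n + 1)))
    return p_value < alpha

def is_hard(n1, n2, margin=0.05, alpha=0.05):
    """Test 2: H0 says the margin of victory p1 - p2 is at least
    `margin`, i.e. the conditional vote proportion is at least
    p0 = (1 + margin) / 2.  Rejecting (lower tail) means no clear
    winner is likely to emerge: 'hard'."""
    n, k = n1 + n2, max(n1, n2)
    p0 = (1 + margin) / 2
    p_value = sum(binom_pmf(i, n, p0) for i in range(0, k + 1))
    return p_value < alpha

def enough_trees(top_two_votes, fraction=1.0):
    """Stopping rule: stop generating trees once all (or at least
    `fraction`) of the training cases are declared easy or hard.
    `top_two_votes` holds the (n1, n2) top-two vote counts per case."""
    decided = sum(is_easy(a, b) or is_hard(a, b) for a, b in top_two_votes)
    return decided >= fraction * len(top_two_votes)
```

A lopsided vote such as (90, 10) is declared easy, while a dead heat that persists over many trees, such as (1000, 1000), is declared hard; a modest split like (55, 45) is neither, so more trees would be generated.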
Keywords
Hypotheses testing; random forest;
Citations & Related Records
  • Reference
1 Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
2 Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.
3 Hamza, M. and Larocque, D. (2005). An empirical comparison of ensemble methods based on classification trees. Journal of Statistical Computation & Simulation, 75, 629-643.
4 Lee, J. W., Lee, J. B., Park, M. and Song, S. H. (2005). An extensive evaluation of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48, 869-885.
5 Park, C. (2007). A stopping rule for the number of generating trees in a random forest. Journal of the Institute of Natural Sciences, 27, 7-10.
6 Ramey, J. T. and Alam, K. (1979). A sequential procedure for selecting the most probable multinomial event. Biometrika, 66, 171-173.
7 Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651-1686.
8 Bhandari, S. K. and Ali, M. M. (1994). An asymptotically minimax procedure for selecting the t-best multinomial cells. Journal of Statistical Planning & Inference, 38, 65-74.
9 Alam, K. (1971). On selecting the most probable category. Technometrics, 13, 843-850.
10 Amaratunga, D., Cabrera, J. and Lee, Y. S. (2008). Enriched random forests. Bioinformatics, 24, 2010-2014.
11 Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.