
Simple hypotheses testing for the number of trees in a random forest  

Park, Cheol-Yong (Department of Statistics, Keimyung University)
Publication Information
Journal of the Korean Data and Information Science Society / v.21, no.2, 2010, pp. 371-377
Abstract
In this study, we propose two informal hypothesis tests that may be useful in determining the number of trees in a random forest used for classification. The first test declares a case 'easy' if the hypothesis that the two most popular classes have equal probabilities is rejected. The second test declares a case 'hard' if the hypothesis that the relative difference, or margin of victory, between the probabilities of the two most popular classes is at least some small number, say 0.05, is rejected. We propose to continue generating trees until all (or all but a small fraction) of the training cases have been declared easy or hard. The advantage of combining the second test with the first is that stopping requires far fewer trees than with the first test alone, under which all (or all but a small fraction) of the training cases must be declared easy.
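As a rough illustration (not the paper's own code), both tests can be cast as binomial tests conditional on the votes cast for the two most popular classes: under equality the leading class's vote count is Binomial(n, 0.5), and a margin of victory of at least 0.05 corresponds to a conditional vote proportion of at least (1 + 0.05)/2. The function names, this conditional-binomial reduction, and the 0.05 significance level are assumptions of this sketch:

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space
    so that large vote counts do not overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def is_easy(n1, n2, alpha=0.05):
    """Test 1: conditional on the n = n1 + n2 votes for the two most
    popular classes, n1 ~ Binomial(n, 0.5) under H0: p1 = p2.
    Rejecting (two-sided) means the winner is already clear: 'easy'."""
    n, k = n1 + n2, max(n1, n2)
    p_value = min(1.0, 2 * sum(binom_pmf(i, n, 0.5) for i in range(k, n + 1)))
    return p_value < alpha

def is_hard(n1, n2, margin=0.05, alpha=0.05):
    """Test 2: H0 says the margin of victory p1 - p2 is at least
    `margin`, i.e. the conditional vote proportion is at least
    p0 = (1 + margin) / 2.  Rejecting (lower tail) means no clear
    winner is likely to emerge: 'hard'."""
    n, k = n1 + n2, max(n1, n2)
    p0 = (1 + margin) / 2
    p_value = sum(binom_pmf(i, n, p0) for i in range(0, k + 1))
    return p_value < alpha

def enough_trees(top_two_votes, fraction=1.0):
    """Stopping rule: stop generating trees once all (or at least
    `fraction`) of the training cases are declared easy or hard.
    `top_two_votes` holds the (n1, n2) top-two vote counts per case."""
    decided = sum(is_easy(a, b) or is_hard(a, b) for a, b in top_two_votes)
    return decided >= fraction * len(top_two_votes)
```

A lopsided vote such as (90, 10) is declared easy, while a dead heat that persists over many trees, such as (1000, 1000), is declared hard; a modest split like (55, 45) is neither, so more trees would be generated.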
Keywords
Hypotheses testing; random forest;
Citations & Related Records
  • Reference
1 Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
2 Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.
3 Hamza, M. and Larocque, D. (2005). An empirical comparison of ensemble methods based on classification trees. Journal of Statistical Computation & Simulation, 75, 629-643.
4 Lee, J. W., Lee, J. B., Park, M. and Song, S. H. (2005). An extensive evaluation of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48, 869-885.
5 Park, C. (2007). A stopping rule for the number of generating trees in a random forest. Journal of the Institute of Natural Sciences, 27, 7-10.
6 Ramey, J. T. and Alam, K. (1979). A sequential procedure for selecting the most probable multinomial event. Biometrika, 66, 171-173.
7 Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651-1686.
8 Bhandari, S. K. and Ali, M. M. (1994). An asymptotically minimax procedure for selecting the t-best multinomial cells. Journal of Statistical Planning & Inference, 38, 65-74.
9 Alam, K. (1971). On selecting the most probable category. Technometrics, 13, 843-850.
10 Amaratunga, D., Cabrera, J. and Lee, Y. S. (2008). Enriched random forests. Bioinformatics, 24, 2010-2014.
11 Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.