http://dx.doi.org/10.7465/jkdi.2016.27.4.855

A simple diagnostic statistic for determining the size of random forest

Park, Cheolyong (Major in Statistics, Keimyung University)

Publication Information

Journal of the Korean Data and Information Science Society, v.27, no.4, 2016, pp. 855-863

Abstract

This study proposes a simple diagnostic statistic for determining the size of a random forest. The method is based on the margin of victory (MV), a scaled difference between the votes that the infinite forest gives to the first and second most popular categories of the current random forest. A negative MV indicates a discrepancy between the current and infinite forests. More precisely, the method is based on the proportion of cases for which -MV exceeds a fixed small positive number (say, 0.03). We derive an appropriate diagnostic statistic for this method and then calculate its distribution. A simulation study compares our method with a recently proposed diagnostic statistic.
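The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's statistic: the abstract does not give the scaling of MV or say how the infinite-forest votes are obtained, so the sketch assumes that MV is a plain difference of vote proportions and that the limiting proportions are already known (as they might be in a simulation). The names margin_of_victory, discrepancy_proportion, and the toy numbers are hypothetical.

```python
import numpy as np


def margin_of_victory(current_votes, infinite_props):
    """Per-case MV under the assumptions stated above.

    current_votes:  (n_cases, n_classes) vote counts from the current forest.
    infinite_props: (n_cases, n_classes) limiting vote proportions, assumed
                    known here; the abstract does not say how to obtain them.
    """
    current_votes = np.asarray(current_votes, dtype=float)
    infinite_props = np.asarray(infinite_props, dtype=float)
    # Rank classes by current-forest votes, most popular first.
    # A stable sort breaks ties by class index (an arbitrary choice).
    order = np.argsort(-current_votes, axis=1, kind="stable")
    first, second = order[:, 0], order[:, 1]
    rows = np.arange(current_votes.shape[0])
    # Infinite-forest votes for the current winner minus the current runner-up.
    return infinite_props[rows, first] - infinite_props[rows, second]


def discrepancy_proportion(mv, threshold=0.03):
    """Proportion of cases for which -MV exceeds the threshold."""
    return float(np.mean(-np.asarray(mv) > threshold))


# Toy data: three cases, three classes, a current forest of 100 trees.
votes = [[60, 40, 0], [45, 55, 0], [34, 33, 33]]
props = [[0.58, 0.40, 0.02], [0.52, 0.47, 0.01], [0.30, 0.36, 0.34]]

mv = margin_of_victory(votes, props)
print(mv, discrepancy_proportion(mv))
```

In this toy data the second and third cases have a negative MV: the class that currently leads would lose in the assumed infinite forest, which is the kind of discrepancy the proportion is meant to count.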

Keywords

Diagnostic statistic; margin of victory; random forest; size determination
