http://dx.doi.org/10.5351/CKSS.2006.13.1.151

Estimating Prediction Errors in Binary Classification Problem: Cross-Validation versus Bootstrap  

Kim Ji-Hyun (Dept. of Statistics, Soongsil University)
Cha Eun-Song (Dept. of Statistics, Soongsil University)
Publication Information
Communications for Statistical Applications and Methods, Vol. 13, No. 1, 2006, pp. 151-165
Abstract
It is important to estimate the true misclassification rate of a given classifier when an independent set of test data is not available. Cross-validation and the bootstrap are two possible approaches in this case. In the related literature, bootstrap estimators of the true misclassification rate were asserted to outperform cross-validation estimators for small samples. We compare the two estimators empirically when the classification rule is so adaptive to the training data that its apparent misclassification rate is close to zero. We confirm that bootstrap estimators perform better for small samples because of their small variance, and we find that their bias tends to remain significant even for moderate to large samples, in which case cross-validation estimators perform better with less computation.
Keywords
Generalization Error; Prediction Accuracy; Classification Tree; Boosting
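The setting compared in the abstract can be sketched in code. The following is a minimal illustration, not the paper's implementation (the study was carried out in R with rpart and boosting; see the references): a 1-nearest-neighbor rule, whose apparent (resubstitution) error is zero by construction, is scored both by k-fold cross-validation and by Efron's .632 bootstrap. All function names and the synthetic data below are hypothetical.

```python
import numpy as np

def one_nn_predict(train_X, train_y, X):
    # 1-NN classifier: each test point gets the label of its nearest
    # training point. On the training set itself, every point is its own
    # nearest neighbor, so the apparent error is exactly zero -- the
    # highly adaptive regime discussed in the abstract.
    d = ((X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[d.argmin(axis=1)]

def cv_error(X, y, k=10, seed=0):
    # k-fold cross-validation estimate of the misclassification rate.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        pred = one_nn_predict(X[mask], y[mask], X[fold])
        errs.append((pred != y[fold]).mean())
    return float(np.mean(errs))

def bootstrap632_error(X, y, B=100, seed=0):
    # Efron's .632 bootstrap: average the error on out-of-bootstrap
    # points, then shrink toward the apparent error.
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_errs = []
    for _ in range(B):
        boot = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), boot)     # points left out
        if oob.size == 0:
            continue
        pred = one_nn_predict(X[boot], y[boot], X[oob])
        oob_errs.append((pred != y[oob]).mean())
    eps_oob = float(np.mean(oob_errs))             # leave-one-out bootstrap error
    apparent = float((one_nn_predict(X, y, X) != y).mean())  # 0 for 1-NN
    return 0.368 * apparent + 0.632 * eps_oob

# Hypothetical small two-class sample of the kind used in the comparison.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
cv = cv_error(X, y, k=5)
b632 = bootstrap632_error(X, y, B=50)
```

Because the apparent error of 1-NN is zero, the .632 estimator reduces to 0.632 times the out-of-bootstrap error, which is the source of the downward bias the abstract reports for larger samples; the .632+ rule of Efron and Tibshirani (1997) was designed to correct exactly this.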
References
1 Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, Vol. 36, 105-139.
2 Merler, S. and Furlanello, C. (1997). Selection of tree-based classifiers with the bootstrap 632+ rule. RIST Technical Report TR-9605-01, revised Jan. 1997.
3 Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, Vol. 92, 548-560.
4 Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall.
5 Therneau, T.M. and Atkinson, E.J. (1997). An introduction to recursive partitioning using the RPART routines. Technical Report, Mayo Foundation.
6 Cha, E.S. (2005). A comparative study of methods for estimating prediction error. Master's thesis, Soongsil University. (In Korean)
7 Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases. University of California, Irvine, Department of Information and Computer Science.
8 Braga-Neto, U.M. and Dougherty, E.R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics, Vol. 20, 374-380.
9 Crawford, S.L. (1989). Extensions to the CART algorithm. International Journal of Man-Machine Studies, Vol. 31, 197-217.
10 Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, Vol. 78, 316-331.
11 Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, Vol. 55, 119-139.
12 Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Technical Report, Stanford University, Department of Computer Science.
13 R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Available from http://www.R-project.org