Variable Selection with Regression Trees

  • Received : 2010.01
  • Accepted : 2010.02
  • Published : 2010.04.30

Abstract

Many tree algorithms have been developed for regression problems. Although they are regarded as good algorithms, most of them suffer a loss of prediction accuracy when many noise variables are present. To handle this problem, we propose the multi-step GUIDE, a regression tree algorithm with a variable selection process. The multi-step GUIDE performs better than well-known algorithms such as Random Forest and MARS. Simulation results show that the multi-step GUIDE outperforms the other algorithms in terms of both variable selection and prediction accuracy: it generally selects the important variables correctly, with relatively few noise variables, and consequently gives good prediction accuracy.
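
As a rough illustration of the multi-step idea described above (fit a regression tree, keep only the variables it actually uses, then refit on the reduced set), the following Python sketch uses scikit-learn's CART regression tree as a stand-in for GUIDE, which is distributed as separate software (Loh, 2002). The simulated data, the importance-based selection rule, and all names here are hypothetical; this is not the authors' algorithm, only a sketch of the workflow.

    # Minimal sketch of a multi-step variable selection with a regression tree.
    # Assumes scikit-learn is installed; CART is used as a stand-in for GUIDE.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n, p = 500, 20                      # 20 predictors, most of them noise
    X = rng.normal(size=(n, p))
    y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

    # Step 1: fit a tree on all candidate variables.
    full_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

    # Step 2: variable selection -- keep predictors with nonzero importance,
    # i.e. those the tree actually split on.
    selected = np.flatnonzero(full_tree.feature_importances_ > 0)
    print("selected variables:", selected)

    # Step 3: refit the tree using only the selected variables.
    final_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X[:, selected], y)
    print("training R^2 on selected variables:", final_tree.score(X[:, selected], y))

In a sketch like this, the noise variables that never enter a split are dropped before the final fit, which mimics the benefit the abstract attributes to the variable selection step.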

References

  1. Belsley, D. A. (1980). On the efficient computation of the nonlinear full-information maximum-likelihood estimator, Journal of Econometrics, 14, 203-225. https://doi.org/10.1016/0304-4076(80)90091-3
  2. Breiman, L. (2001). Random Forests, Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
  3. Chattopadhyay, S. (2003). Divergence between the Hicksian welfare measures: The case of revealed preference for public amenities, Journal of Applied Econometrics, 17, 641-66.
  4. Cook, D. and Weisberg, S. (1994). An Introduction to Regression Graphics, Wiley, New York.
  5. Denman, N. and Gregory, D. (1998). Analysis of sugar cane yields in the Mulgrave area for the 1997 sugar cane season, Technical Report, MS305 Data Analysis Project, Department of Mathematics, University of Queensland.
  6. Doksum, K., Tang, S. and Tsui, K. W. (2008). Nonparametric variable selection: The EARTH algorithm, Journal of the American Statistical Association, 103, 1609-1620. https://doi.org/10.1198/016214508000000878
  7. Friedman, J. H. (1991). Multivariate adaptive regression splines, Annals of Statistics, 19, 1-67. https://doi.org/10.1214/aos/1176347963
  8. Kenkel, D. and Terza, J. (2001). The effect of physician advice on alcohol consumption: Count regression with an endogenous treatment effect, Journal of Applied Econometrics, 16, 165-184. https://doi.org/10.1002/jae.596
  9. Liu, Z. and Stengos, T. (1999). Non-linearities in cross country growth regressions: A semiparametric approach, Journal of Applied Econometrics, 14, 527-538. https://doi.org/10.1002/(SICI)1099-1255(199909/10)14:5<527::AID-JAE528>3.0.CO;2-X
  10. Loh, W. Y. (2002). Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 12, 361-386.
  11. Onoyama, K., Ohsumi, N., Mitsumochi, N. and Kishihara, T. (1998). Data analysis of deer-train collisions in eastern Hokkaido, in Data Science, Classification, and Related Methods (eds. Hayashi, C., Ohsumi, N., Yajima, K., Tanaka, Y., Bock, H.-H. and Baba, Y.), 746-751, Japan.
  12. Svetnik, V., Liaw, A., Tong, C. and Culberson, J. C. (2003). Random forest: A classification and regression tool for compound classification and QSAR modeling, Journal of Chemical Information and Computer Sciences, 43, 1947-1958. https://doi.org/10.1021/ci034160g

Cited by

  1. Multi-Step Classification Trees vol.41, pp.9, 2012, https://doi.org/10.1080/03610918.2011.624238