http://dx.doi.org/10.5351/KJAS.2017.30.1.041

Prediction of golf scores on the PGA tour using statistical models  

Lim, Jungeun (Department of Statistics, Ewha Womans University)
Lim, Youngin (Department of Statistics, Ewha Womans University)
Song, Jongwoo (Department of Statistics, Ewha Womans University)
Publication Information
The Korean Journal of Applied Statistics / v.30, no.1, 2017, pp. 41-55
Abstract
This study predicts the average scores of the top 150 PGA golf players in 132 PGA Tour tournaments (2013-2015) using data mining techniques and statistical analysis. It also aims to predict the Top 10 and Top 25 players in four different playoffs. Linear and nonlinear regression methods were used to predict average scores. The linear methods were stepwise regression, best subset selection, LASSO, ridge regression, and principal component regression; the nonlinear methods were regression trees, bagging, gradient boosting, neural networks, random forests, and KNN. We found that the average score increases as fairway firmness, green height, or average maximum wind speed increases, and decreases as the number of one-putts, the scrambling variable, or the longest driving distance increases. All 11 models have low prediction error when predicting the average scores of the 2015 PGA Tour tournaments, which were not included in the training set. However, bagging and random forests perform best among all models, and these two models have the highest prediction accuracy when predicting the Top 10 and Top 25 players in the four playoffs.
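To make the modeling workflow concrete, the following is a minimal R sketch (R is assumed as the analysis environment) that fits one of the linear methods (LASSO) and one of the nonlinear methods (random forest) and compares their prediction error on a held-out 2015 season. The data frame `pga`, its column names (`avg_score`, `year`, and the predictor columns), and the exact train/test split are illustrative assumptions, not the authors' actual code or variables.

library(glmnet)        # LASSO: linear regression with an L1 penalty
library(randomForest)  # random forest: nonlinear ensemble of regression trees

## Hypothetical data frame `pga`: one row per player-tournament record with the
## response `avg_score`, a `year` column, and predictors such as one-putts,
## scrambling, driving distance, fairway firmness, green height, wind speed.
train <- subset(pga, year < 2015)   # 2013-2014 seasons for fitting
test  <- subset(pga, year == 2015)  # 2015 season held out for evaluation

## Numeric design matrices (drop the intercept column and the year column)
x_train <- model.matrix(avg_score ~ . - year, data = train)[, -1]
x_test  <- model.matrix(avg_score ~ . - year, data = test)[, -1]

## LASSO: 10-fold cross-validation chooses the penalty parameter
cv_lasso   <- cv.glmnet(x_train, train$avg_score, alpha = 1)
pred_lasso <- as.numeric(predict(cv_lasso, newx = x_test, s = "lambda.min"))

## Random forest on the same predictors
rf      <- randomForest(x = x_train, y = train$avg_score, ntree = 500)
pred_rf <- predict(rf, newdata = x_test)

## Compare test-set mean squared prediction error
mse <- function(y, yhat) mean((y - yhat)^2)
c(lasso = mse(test$avg_score, pred_lasso),
  rf    = mse(test$avg_score, pred_rf))

The remaining models considered in the paper (ridge regression, principal component regression, gradient boosting, neural networks, KNN, and so on) would slot into the same train/predict/evaluate pattern with their respective packages.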
Keywords
PGA tour; golf; average score; linear regression; tree; bagging; gradient boosting; neural network; random forest; KNN; FedExCup