Prediction of golf scores on the PGA tour using statistical models

Prediction and analysis of golf scores on the PGA Tour

  • Lim, Jungeun (Department of Statistics, Ewha Womans University) ;
  • Lim, Youngin (Department of Statistics, Ewha Womans University) ;
  • Song, Jongwoo (Department of Statistics, Ewha Womans University)
  • 임정은 (이화여자대학교 통계학과) ;
  • 임영인 (이화여자대학교 통계학과) ;
  • 송종우 (이화여자대학교 통계학과)
  • Received : 2016.09.21
  • Accepted : 2016.12.27
  • Published : 2017.02.28

Abstract

This study predicts the average scores of the top 150 PGA golf players in 132 PGA Tour tournaments (2013-2015) using data mining techniques and statistical analysis. It also aims to predict the Top 10 and Top 25 players in four playoff tournaments. Linear and nonlinear regression methods were used to predict average scores. The linear methods were stepwise regression, best subset selection, LASSO, ridge regression, and principal component regression. The nonlinear methods were regression trees, bagging, gradient boosting, neural networks, random forests, and k-nearest neighbors (KNN). We found that the average score increases as fairway firmness, green height, or average maximum wind speed increases, and decreases as the number of one-putts, the scrambling variable, or the longest driving distance increases. All 11 models had low prediction error on the 2015 PGA tournaments, which were not included in the training set. Bagging and random forests performed best among all models, and these two models also had the highest prediction accuracy for the Top 10 and Top 25 players in the four playoffs.

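Golf has recently become established as a hobby for many people, and golf-related research is being carried out in a variety of directions. In this study, we use data mining techniques to predict the average scores of players on the PGA Tour and to identify the variables that significantly affect scores. We additionally aim to predict the top 10 and top 25 players in four PGA Tour playoffs. We predict average scores using a range of linear and nonlinear regression methods: for linear regression we used stepwise selection, all possible regressions, LASSO, ridge regression, and principal component regression, and for nonlinear regression we used trees (CART), bagging, gradient boosting, neural networks, random forests, and k-nearest neighbors (KNN). Looking at the variables selected in common by most models, players' average scores rise as fairway firmness, green grass height, and average maximum wind speed increase; conversely, average scores tend to fall as values increase for the number of one-putts, the scrambling variables (making a birdie or eagle after missing the green in regulation), and longest drive, which reflects the ability to hit the ball far. All 11 models showed low error rates when predicting the test data, the 2015 tournament results, but bagging and random forests predicted best, and both models showed a fairly high hit rate when predicting the top 10 and top 25 rankings.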
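The abstract reports two kinds of evaluation: prediction error on held-out average scores, and accuracy in identifying the Top 10 and Top 25 players. The abstract does not give the exact formulas, so the following is only a minimal sketch under two assumptions: that error is measured as root mean squared error, and that ranking accuracy is the share of the actual top-k players who also appear in the predicted top-k. The function names and the five-player data set are hypothetical illustrations, not the authors' code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of scores."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def top_k(scores, k):
    """Names of the k players with the lowest average score (lower is better in golf)."""
    return set(sorted(scores, key=scores.get)[:k])


def top_k_hit_rate(actual, predicted, k):
    """Share of the actual top-k players that also appear in the predicted top-k."""
    return len(top_k(actual, k) & top_k(predicted, k)) / k


# Hypothetical average scores for five players in one tournament (assumed data).
actual = {"A": 69.1, "B": 69.8, "C": 70.4, "D": 70.9, "E": 71.5}
predicted = {"A": 69.4, "B": 70.6, "C": 70.1, "D": 70.8, "E": 71.2}

players = sorted(actual)
error = rmse([actual[p] for p in players], [predicted[p] for p in players])
hit_rate = top_k_hit_rate(actual, predicted, k=2)

print(f"RMSE: {error:.3f}")
print(f"Top-2 hit rate: {hit_rate:.2f}")
```

In this toy data the model ranks player C ahead of player B, so only one of the two actual top players is recovered even though every individual score is predicted to within one stroke. This is why a ranking-based measure can tell models apart that have similar average error.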
