Horse race rank prediction using learning-to-rank approaches

Rank prediction for Seoul horse races using learning-to-rank techniques

  • Junhyoung Chung (Department of Statistics, Seoul National University) ;
  • Donguk Shin (Department of Statistics, Seoul National University) ;
  • Seyong Hwang (Department of Statistics, Seoul National University) ;
  • Gunwoong Park (Department of Statistics, Seoul National University)
  • Received : 2023.09.24
  • Accepted : 2023.11.27
  • Published : 2024.04.30

Abstract

This research applies both point-wise and pair-wise learning strategies within the learning-to-rank (LTR) framework to predict horse race rankings in Seoul. Specifically, for point-wise learning, we employ a linear model and a random forest. For pair-wise learning, we utilize RankNet and LambdaMART implementations (XGBoost Ranker, LightGBM Ranker, and CatBoost Ranker). Furthermore, to enhance predictions, race records are standardized by race distance, and we integrate various datasets, including race information, jockey information, horse training records, and trainer information. Our results empirically demonstrate that pair-wise learning approaches, which can reflect the order information between items, generally outperform point-wise learning approaches. Notably, CatBoost Ranker is the top performer. Through Shapley value analysis, we identified that the important variables for CatBoost Ranker include the horse's past performance, its most recent race record, the number of its starting-gate training sessions, its cumulative number of starting-gate training sessions, and the number of its disease diagnoses.
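The pair-wise idea that RankNet builds on can be made concrete with a small sketch. The toy Python function below (hypothetical names and scores; a simplified illustration of the loss from Burges et al. (2005), not the paper's implementation) takes the model scores of two horses and computes the cross-entropy between the modeled ordering probability and the observed ordering:

```python
import math

def ranknet_pair_loss(s_i, s_j, label_diff):
    """RankNet models P(i ranked above j) = sigmoid(s_i - s_j) and trains
    with the cross-entropy between this probability and the observed
    ordering (Burges et al., 2005)."""
    p_ij = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # sigmoid of the score gap
    # target is 1 if i truly beats j, 0 if j beats i, 0.5 for a tie
    target = 0.5 * (1.0 + label_diff)
    return -(target * math.log(p_ij) + (1.0 - target) * math.log(1.0 - p_ij))

# A correctly ordered pair (i beats j and s_i > s_j) gives a small loss,
# a mis-ordered pair a large one.
good = ranknet_pair_loss(2.0, 0.0, 1)   # scores agree with the outcome
bad  = ranknet_pair_loss(0.0, 2.0, 1)   # scores contradict the outcome
assert good < bad
```

Pair-wise methods minimize this loss over all comparable pairs of horses within a race, which is how order information between items enters the training signal.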

This study applied point-wise and pair-wise learning, two learning-to-rank (LTR) strategies, to predict the rankings of horse races in Seoul. For point-wise learning we used linear regression and random forest; for pair-wise learning we used RankNet and LambdaMART (XGBoost Ranker, LightGBM Ranker, and CatBoost Ranker). To address the data imbalance problem, race records were standardized by race distance during preprocessing, and diverse data sources, including race information, jockey information, horse information, and trainer information, were used to improve predictive performance. The results confirmed that pair-wise learning, which can learn the rank relations between items, generally outperformed point-wise learning. In particular, CatBoost Ranker achieved the best predictive performance among the proposed models. Finally, a Shapley value analysis showed that the horse's past performance, its most recent race record, its number of starting-gate training sessions, its cumulative number of starting-gate training sessions, and its number of disease diagnoses were among the top ten important variables for CatBoost Ranker.
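The distance-based standardization described in the abstract can be sketched as follows. This is a minimal illustration under an assumed data layout (a list of (distance, finish time) pairs with invented values), not the paper's actual preprocessing code:

```python
from statistics import mean, stdev

def standardize_by_distance(records):
    """Z-score each finish time within its race-distance group, so that
    times from 1000 m and 1800 m races live on a comparable scale.
    `records` is a list of (distance_m, finish_time_s) pairs."""
    groups = {}
    for dist, t in records:
        groups.setdefault(dist, []).append(t)
    stats = {d: (mean(ts), stdev(ts)) for d, ts in groups.items()}
    return [(t - stats[d][0]) / stats[d][1] for d, t in records]

# Hypothetical finish times for two race distances.
records = [(1000, 61.2), (1000, 62.0), (1000, 63.1),
           (1800, 114.5), (1800, 116.0), (1800, 118.2)]
z = standardize_by_distance(records)
# After standardization every distance group is centered at 0, so records
# from different distances can be pooled into one training set.
```

Standardizing within each distance group puts short- and long-distance records on one scale, which mitigates the imbalance that arises when races of different distances are pooled.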

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (NRF-2021R1C1C1004562 and RS-2023-00218231). This research was also supported by the New Faculty Startup Fund from Seoul National University.

References

  1. Breiman L (2001). Random forests, Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
  2. Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, and Hullender G (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning. Association for Computing Machinery, New York, NY, 89-96.
  3. Burges C, Ragno R, and Le Q (2006). Learning to rank with nonsmooth cost functions, Advances in Neural Information Processing Systems, 19.
  4. Burges CJ (2010). From RankNet to LambdaRank to LambdaMART: An overview, Learning, 11, 81.
  5. Chen T and Guestrin C (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, San Francisco, CA, USA, 785-794.
  6. Choe H, Hwang N, Hwang C, and Song J (2015). Analysis of horse races: Prediction of winning horses in horse races using statistical models, The Korean Journal of Applied Statistics, 28, 1133-1146. https://doi.org/10.5351/KJAS.2015.28.6.1133
  7. Grinsztajn L, Oyallon E, and Varoquaux G (2022). Why do tree-based models still outperform deep learning on typical tabular data?, Advances in Neural Information Processing Systems, 35, 507-520.
  8. Hu Z, Wang Y, Peng Q, and Li H (2019). Unbiased LambdaMART: An unbiased pairwise learning-to-rank algorithm. In Proceedings of The World Wide Web Conference. Association for Computing Machinery, New York, NY, USA, 2830-2836.
  9. Järvelin K and Kekäläinen J (2017). IR evaluation methods for retrieving highly relevant documents, ACM SIGIR Forum, 51, 243-250. https://doi.org/10.1145/3130348.3130374
  10. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, and Liu TY (2017). LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems, 30.
  11. Kholkine L, Servotte T, De Leeuw AW, De Schepper T, Hellinckx P, Verdonck T, and Latre S (2021). A learn-to-rank approach for predicting road cycling race outcomes, Frontiers in Sports and Active Living, 3, 714107.
  12. Li P, Qin Z, Wang X, and Metzler D (2019). Combining decision trees and neural networks for learning-to-rank in personal search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 2032-2040.
  13. Liu TY (2009). Learning to rank for information retrieval, Foundations and Trends® in Information Retrieval, 3, 225-331. https://doi.org/10.1561/1500000016
  14. Park G, Park R, and Song J (2017). Analysis of cycle racing ranking using statistical prediction models, The Korean Journal of Applied Statistics, 30, 25-39. https://doi.org/10.5351/KJAS.2017.30.1.025
  15. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, and Gulin A (2018). CatBoost: Unbiased boosting with categorical features, Advances in Neural Information Processing Systems, 31.
  16. Pudaruth S, Medard N, and Dookhun ZB (2013). Horse racing prediction at the Champ de Mars using a weighted probabilistic approach, International Journal of Computer Applications, 72, 39-42. https://doi.org/10.5120/12493-9048
  17. Soldaini L and Goharian N (2017). Learning to rank for consumer health search: A semantic approach. In Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017, Aberdeen, UK, April 8-13, 2017, Proceedings 39 (pp. 640-646). Springer International Publishing.
  18. Wang X, Li C, Golbandi N, Bendersky M, and Najork M (2018). The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 1313-1322.