Predicting movie audience with stacked generalization by combining machine learning algorithms

  • Park, Junghoon (Department of Applied Statistics, Chung-Ang University)
  • Lim, Changwon (Department of Applied Statistics, Chung-Ang University)
  • Received : 2020.10.20
  • Accepted : 2021.02.13
  • Published : 2021.05.31

Abstract

The Korean film industry has matured, and the number of movies watched per capita has reached the highest level in the world. Since then, however, the industry's growth rate has been declining, and total annual movie sales even decreased slightly in 2018. The number of moviegoers is the primary driver of box office revenue and an important factor influencing ancillary sales, so predicting audience numbers is important. In this study, we predict the cumulative audience of films using stacking, an ensemble method that combines all of the algorithms used in the prediction. We use box office data from the Korean Film Council and web comment data from Daum Movie (www.movie.daum.net). This paper describes the collection and preprocessing of the explanatory variables and explains the regression models used in stacking. The final stacking model achieves the best prediction performance on the test set in terms of RMSE.
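To make the idea of stacked generalization concrete, the following is a minimal, dependency-free Python sketch under purely illustrative assumptions: two toy level-0 learners (a one-feature least-squares line and a mean predictor), 5-fold out-of-fold predictions, and a least-squares level-1 blend. It is not the paper's actual pipeline, which combines several regression algorithms over real box office and comment features.

```python
# Minimal sketch of stacked generalization (Wolpert, 1992) for regression.
# Base learners and data here are illustrative only.

def fit_linear(xs, ys):
    """Level-0 learner: ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a /= sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def fit_mean(xs, ys):
    """Level-0 baseline: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def stack(xs, ys, base_fits, k=5):
    """Level-1 model: blend out-of-fold base predictions by least squares."""
    n = len(xs)
    Z = [[0.0] * len(base_fits) for _ in range(n)]  # out-of-fold predictions
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        xtr, ytr = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(base_fits):
            model = fit(xtr, ytr)  # trained without the held-out fold
            for i in range(n):
                if i % k == fold:
                    Z[i][j] = model(xs[i])
    # Meta-learner: solve the 2x2 normal equations Z'Z w = Z'y directly
    # (assumes exactly two base learners, to keep the sketch dependency-free).
    a = sum(z[0] * z[0] for z in Z)
    b = sum(z[0] * z[1] for z in Z)
    c = sum(z[1] * z[1] for z in Z)
    r0 = sum(z[0] * y for z, y in zip(Z, ys))
    r1 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a * c - b * b
    w0, w1 = (c * r0 - b * r1) / det, (a * r1 - b * r0) / det
    # Refit the base learners on all data; the final model is the blend.
    models = [fit(xs, ys) for fit in base_fits]
    return lambda x: w0 * models[0](x) + w1 * models[1](x)

# Toy usage: on noiseless linear data the meta-learner puts essentially
# all of its weight on the linear base learner.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
stacked = stack(xs, ys, [fit_linear, fit_mean])
prediction = stacked(25)  # close to 2 * 25 + 1 = 51
```

The key design point, as in the paper, is that the meta-learner is trained on *out-of-fold* base predictions rather than in-sample fits, so it learns how much to trust each base model without leaking training labels.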

Keywords

References

  1. Breiman L (1999). Random forests, UC Berkeley TR567.
  2. Friedman JH (2002). Stochastic gradient boosting, Computational Statistics & Data Analysis, 38, 367-378. https://doi.org/10.1016/S0167-9473(01)00065-2
  3. Geurts P and Louppe G (2011). Learning to rank with extremely randomized trees, Proceedings of Machine Learning Research, 14, 49-61.
  4. Kim SY, Lim SH, and Jung YS (2010). A comparative study of predictors of film performance by film types: focused on art and commercial films, The Journal of the Korea Contents Association, 10, 381-389.
  5. Korean Film Council (2019). 2018 Korean Film Industry Settlement Report, Busan, Korea.
  6. Lawson R (2015). Web Scraping with Python (1st ed), Packt Publishing Ltd.
  7. Lee JM (2018). A study of machine learning techniques for predicting first-week box office performance using key variable selection and decision trees (Doctoral dissertation), Hanyang University.
  8. Lee K, Park J, Kim I, and Choi Y (2018). Predicting movie success with machine learning techniques: ways to improve accuracy, Information Systems Frontiers, 20, 577-588. https://doi.org/10.1007/s10796-016-9689-z
  9. Liaw A and Wiener M (2002). Classification and regression by randomForest, R News, 2, 18-22.
  10. Paniagua-Tineo A, Salcedo-Sanz S, Casanova-Mateo C, Ortiz-Garcia EG, Cony MA, and Hernandez-Martin E (2011). Prediction of daily maximum temperature using a support vector regression algorithm, Renewable Energy, 36, 3054-3060. https://doi.org/10.1016/j.renene.2011.03.030
  11. Park SY (2012). The influence of word-of-mouth via SNS on movie box office performance: focusing on the case of Sunny, The Journal of the Korea Contents Association, 12, 40-53. https://doi.org/10.5392/JKCA.2012.12.07.040
  12. Pontil M, Rifkin R, and Evgeniou T (1998). From regression to classification in support vector machines.
  13. R Core Team (2019). R: A language and environment for statistical computing, R Foundation for Statistical Computing, URL https://www.R-project.org/.
  14. Segal MR (2004). Machine learning benchmarks and random forest regression, UCSF: Center for Bioinformatics and Molecular Biostatistics, Retrieved June 2, 2021, from: https://escholarship.org/uc/item/35x3v9t4.
  15. Solomatine DP and Shrestha DL (2004). AdaBoost.RT: a boosting algorithm for regression problems. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), 1163-1168.
  16. Solomatine DP and Shrestha DL (2006). Experiments with AdaBoost.RT, an improved boosting scheme for regression, Neural Computation, 18, 1678-1710. https://doi.org/10.1162/neco.2006.18.7.1678
  17. Song Y, Liang J, Lu J, and Zhao X (2017). An efficient instance selection algorithm for k nearest neighbor regression, Neurocomputing, 251, 26-34. https://doi.org/10.1016/j.neucom.2017.04.018
  18. Van Rossum G and Drake FL (2009). Python 3 Reference Manual, SohoBooks, United States.
  19. Wolpert DH (1992). Stacked generalization, Neural Networks, 5, 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1
  20. Yu JP and Lee EH (2018). A model of predictive movie 10 million spectators through big data analysis, The Korean Journal of Bigdata, 3, 63-71. https://doi.org/10.36498/kbigdt.2018.3.1.63
  21. Zhang Y and Haghani A (2015). A gradient boosting method to improve travel time prediction, Transportation Research Part C: Emerging Technologies, 58, 308-324. https://doi.org/10.1016/j.trc.2015.02.019
  22. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301-320. https://doi.org/10.1111/j.1467-9868.2005.00503.x