http://dx.doi.org/10.29220/CSAM.2021.28.3.217

Predicting movie audience with stacked generalization by combining machine learning algorithms  

Park, Junghoon (Department of Applied Statistics, Chung-Ang University)
Lim, Changwon (Department of Applied Statistics, Chung-Ang University)
Publication Information
Communications for Statistical Applications and Methods, v.28, no.3, 2021, pp. 217-232
Abstract
The Korean film industry has matured, and the number of movies watched per capita has reached the highest level in the world. However, the industry's growth rate has since been declining, and total annual movie sales even decreased slightly in 2018. The number of moviegoers is the primary driver of sales in the movie industry and also an important factor influencing additional revenue. It is therefore important to predict the number of movie audiences. In this study, we predict the cumulative audience of films using stacking, an ensemble method. Stacking, also known as stacked generalization, is an ensemble method that combines the predictions of all the algorithms used. We use box office data from the Korea Film Council and web comment data from Daum Movie (www.movie.daum.net). This paper describes the process of collecting and preprocessing the explanatory variables and explains the regression models used in stacking. The final stacking model outperforms the individual models on the test set in terms of RMSE.
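The stacking approach described in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the base learners (random forest, gradient boosting) and the elastic-net meta-learner are assumptions chosen to mirror algorithms the study cites, and the actual features, models, and tuning are not specified in the abstract.

```python
# Sketch of stacked generalization (stacking) for a regression target
# such as the cumulative movie audience.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the box-office / web-comment feature matrix.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 learners; their out-of-fold predictions become the inputs
# to the level-1 (meta) learner, here an elastic net.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("gbm", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=ElasticNet(random_state=0),
    cv=5,  # internal cross-validation producing the out-of-fold predictions
)
stack.fit(X_train, y_train)

# RMSE on the held-out test set, the evaluation metric used in the paper.
rmse = mean_squared_error(y_test, stack.predict(X_test)) ** 0.5
print(f"Test RMSE: {rmse:.2f}")
```

The key design point of stacking is that the meta-learner is trained on cross-validated (out-of-fold) predictions of the base learners, which prevents it from simply memorizing base-learner overfitting on the training data.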
Keywords
stacking; stacked generalization; ensemble; machine learning; data mining; movie audience prediction;