Consumer behavior prediction using Airbnb web log data

An, Hyoin; Choi, Yuri; Oh, Raeeun; Song, Jongwoo

doi:10.5351/KJAS.2019.32.3.391

The Korean Journal of Applied Statistics (응용통계연구)

Volume 32, Issue 3 / Pages 391-404 / 2019 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)


Original Korean title: 에어비앤비(Airbnb) 웹 로그 데이터를 이용한 고객 행동 예측

An, Hyoin (안효인; Department of Statistics, Ewha Womans University);
Choi, Yuri (최유리; Department of Statistics, Ewha Womans University);
Oh, Raeeun (오래은; Department of Statistics, Ewha Womans University);
Song, Jongwoo (송종우; Department of Statistics, Ewha Womans University)

Received: 2018.10.25
Reviewed: 2019.01.19
Published: 2019.06.30

https://doi.org/10.5351/KJAS.2019.32.3.391

Abstract

Predictions of customer behavior have mostly relied on customers' fixed characteristics. As customer activity has moved from offline to online, it has become possible to track each customer's web logs. However, although large volumes of web log data can now be collected, research on them has largely been limited to organizing the log data or describing its technical characteristics. In this study, we predict the decision-making time each customer takes to make a first accommodation reservation, using an Airbnb data set provided on the Kaggle website that contains basic customer information, such as gender and age, together with web logs. We consider six models, including Lasso, SVM, Random Forest, and XGBoost, to find the optimal model, and we compare prediction errors with and without the web log data to assess the usefulness of web logs. As a result, we choose the Random Forest classification model, which has a low misclassification rate of about 20%, as the optimal model. In addition, predicting individual customer behavior with web log data gave up to twice the prediction accuracy obtained without it.
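The abstract reports model quality as a misclassification rate and (in Tables 3.1 and 3.2) as a 10-fold CV error. As a rough illustration of those two quantities only, the following Python sketch computes a k-fold cross-validated misclassification rate. It is not the authors' code: the function names, the `"short"`/`"long"` labels, and the majority-class baseline are hypothetical stand-ins, and the folds are contiguous rather than shuffled for simplicity.

```python
from collections import Counter


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_error(X, y, fit, k=10):
    """Mean held-out misclassification rate over k folds.

    `fit(X_train, y_train)` must return a function mapping one row to a label.
    """
    errors = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in held_out]
        errors.append(misclassification_rate([y[i] for i in held_out], preds))
    return sum(errors) / len(errors)


def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most common training label."""
    top = Counter(y_train).most_common(1)[0][0]
    return lambda row: top


# Toy data (made up): one dummy feature per customer, two duration classes.
X = [[0]] * 10
y = ["short", "short", "short", "short", "long",
     "short", "short", "short", "long", "short"]
baseline_error = cv_error(X, y, fit_majority, k=5)
```

To mirror the comparison described in the abstract, one would call `cv_error` twice with the same `fit` function, once on rows that include web log features and once on rows without them, and compare the two error values.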


Figure 2.1. Visualization of web log data structure.

Figure 2.2. Generation of derived variables using ‘Anti join’ method.

Figure 3.1. Variable importance plot of the Random Forest model.

Figure 3.2. Partial Dependence plot of variables with high importance.

Table 2.1. Descriptions on customer information variables

Table 2.2. Descriptions on web log information variables

Table 2.3. Percentage of customers by Duration’s category

Table 2.4. Average score by customer groups

Table 3.1. Comparison of 10-fold CV error and Test error by regression models

Table 3.2. Comparison of 10-fold CV error and Test error by classification models

Table 3.3. Confusion matrix of the Random Forest model

