Consumer behavior prediction using Airbnb web log data

  • An, Hyoin (Department of Statistics, Ewha Womans University) ;
  • Choi, Yuri (Department of Statistics, Ewha Womans University) ;
  • Oh, Raeeun (Department of Statistics, Ewha Womans University) ;
  • Song, Jongwoo (Department of Statistics, Ewha Womans University)
  • Received : 2018.10.25
  • Accepted : 2019.01.19
  • Published : 2019.06.30

Abstract

Predictions of customer behavior have mostly relied on customers' fixed characteristics. As customer activity moves from offline to online, it has become possible to track each customer's web logs. Although large amounts of web log data can now be collected, previous studies have largely been limited to organizing the log data or describing its technical characteristics. In this study, we predict the decision-making time until each customer makes a first reservation, using Airbnb customer data provided on the Kaggle website. The data set includes basic customer information, such as gender and age, as well as web logs. We fit six models, including Lasso, SVM, Random Forest, and XGBoost, to find the optimal model, and we compare prediction errors with and without the web log data to assess its usefulness. As a result, we select the Random Forest classification model, which has a misclassification rate of about 20%, as the optimal model. We also find that using web log data improves prediction accuracy by up to a factor of two compared with not using it.

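The comparison described in the abstract (cross-validated misclassification rate with and without web log features) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the names `make_customers`, `profile`, and `weblog` are hypothetical, and a simple nearest-centroid classifier stands in for the Random Forest used in the study. By construction, only the web log feature carries signal in this toy setup, so the example shows the evaluation procedure, not the paper's result.

```python
import random


def make_customers(n, rng):
    """Synthetic customers as ((profile, weblog), label) pairs.

    Assumption of this toy setup: label 1 means 'books quickly', the
    profile feature is pure noise, and only the web log feature shifts
    with the label.
    """
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        profile = rng.gauss(0.0, 1.0)          # fixed characteristic (uninformative here)
        weblog = rng.gauss(2.0 * label, 1.0)   # web log activity (informative here)
        rows.append(((profile, weblog), label))
    return rows


def fit_centroids(rows, cols):
    """Compute the per-class mean of the selected feature columns."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += x[c]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x, cols):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist2(center):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, center))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def cv_error(rows, cols, k=10):
    """k-fold cross-validated misclassification rate for a feature subset."""
    errors = 0
    for fold in range(k):
        test = rows[fold::k]  # every k-th row, starting at index `fold`
        train = [r for i, r in enumerate(rows) if i % k != fold]
        centroids = fit_centroids(train, cols)
        errors += sum(predict(centroids, x, cols) != y for x, y in test)
    return errors / len(rows)


rng = random.Random(0)
data = make_customers(1000, rng)

err_without = cv_error(data, cols=[0])      # customer information only
err_with = cv_error(data, cols=[0, 1])      # customer information + web log
print(f"10-fold CV error without web log: {err_without:.3f}")
print(f"10-fold CV error with web log:    {err_with:.3f}")
```

Using the same folds for both feature sets keeps the comparison paired, so the difference in error reflects the added feature and not a different split.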

Figure 2.1. Visualization of web log data structure.

Figure 2.2. Generation of derived variables using the ‘Anti join’ method.

Figure 3.1. Variable importance plot of the Random Forest model.

Figure 3.2. Partial dependence plot of variables with high importance.

Table 2.1. Descriptions of customer information variables

Table 2.2. Descriptions of web log information variables

Table 2.3. Percentage of customers by Duration category

Table 2.4. Average score by customer group

Table 3.1. Comparison of 10-fold CV error and test error by regression model

Table 3.2. Comparison of 10-fold CV error and test error by classification model

Table 3.3. Confusion matrix of the Random Forest model

References

  1. Breiman, L. (2001). Random forests, Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
  2. Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
  3. Friedman, J. (2001). Greedy function approximation: a gradient boosting machine, The Annals of Statistics, 29, 1189-1232. https://doi.org/10.1214/aos/1013203451
  4. Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., and Watts, D. J. (2010). Predicting consumer behavior with Web search, Proceedings of the National Academy of Sciences of the United States of America, 107, 17486-17490. https://doi.org/10.1073/pnas.1005962107
  5. Harford, T. (2014). Big data: are we making a big mistake?, Significance, 14-19.
  6. Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Schölkopf, B. (1998). Support vector machines, IEEE Intelligent Systems and their Applications, 13, 18-28. https://doi.org/10.1109/5254.708428
  7. Cadez, I. V., Gaffney, S., and Smyth, P. (2000). A general probabilistic framework for clustering individuals and objects. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 140-149.
  8. Kim, J. K. (2002). A study of web log file analysis for internet marketing of travel agency, Journal of Tourism and Leisure Research, 13, 147-160.
  9. Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014). The parable of Google Flu: traps in big data analysis, Science, 343, 1203-1205. https://doi.org/10.1126/science.1248506
  10. Pandagre, K. N. and Veenadhari, S. (2017). Data mining techniques with web log, International Journal of Advanced Research in Computer Science, 8, 384-386.
  11. Sujatha, V. and Punithavalli (2012). Improved user navigation pattern prediction technique from web log data, Procedia Engineering, 30, 92-99. https://doi.org/10.1016/j.proeng.2012.01.838
  12. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), 58, 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x