Comparison of data mining methods with daily lens data

  • Seok, Kyungha (Department of Data Science, Inje University) ;
  • Lee, Taewoo (Department of Data Science, Inje University)
  • Received : 2013.09.30
  • Accepted : 2013.10.24
  • Published : 2013.11.30

Abstract

To solve classification problems, various data mining techniques have been applied to database marketing, credit scoring, and market forecasting. In this paper, we compare bagging, boosting, LASSO, random forest, and the support vector machine (SVM) on daily lens transaction data. Two classical techniques, the decision tree and logistic regression, are also included. The experiment shows that the random forest has a slightly smaller misclassification rate and standard error than the other methods. The SVM performs well in terms of misclassification rate but poorly in terms of standard error. Taking model interpretability and computing time into consideration, we conclude that the LASSO gives the best result.

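Various data mining techniques are being applied to solve classification problems in fields such as database marketing and market forecasting. In this study, using transaction data from daily lens customers, we compare the classification performance of conventional statistical classification techniques, such as the decision tree and logistic regression, with that of the more recently developed bagging, boosting, LASSO, random forest, and support vector machine. For the comparison experiment, we carried out data cleaning, exploration, derived-variable generation, and variable selection. In the experiment, the support vector machine had a marginally higher correct classification rate than the other models, but its standard deviation was large. In terms of both correct classification rate and standard deviation, the random forest gave the best result. However, considering model interpretability, parsimony, and training time, we concluded that the LASSO model is the appropriate choice.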
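
The abstract compares methods on two quantities: the misclassification rate and its variability across runs. The following is a minimal sketch of how such a summary could be computed, not the authors' actual procedure. It assumes that each method is evaluated over several repeated runs and that "standard error" means the standard error of the mean rate across those runs; the abstract does not specify either point. The function names, the method labels `method_a` and `method_b`, and the per-run rates are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def summarize(rates):
    """Mean rate and standard error of the mean over repeated runs.

    Assumption: the standard error is the sample standard deviation
    divided by sqrt(number of runs).
    """
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return mean(rates), stdev(rates) / math.sqrt(len(rates))


# Hypothetical per-run misclassification rates (illustrative values only,
# not results from the paper).
runs = {
    "method_a": [0.20, 0.22, 0.21, 0.19],
    "method_b": [0.18, 0.30, 0.12, 0.20],
}

for name, rates in runs.items():
    avg, se = summarize(rates)
    print(f"{name}: mean rate = {avg:.3f}, standard error = {se:.3f}")
```

Reporting both numbers side by side makes the trade-off in the abstract visible: a method can have a lower average misclassification rate while being less stable from run to run.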
