A Data Mining Procedure for Unbalanced Binary Classification

  • Jung, Han-Na (Department of Industrial and Management Engineering, Pohang University of Science and Technology) ;
  • Lee, Jeong-Hwa (Department of Industrial and Management Engineering, Pohang University of Science and Technology) ;
  • Jun, Chi-Hyuck (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
  • Received : 2009.08.05
  • Accepted : 2010.02.09
  • Published : 2010.03.01

Abstract

Predicting contract cancellation is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up only a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression, and boosting are incorporated into the proposed approach. The approach was applied to a real data set from the insurance industry, and the results were compared with those of other classification techniques.
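The paper's full procedure (including the clustering and boosting stages) is not reproduced here. As an illustration of two of the named components only, the following is a minimal, hypothetical sketch (not the authors' code) of random over-sampling of the minority class followed by L2-regularized logistic regression, on a toy unbalanced data set:

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    minority = int(y.sum() < len(y) / 2)  # assumes 0/1 labels
    idx_min = np.where(y == minority)[0]
    idx_maj = np.where(y != minority)[0]
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

def fit_l2_logistic(X, y, lam=1.0, lr=0.1, n_iter=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grad = X.T @ (p - y) / len(y) + lam * w / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Toy unbalanced data: 95% class 0, 5% class 1, separated along both features.
n0, n1 = 190, 10
X = np.vstack([rng.normal(-1, 1, (n0, 2)), rng.normal(1, 1, (n1, 2))])
X = np.hstack([X, np.ones((len(X), 1))])           # intercept column
y = np.concatenate([np.zeros(n0), np.ones(n1)])

Xb, yb = oversample_minority(X, y, rng)            # now 190 rows per class
w = fit_l2_logistic(Xb, yb)
pred = (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int)
recall = (pred[y == 1] == 1).mean()                # minority-class recall
```

Training on the balanced sample rather than the raw data keeps the fitted classifier from collapsing to the majority class; the paper's actual procedure replaces these generic ingredients with its own clustering and boosting steps.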
