A Data Mining Procedure for Unbalanced Binary Classification

  • Jung, Han-Na (Department of Industrial and Management Engineering, Pohang University of Science and Technology) ;
  • Lee, Jeong-Hwa (Department of Industrial and Management Engineering, Pohang University of Science and Technology) ;
  • Jun, Chi-Hyuck (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
  • Received : 2009.08.05
  • Accepted : 2010.02.09
  • Published : 2010.03.01

Abstract

Predicting contract cancellation is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up only a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression, and boosting are incorporated into the proposed approach. The approach was applied to a real data set from the insurance industry, and the results were compared with those of other classification techniques.
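The paper's full procedure (including the clustering and boosting stages) is not reproduced here. As an illustration of two of the named components only, the following is a minimal, hypothetical sketch (not the authors' code) of random over-sampling of the minority class followed by L2-regularized logistic regression, on a toy unbalanced data set:

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    minority = int(y.sum() < len(y) / 2)  # assumes 0/1 labels
    idx_min = np.where(y == minority)[0]
    idx_maj = np.where(y != minority)[0]
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

def fit_l2_logistic(X, y, lam=1.0, lr=0.1, n_iter=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grad = X.T @ (p - y) / len(y) + lam * w / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Toy unbalanced data: 95% class 0, 5% class 1, separated along both features.
n0, n1 = 190, 10
X = np.vstack([rng.normal(-1, 1, (n0, 2)), rng.normal(1, 1, (n1, 2))])
X = np.hstack([X, np.ones((len(X), 1))])           # intercept column
y = np.concatenate([np.zeros(n0), np.ones(n1)])

Xb, yb = oversample_minority(X, y, rng)            # now 190 rows per class
w = fit_l2_logistic(Xb, yb)
pred = (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int)
recall = (pred[y == 1] == 1).mean()                # minority-class recall
```

Training on the balanced sample rather than the raw data keeps the fitted classifier from collapsing to the majority class; the paper's actual procedure replaces these generic ingredients with its own clustering and boosting steps.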
