[KSCI] Korea Science Citation Index Service

A Data Mining Procedure for Unbalanced Binary Classification

Jung, Han-Na (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
Lee, Jeong-Hwa (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
Jun, Chi-Hyuck (Department of Industrial and Management Engineering, Pohang University of Science and Technology)

Publication Information

Journal of Korean Institute of Industrial Engineers / v.36, no.1, 2010, pp. 13-21

Abstract

Predicting contract cancellation by customers is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up only a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. The approach incorporates over-sampling, clustering, regularized logistic regression, and boosting. It was applied to a real data set from the insurance domain, and the results were compared with those of several other classification techniques.
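The abstract names the building blocks of the approach but does not say how they are combined, so the following is only an illustrative sketch of two of them: random over-sampling of the minority class followed by a regularized logistic regression. Everything specific here is an assumption and not taken from the paper: the function names, the synthetic toy data, the choice of an L2 penalty, plain batch gradient descent as the optimizer, and all hyperparameter values. Clustering and boosting are left out entirely.

```python
import math
import random


def make_toy_data(n=400, minority_rate=0.05, seed=1):
    """Hypothetical synthetic data: two Gaussian blobs, few positives."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < minority_rate else 0
        center = 1.5 if label == 1 else -0.5
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def oversample_minority(X, y, seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to resample from
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = pos + neg + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic_l2(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2.

    The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    X, y = make_toy_data()
    X_bal, y_bal = oversample_minority(X, y)
    w, b = fit_logistic_l2(X_bal, y_bal)
    print("positives before / after:", sum(y), "/", sum(y_bal))
    print("weights:", w, "intercept:", b)
```

Over-sampling is applied to the training data only. In a realistic evaluation, the held-out data would keep its original class proportions so that the reported performance reflects the true imbalance.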

Keywords

Clustering; Large-scale Data; Over-sampling; Regularized Logistic Regression; Unbalanced Data

