A Data Mining Procedure for Unbalanced Binary Classification  

Jung, Han-Na (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
Lee, Jeong-Hwa (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
Jun, Chi-Hyuck (Department of Industrial and Management Engineering, Pohang University of Science and Technology)
Publication Information
Journal of Korean Institute of Industrial Engineers / v.36, no.1, 2010, pp. 13-21
Abstract
Predicting which customers will cancel their contracts is essential for insurance companies, but the problem is difficult because the customer database is large and the target class, cancelled customers, makes up only a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. The proposed approach incorporates over-sampling, clustering, regularized logistic regression, and boosting. It was applied to a real insurance data set, and the results were compared with those of several other classification techniques.
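As a rough illustration of the over-sampling idea named in the abstract, the sketch below balances a binary data set by randomly duplicating minority-class rows. The abstract does not say which over-sampling scheme the authors use, so plain random over-sampling is an assumption here, and the function name `oversample_minority` and the toy data are hypothetical. Clustering, regularized logistic regression, and boosting are not shown.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size.

    Assumption: plain random over-sampling with replacement; this is not
    necessarily the scheme used in the paper.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # The smaller class is the one that gets duplicated.
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    # Draw extra minority indices (with replacement) to close the size gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data: four retained customers (0) and one cancelled customer (1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
bal_rows, bal_labels = oversample_minority(rows, labels)
print(sum(bal_labels), len(bal_labels))  # prints "4 8"
```

Duplicating rows leaves the majority class untouched, which keeps all available information but enlarges an already large data set; that trade-off is one reason over-sampling is usually paired with other steps such as the clustering the abstract mentions.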
Keywords
Clustering; Large-scale Data; Over-sampling; Regularized Logistic Regression; Unbalanced Data