http://dx.doi.org/10.13088/jiis.2013.19.2.125

A Hybrid SVM Classifier for Imbalanced Data Sets  

Lee, Jae Sik (Dept. of e-Business, School of Business Administration, Ajou University)
Kwon, Jong Gu (Dept. of Management Information Systems, Graduate School, Ajou University)
Publication Information
Journal of Intelligence and Information Systems, Vol.19, No.2, 2013, pp. 125-140
Abstract
We call a data set in which the records of one class far outnumber those of the other class an 'imbalanced data set'. Most classification techniques perform poorly on imbalanced data sets. When evaluating the performance of a classification technique, we need to measure not only 'accuracy' but also 'sensitivity' and 'specificity'. In a customer churn prediction problem, 'retention' records form the majority class and 'churn' records form the minority class. Sensitivity measures the proportion of actual retentions that are correctly identified as such, and specificity measures the proportion of churns that are correctly identified as such. The poor performance of classification techniques on imbalanced data sets is due to low specificity. Many previous studies on imbalanced data sets employed an 'oversampling' technique, in which members of the minority class are sampled more heavily than those of the majority class in order to produce a relatively balanced data set. When a classification model is constructed from this oversampled, balanced data set, specificity improves but sensitivity decreases. In this research, we developed a hybrid model of a support vector machine (SVM), an artificial neural network (ANN), and a decision tree that improves specificity while maintaining sensitivity. We call this hybrid model the 'hybrid SVM model.'
Our hybrid SVM model is constructed and applied as follows. By oversampling from the original imbalanced data set, a balanced data set is prepared. The SVM_I and ANN_I models are constructed from the imbalanced data set, and the SVM_B model is constructed from the balanced data set. The SVM_I model is superior in sensitivity and the SVM_B model is superior in specificity. For a record on which the SVM_I and SVM_B models make the same prediction, that prediction becomes the final solution. If they make different predictions, the final solution is determined by discrimination rules obtained from the ANN and the decision tree: for records on which the two SVM models disagree, a decision tree model is constructed using the ANN_I output value as input and the actual retention or churn label as target. We obtained the following two discrimination rules: 'IF ANN_I output value < 0.285, THEN Final Solution = Retention' and 'IF ANN_I output value ≥ 0.285, THEN Final Solution = Churn.' The threshold 0.285 is the value optimized for the data used in this research; the contribution of this research is the structure, or framework, of the hybrid SVM model, not a specific threshold value, so the threshold in the above rules can be tuned to any value depending on the data. To evaluate the performance of our hybrid SVM model, we used the 'churn data set' in the UCI Machine Learning Repository, which consists of 85% retention customers and 15% churn customers. The accuracy of the hybrid SVM model is 91.08%, better than that of either the SVM_I or the SVM_B model. The points worth noting are its sensitivity, 95.02%, and specificity, 69.24%: the sensitivity of the SVM_I model is 94.65% and the specificity of the SVM_B model is 67.00%, so the hybrid SVM model developed in this research improves on the specificity of the SVM_B model while maintaining the sensitivity of the SVM_I model.
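The decision procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, label strings, and toy inputs are assumptions, and the threshold 0.285 is simply the value reported for the paper's data set.

```python
# Sketch of the hybrid SVM decision rule (illustrative names, not the paper's code).

def hybrid_predict(svm_i_pred, svm_b_pred, ann_i_output, threshold=0.285):
    """Combine the predictions of SVM_I and SVM_B for one record.

    If the two SVM models agree, their common prediction is final.
    If they disagree, the ANN_I output value is compared against the
    discrimination threshold (0.285 in the paper, tunable per data set).
    Labels: 'retention' (majority class) or 'churn' (minority class).
    """
    if svm_i_pred == svm_b_pred:
        return svm_i_pred
    return 'churn' if ann_i_output >= threshold else 'retention'


def sensitivity_specificity(actual, predicted, positive='retention'):
    """Per the paper's definitions: sensitivity is the proportion of actual
    retentions identified as such; specificity is the proportion of churns
    identified as such."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    pos = sum(a == positive for a in actual)
    neg = len(actual) - pos
    return tp / pos, tn / neg
```

For example, when the two SVM models disagree and the ANN_I output is 0.42 (≥ 0.285), `hybrid_predict('retention', 'churn', 0.42)` resolves the record to `'churn'`.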
Keywords
Data Mining; Imbalanced Data Set; SVM; Hybrid Model;
Citations & Related Records
Times Cited by KSCI: 3
1 Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, Vol.16(2002), 321-357.
2 Chen, X., B. Gerlach, and D. Casasent, "Pruning Support Vectors for Imbalanced Data Classification," Proc. Int'l Joint Conf. on Neural Networks, (2005), 1883-1888.
3 Cristianini, N. and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2000.
4 Ganganwar, V., "An Overview of Classification Algorithms for Imbalanced Datasets," Int'l Journal of Emerging Technology and Advanced Engineering, Vol.2, No.4(2012), 42-47.
5 Grzymala-Busse, J., X. Zheng, L. Goodwin, and W. Grzymala-Busse, "An Approach to Imbalanced Data Sets Based on Changing Rule Strength," Proc. AAAI Workshop, (2000), 69-74.
6 Jang, Y. S., J. W. Kim, and J. Hur, "Combined Application of Data Imbalance Reduction Techniques Using Genetic Algorithm," Journal of Intelligence and Information Systems, Vol.14, No.3 (2008), 133-154.
7 Jo, T. and N. Japkowicz, "Class Imbalances versus Small Disjuncts," ACM SIGKDD Explorations, Vol.6(2004), 40-49.
8 Joshi, M., V. Kumar, and R. Agarwal, "Evaluating Boosting Algorithms to Classify Rare Classes : Comparison and Improvements," Proc. 1st IEEE Int'l Conf. on Data Mining, (2001), 257-264.
9 Ling, C. and C. Li, "Data Mining for Direct Marketing Problems and Solutions," Proc. 4th Int'l Conf. on Knowledge Discovery and Data Mining (KDD-98), New York, 1998.
10 Linoff, G. and M. Berry, Data Mining Techniques, 3rd Ed., Wiley Pub. Inc., 2011.
11 McNamee, B., P. Cunningham, S. Byrne, and O. Corrigan, "The Problem of Bias in Training Data in Regression Problems in Medical Decision Support," Artificial Intelligence in Medicine, Vol.24(2002), 51-70.
12 Min, J. H. and Y. C. Lee, "Bankruptcy Prediction Using Support Vector Machine with Optimal Choice of Kernel Function Parameters," Expert Systems with Applications, Vol.28(2005), 603-614.
13 Vapnik, V., The Nature of Statistical Learning Theory, Chapter 5. Springer-Verlag, New York, 1995.
14 Veropoulos, K., C. Campbell, and N. Cristianini, "Controlling the Sensitivity of Support Vector Machines," Proc. Int'l Joint Conf. on AI, (1999), 55-60.
15 Akbani, R., S. Kwek, and N. Japkowicz, "Applying Support Vector Machines to Imbalanced Data Sets," Proc. 15th European Conf. on Machine Learning, (2004), 39-50.
16 Barandela, R., J. S. Sanchez, V. Garcia, and E. Rangel, "Strategies for Learning in Class Imbalance Problems," Pattern Recognition, Vol.36(2003), 849-851.
17 Bache, K. and M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA : University of California, School of Information and Computer Science, 2013.
18 Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
19 Calleja, J., A. Benitez, M. A. Medina, and O. Fuentes, "Machine Learning from Imbalanced Data Sets for Astronomical Object Classification," Proc. Int'l Conf. on Soft Computing and Pattern Recognition, (2011), 435-439.
20 Cardie, C. and N. Howe, "Improving Minority Class Prediction Using Case-Specific Feature Weights," Proc. 14th Int'l Conf. on Machine Learning, (1997), 57-65.
21 Kim, M.-J., "Ensemble Learning with Support Vector Machines for Bond Rating," Journal of Intelligence and Information Systems, Vol.18, No.2(2012), 29-45.
22 Kotsiantis, S. B. and P. E. Pintelas, "Mixture of Expert Agents for Handling Imbalanced Data Sets," Ann. Math. Computer Teleinformatics, (2003), 46-55.
23 Vapnik, V., Estimation of Dependences Based on Empirical Data, Nauka, Moscow, 1979.
24 Kubat, M. and S. Matwin, "Addressing the Curse of Imbalanced Data Sets : One-sided Sampling," Proc. 14th Int'l Conf. on Machine Learning, (1997), 179-186.
25 Lee, J. S. and J. C. Lee, "Customer Churn Prediction by Hybrid Model," Advanced Data Mining and Applications, Lecture Notes in Artificial Intelligence, Vol.4093(2006), 959-966.
26 Wu, G. and E. Chang, "Class-Boundary Alignment for Imbalanced Dataset Learning," Proc. Int'l Conf. on Machine Learning : 2003 Workshop on Learning from Imbalanced Data Sets, Washington, D.C., 2003.
27 Egan, J. P., Signal Detection Theory and ROC Analysis, Academic Press, New York, 1975.
28 Lee, H.-U. and H. Ahn, "An Intelligent Intrusion Detection Model Based on Support Vector Machines and the Classification Threshold Optimization for Considering the Asymmetric Error Cost," Journal of Intelligence and Information Systems, Vol.17, No.4(2011), 157-173.