http://dx.doi.org/10.5351/KJAS.2022.35.2.177

A divide-oversampling and conquer algorithm based support vector machine for massive and highly imbalanced data  

Bang, Sungwan (Department of Mathematics, Korea Military Academy)
Kim, Jaeoh (Department of Data Science, Inha University)
Publication Information
The Korean Journal of Applied Statistics, v.35, no.2, 2022, pp. 177-188
Abstract
The support vector machine (SVM) has been successfully applied to various classification areas with a high level of classification accuracy. However, the SVM is infeasible for analyzing massive data because of its heavy computational cost. Furthermore, when analyzing imbalanced data with different class sizes, the classification accuracy of the SVM on the minority class may drop significantly because its classifier can be biased toward the majority class. To overcome such problems, we propose the DOC-SVM method, which uses a divide-oversampling-and-conquer technique. The proposed DOC-SVM divides the majority class into a few subsets and applies an oversampling technique to the minority class in order to produce balanced subsets. The DOC-SVM then obtains the final classifier by aggregating all SVM classifiers obtained from the balanced subsets. Simulation studies are presented to demonstrate the satisfactory performance of the proposed method.
Keywords
divide and conquer; imbalanced data; massive data; oversampling; support vector machine;
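The abstract's procedure can be sketched in a few lines: partition the majority class, oversample the minority class to balance each partition, fit one SVM per balanced subset, and aggregate the classifiers. The sketch below uses scikit-learn's `SVC` and simple oversampling with replacement as a stand-in for the paper's oversampler; the majority-vote aggregation and the 0/1 label convention are assumptions for illustration, not the authors' exact specification.

```python
import numpy as np
from sklearn.svm import SVC

def doc_svm(X, y, n_subsets=3, random_state=0):
    """Sketch of the DOC-SVM idea. Assumes binary labels:
    0 = majority class, 1 = minority class."""
    rng = np.random.default_rng(random_state)
    X_maj, X_min = X[y == 0], X[y == 1]
    # Divide: split the (shuffled) majority class into n_subsets parts.
    parts = np.array_split(rng.permutation(len(X_maj)), n_subsets)
    clfs = []
    for part in parts:
        # Oversample: draw minority points with replacement so the
        # subset is balanced (stand-in for a SMOTE-style generator).
        os_idx = rng.integers(0, len(X_min), size=len(part))
        Xb = np.vstack([X_maj[part], X_min[os_idx]])
        yb = np.r_[np.zeros(len(part), int), np.ones(len(part), int)]
        clfs.append(SVC(kernel="rbf", gamma="scale").fit(Xb, yb))

    def predict(X_new):
        # Conquer: aggregate the subset classifiers by majority vote.
        votes = np.stack([c.predict(X_new) for c in clfs])
        return (votes.mean(axis=0) >= 0.5).astype(int)

    return predict
```

Because each SVM is trained on only one balanced subset, the per-classifier training cost is far below that of a single SVM on the full data, which is the "divide and conquer" payoff the abstract refers to.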