http://dx.doi.org/10.5351/KJAS.2022.35.2.177

A divide-oversampling and conquer algorithm based support vector machine for massive and highly imbalanced data  

Bang, Sungwan (Department of Mathematics, Korea Military Academy)
Kim, Jaeoh (Department of Data Science, Inha University)
Publication Information
The Korean Journal of Applied Statistics, v.35, no.2, 2022, pp. 177-188
Abstract
The support vector machine (SVM) has been successfully applied to various classification areas with a high level of classification accuracy. However, the SVM is infeasible for analyzing massive data because of its heavy computational cost. Furthermore, when analyzing imbalanced data with different class sizes, the classification accuracy of the SVM on the minority class may drop significantly because its classifier can be biased toward the majority class. To overcome such problems, we propose the DOC-SVM method, which uses a divide-oversampling-and-conquer technique. The proposed DOC-SVM divides the majority class into a few subsets and applies an oversampling technique to the minority class in order to produce balanced subsets. The DOC-SVM then obtains the final classifier by aggregating all SVM classifiers obtained from the balanced subsets. Simulation studies are presented to demonstrate the satisfactory performance of the proposed method.
Keywords
divide and conquer; imbalanced data; massive data; oversampling; support vector machine;
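The abstract's procedure can be sketched in a few lines: partition the majority class, oversample the minority class to balance each partition, fit one SVM per balanced subset, and aggregate the classifiers. The sketch below uses scikit-learn's `SVC` and simple oversampling with replacement as a stand-in for the paper's oversampler; the majority-vote aggregation and the 0/1 label convention are assumptions for illustration, not the authors' exact specification.

```python
import numpy as np
from sklearn.svm import SVC

def doc_svm(X, y, n_subsets=3, random_state=0):
    """Sketch of the DOC-SVM idea. Assumes binary labels:
    0 = majority class, 1 = minority class."""
    rng = np.random.default_rng(random_state)
    X_maj, X_min = X[y == 0], X[y == 1]
    # Divide: split the (shuffled) majority class into n_subsets parts.
    parts = np.array_split(rng.permutation(len(X_maj)), n_subsets)
    clfs = []
    for part in parts:
        # Oversample: draw minority points with replacement so the
        # subset is balanced (stand-in for a SMOTE-style generator).
        os_idx = rng.integers(0, len(X_min), size=len(part))
        Xb = np.vstack([X_maj[part], X_min[os_idx]])
        yb = np.r_[np.zeros(len(part), int), np.ones(len(part), int)]
        clfs.append(SVC(kernel="rbf", gamma="scale").fit(Xb, yb))

    def predict(X_new):
        # Conquer: aggregate the subset classifiers by majority vote.
        votes = np.stack([c.predict(X_new) for c in clfs])
        return (votes.mean(axis=0) >= 0.5).astype(int)

    return predict
```

Because each SVM is trained on only one balanced subset, the per-classifier training cost is far below that of a single SVM on the full data, which is the "divide and conquer" payoff the abstract refers to.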