[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.15207/JKCS.2021.12.1.049

Optimal Ratio of Data Oversampling Based on a Genetic Algorithm for Overcoming Data Imbalance

Shin, Seung-Soo (Dept. of Computer Software, Kwangwoon University)
Cho, Hwi-Yeon (Dept. of Computer Science, Kwangwoon University)
Kim, Yong-Hyuk (Dept. of Computer Software, Kwangwoon University)

Publication Information

Journal of the Korea Convergence Society, v.12, no.1, 2021, pp. 49-55

Abstract

Recent advances in database technology have made it possible to store large volumes of data generated in finance, security, and networking. These data are analyzed with classifiers based on machine learning, and a central problem in doing so is data imbalance. When a classifier is trained on imbalanced data, its classification accuracy may degrade because it over-fits to the majority class. To overcome this problem, oversampling strategies that increase the amount of minority-class data are widely used. However, they require a tuning process to find the method and parameters suited to the data distribution. To improve this process, we propose a strategy that uses a genetic algorithm to explore and optimize the combination and ratio of oversampling methods, including the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs). We oversample credit card fraud detection data, a representative case of data imbalance, with the proposed strategy and with single oversampling strategies, and compare the performance of classifiers trained on each resulting dataset. The strategy in which the ratio of each method was optimized by the genetic algorithm outperformed the previous strategies.
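
The ratio search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoding, operators, or fitness measure, so everything below is an assumption. Each individual is assumed to be a vector of ratios, one per oversampling method (for example SMOTE and a GAN), that sums to 1. The names `normalize`, `allocate`, `genetic_search`, and `toy_fitness` are hypothetical, and `toy_fitness` is only a placeholder; in practice the fitness would presumably be a classifier score obtained after oversampling with the candidate ratios.

```python
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1 (a ratio vector)."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def allocate(ratios, n_needed):
    """Split n_needed synthetic samples across methods according to ratios."""
    counts = [int(r * n_needed) for r in ratios]
    # Give the rounding remainder to the method with the largest ratio.
    counts[ratios.index(max(ratios))] += n_needed - sum(counts)
    return counts


def genetic_search(fitness, n_methods, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Search for a ratio vector that maximizes fitness (assumes pop_size >= 4)."""
    rng = random.Random(seed)
    population = [normalize([rng.random() for _ in range(n_methods)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[:2]                # elitism: carry over the two best
        parents = ranked[:pop_size // 2]  # truncation selection
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(parents, 2)
            alpha = rng.random()
            # Arithmetic crossover: blend the two parents gene by gene.
            child = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
            # Gaussian mutation, clipped at zero so ratios stay non-negative.
            child = [max(0.0, g + rng.gauss(0, 0.1))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(normalize(child))
        population = elite + children
    return max(population, key=fitness)


def toy_fitness(ratios, target=(0.6, 0.3, 0.1)):
    """Placeholder fitness (assumption): closeness to an arbitrary target mix."""
    return -sum((r - t) ** 2 for r, t in zip(ratios, target))


best = genetic_search(toy_fitness, n_methods=3)
print(best, allocate(best, 1000))
```

Swapping `toy_fitness` for a function that oversamples the training data with `allocate(ratios, n_needed)` samples per method, trains a classifier, and returns a validation score would turn this toy search into the kind of ratio optimization the abstract describes.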

Keywords

Data analysis; Data imbalance; Oversampling; Genetic algorithm; Optimization

References

1	V. Chandola, A. Banerjee & V. Kumar. (2009). Anomaly detection: a survey. ACM Computing Surveys, 41(3), 1-58.
2	M. H. Bhuyan, D. K. Bhattacharyya & J. K. Kalita. (2013). Network anomaly detection: methods, systems and tools. IEEE Communications Surveys & Tutorials, 16(1), 303-336. DOI : 10.1109/SURV.2013.052213.00046
3	N. Japkowicz & S. Stephen. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5), 429-449. DOI : 10.3233/IDA-2002-6504
4	H. Y. Cho & Y. H. Kim. (2020). A genetic algorithm to optimize SMOTE and GAN ratio in class imbalanced datasets. In Proceedings of the Genetic and Evolutionary Computation Conference. (pp. 33-34). Cancun : ACM. DOI : 10.1145/3377929.3398153
5	H. Y. Cho. (2020). Optimization of Data Oversampling Ratio Using a Genetic Algorithm. Master's thesis. Kwangwoon University, Seoul.
6	N. V. Chawla, K. W. Bowyer, L. O. Hall & W. P. Kegelmeyer. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. DOI : 10.1613/jair.953
7	H. Han, W. Y. Wang & B. H. Mao. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. International Conference on Intelligent Computing. (pp. 878-887). Berlin : Springer.
8	J. Mathew, M. Luo, C. K. Pang & H. L. Chan. (2015). Kernel-based SMOTE for SVM classification of imbalanced datasets. Annual Conference of the IEEE Industrial Electronics Society. (pp. 1127-1132). Yokohama : IEEE. DOI : 10.1109/IECON.2015.7392251
9	F. Last, G. Douzas & F. Bacao. (2018). Oversampling for imbalanced learning based on K-means and SMOTE. Information Sciences, 465, 1-20. DOI : 10.1016/j.ins.2018.06.056
10	H. He, Y. Bai, E. A. Garcia & S. Li. (2008). ADASYN: adaptive synthetic sampling approach for imbalanced learning. IEEE International Joint Conference on Neural Networks. (pp. 1322-1328). Hong Kong : IEEE. DOI : 10.1109/IJCNN.2008.4633969
11	D. M. Powers. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), 37-63.
12	I. J. Goodfellow et al. (2014). Generative adversarial nets. Annual Conference on Neural Information Processing Systems. (pp. 2672-2680). Montreal : Curran Associates.
13	G. Douzas & F. Bacao. (2018). Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems with Applications, 91, 464-471. DOI : 10.1016/j.eswa.2017.09.030
14	J. H. Holland. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Cambridge : MIT Press.
15	C. W. Ahn & S. R. Rudrapatna. (2003). Elitism-based compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 7(4), 367-385. DOI : 10.1109/TEVC.2003.814633
16	J. Schmidhuber. (2015). Deep learning in neural networks: an overview. Neural Networks, 61, 85-117. DOI : 10.1016/j.neunet.2014.09.003
