Optimal Ratio of Data Oversampling Based on a Genetic Algorithm for Overcoming Data Imbalance

  • Seung-Soo Shin (School of Software, Kwangwoon University) ;
  • Hwi-Yeon Cho (Department of Computer Science, Kwangwoon University) ;
  • Yong-Hyuk Kim (School of Software, Kwangwoon University)
  • Received : 2020.11.26
  • Accepted : 2021.01.20
  • Published : 2021.01.28

Abstract

Recent advances in database technology make it possible to store the large volumes of data generated in finance, security, and networking, and these data are commonly analyzed with classifiers based on machine learning. A major obstacle in this setting is data imbalance: a classifier trained on imbalanced data may overfit the majority class, degrading classification accuracy. To overcome this problem, oversampling strategies that increase the amount of minority-class data are widely used, but they require a tuning process to find the method and parameters suited to the data distribution. To improve this process, we propose a strategy that uses a genetic algorithm to explore and optimize combinations and ratios of oversampling methods such as the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs). After resampling a credit card fraud detection dataset, a representative case of data imbalance, with both the proposed strategy and single-method oversampling strategies, we compare the performance of classifiers trained on each resampled dataset. As a result, the strategy whose per-method ratios were optimized by the genetic algorithm outperformed the previous strategies.
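The ratio search described above can be sketched as a small genetic algorithm that evolves a vector of per-method oversampling fractions summing to 1. Everything concrete here is an illustrative assumption rather than the paper's actual setup: the method names, the GA operators and hyperparameters, and especially the toy fitness function, which stands in for the real objective of resampling the data with the candidate ratios, training a classifier, and returning an evaluation score such as F1.

```python
import random

METHODS = ["SMOTE", "Borderline-SMOTE", "GAN"]  # illustrative method set

def normalize(ratios):
    """Rescale a ratio vector so its components sum to 1."""
    total = sum(ratios)
    return [r / total for r in ratios]

def toy_fitness(ratios):
    """Placeholder objective peaking at an arbitrary 'best' mix.

    In the paper's setting this would resample the training data with the
    given per-method ratios, train a classifier, and return its score.
    """
    target = [0.5, 0.3, 0.2]  # hypothetical optimum, for illustration only
    return -sum((r - t) ** 2 for r, t in zip(ratios, target))

def crossover(a, b):
    # Uniform crossover on the ratio genes, then renormalize.
    return normalize([random.choice(pair) for pair in zip(a, b)])

def mutate(ratios, rate=0.2, sigma=0.1):
    # Gaussian perturbation per gene, clipped positive, then renormalized.
    child = [max(1e-6, r + random.gauss(0, sigma)) if random.random() < rate else r
             for r in ratios]
    return normalize(child)

def search(pop_size=20, generations=50, seed=0):
    random.seed(seed)
    pop = [normalize([random.random() + 1e-6 for _ in METHODS])
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_fitness, reverse=True)
        elite = pop[: pop_size // 2]  # truncation selection with elitism
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=toy_fitness)

best = search()
print(dict(zip(METHODS, (round(r, 2) for r in best))))
```

Because each individual is renormalized after every operator, the search stays on the simplex of valid ratio mixes, and elitism guarantees the best mix found so far is never lost between generations.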
