A Hybrid Oversampling Technique for Imbalanced Structured Data based on SMOTE and Adapted CycleGAN

  • 노정담 (VOD Data Team, Afreeca TV)
  • 최병구 (Department of AI, Big Data & Management, College of Business Administration, Kookmin University)
  • Received : 2022.10.19
  • Accepted : 2022.11.17
  • Published : 2022.11.30

Abstract

Generative adversarial network (GAN) based oversampling has achieved impressive results on class imbalance in unstructured data such as images, and many studies have begun applying it to class imbalance in structured data. However, these studies convert structured data into an unstructured format and therefore fail to reflect the characteristics of structured data accurately. To overcome this limitation, this study adapts CycleGAN to the structure of structured data and proposes a hybrid oversampling technique that combines the synthetic minority oversampling technique (SMOTE) with the adapted CycleGAN. In particular, unlike previous studies that used two-dimensional convolutional neural networks, the proposed GAN is built on a one-dimensional convolutional neural network (1D-CNN). The proposed method was used to oversample a variety of imbalanced structured datasets, and its performance was compared with that of existing oversampling methods such as SMOTE and adaptive synthetic sampling (ADASYN). The results indicate that the proposed hybrid method outperforms the existing methods when the data have more dimensions or a higher degree of imbalance. The contribution of this study is that, unlike previous work, it improves classification performance through oversampling that preserves the structure of structured data while reflecting the characteristics of the minority class.
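The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the adapted CycleGAN itself is not reproduced here. The first function is a simplified, textbook-style SMOTE interpolation (a random base row is drawn for each synthetic sample, rather than looping over every minority row). The second is a plain one-dimensional convolution that slides along a single feature row, which is the sense in which a 1D network can consume a table row without reshaping it into an image. The names `smote` and `conv1d`, the parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (minority class only).
    Simplification (assumption): the base row is drawn at random each time.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base row (Euclidean distance),
        # excluding the base row itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        # The new row lies on the segment between the base row and the neighbour.
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def conv1d(row, kernel):
    """Valid-mode 1D cross-correlation over one feature row (no padding)."""
    width = len(kernel)
    return [
        sum(row[start + t] * kernel[t] for t in range(width))
        for start in range(len(row) - width + 1)
    ]


# Toy minority class with four features per row (made-up values).
minority = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 2.5, 4.5], [0.5, 1.5, 3.5, 3.5]]
new_rows = smote(minority, n_new=4, k=2)
# A width-2 kernel over a 4-feature row yields 3 local feature responses.
features = conv1d(new_rows[0], kernel=[0.5, 0.5])
```

Because each synthetic row is a convex combination of two real minority rows, it stays inside the per-feature range of the minority class. In a hybrid scheme such as the one the abstract describes, rows like these would presumably be passed on to the generative model rather than used directly.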
