Reinforced Generator GAN Model for Tabular Data Learning

  • Chan-sik Sung (Department of IT Convergence, Gachon University)
  • Joon-sik Lim (Department of Computer Engineering, Gachon University)
  • Received : 2024.09.03
  • Accepted : 2024.09.29
  • Published : 2024.10.31

Abstract

Tabular data is a mixture of numerical and categorical data, and conventional machine learning models have generally been judged more suitable than generative models for learning from it. The reason is that generative models either accumulate an excessive number of parameters or fail to find a direction for learning, owing to two characteristics of tabular data: multimodal distributions in numerical columns and frequency imbalance in categorical columns. However, as data grows toward big-data scale and is increasingly processed in real time, existing machine learning models have shown limitations in their application. In this paper, as a methodology for applying generative models to tabular data, we propose RGGAN (Reinforced Generator GAN), a generator-reinforced adversarial neural network that combines clustering sampling based on conjugate prior distributions with a loss function improved using the Gower coefficient and mutual information. We built an anomaly detector from the discriminators trained with RGGAN and measured the AUC for detecting fraudulent transactions in the IEEE-CIS Fraud Detection Dataset. The detector improved on existing generative models by 1-7%, demonstrating that the proposed model is effective both for learning tabular data and for detecting fraudulent transactions.

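The abstract names the Gower coefficient as one ingredient of the improved loss but does not give the formula used in RGGAN. As background only, the following is a minimal sketch of the standard Gower dissimilarity between two mixed-type rows: numerical columns contribute a range-normalized absolute difference, and categorical columns contribute 0 on a match and 1 otherwise. The function name `gower_distance`, the `numeric_ranges` argument, and the sample rows are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
) -> float:
    """Standard Gower dissimilarity between two mixed-type rows.

    numeric_ranges[i] is (max - min) of column i over the dataset for a
    numerical column, or None to mark column i as categorical.
    (Hypothetical interface for illustration; not the RGGAN loss.)
    """
    if not (len(a) == len(b) == len(numeric_ranges)):
        raise ValueError("rows and numeric_ranges must have the same length")
    if len(a) == 0:
        raise ValueError("rows must contain at least one column")

    total = 0.0
    for x, y, col_range in zip(a, b, numeric_ranges):
        if col_range is None:
            # Categorical column: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
        elif col_range == 0:
            # Constant numerical column: contributes no dissimilarity.
            total += 0.0
        else:
            # Numerical column: absolute difference scaled by the column range.
            total += abs(float(x) - float(y)) / col_range
    # Unweighted mean over all columns.
    return total / len(a)


# Made-up rows: (amount, card type, item count).
row_a = (100.0, "visa", 3.0)
row_b = (150.0, "mastercard", 3.0)
ranges = (200.0, None, 10.0)

distance = gower_distance(row_a, row_b, ranges)
print(distance)
```

For these rows the per-column contributions are 50/200, 1, and 0, so the result is their mean, 1.25/3. Because every column is scaled to the interval [0, 1], numerical and categorical columns contribute on a comparable footing, which is the usual motivation for using this measure on tabular data.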
