Generating and Validating Synthetic Training Data for Predicting Bankruptcy of Individual Businesses

  • Hong, Dong-Suk (Big data Center, KCIS (Korea Credit Information Services))
  • Baik, Cheol (Big data Center, KCIS (Korea Credit Information Services))
  • Received : 2021.08.31
  • Accepted : 2021.11.02
  • Published : 2021.12.31

Abstract

In this study, we analyze the credit information (loans, delinquency records, etc.) of individual business owners and use a partially synthetic data generation technique to produce a large volume of training data for building a bankruptcy prediction model. We then compare the prediction performance obtained with the generated data against that obtained with the actual data. In the experiments (a logistic regression task), training on data generated with conditional tabular generative adversarial networks (CTGAN) improves recall by a factor of 1.75 relative to training on the actual data. The probability that the actual and generated data are sampled from an identical distribution is verified to be well above 80%. Credit rating and default risk prediction for individual businesses have received relatively little research attention; providing artificial intelligence training data for these fields through data synthesis is expected to promote further in-depth research using such methods.
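The two quantities reported in the abstract, a recall ratio and a distribution-similarity check, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the labels and feature values are toy data chosen only to show how a 1.75-fold recall ratio is computed, and the two-sample Kolmogorov-Smirnov statistic is one common way to compare real and synthetic columns, not necessarily the test used in the study.

```python
from bisect import bisect_right


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for the positive class (1 = bankrupt)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy test labels: 8 bankrupt cases followed by 4 non-bankrupt cases (hypothetical).
y_true = [1] * 8 + [0] * 4

# Hypothetical predictions from a model trained on actual data (4 of 8 found)
# and from a model trained on synthetic data (7 of 8 found).
pred_actual = [1, 1, 1, 1, 0, 0, 0, 0] + [0] * 4
pred_synthetic = [1, 1, 1, 1, 1, 1, 1, 0] + [0] * 4

recall_actual = recall(y_true, pred_actual)
recall_synthetic = recall(y_true, pred_synthetic)
improvement = recall_synthetic / recall_actual

# Toy feature columns: identical samples give 0.0, shifted samples a larger gap.
ks_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
ks_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])

print(recall_actual, recall_synthetic, improvement, ks_same, ks_shifted)
```

A smaller KS statistic means the real and synthetic columns are distributed more alike; in practice it would be computed per feature and converted to a p-value or similarity score.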
