Application of Random Over Sampling Examples(ROSE) for an Effective Bankruptcy Prediction Model

Ahn, Cheolhwi; Ahn, Hyunchul

doi:10.5392/JKCA.2018.18.08.525

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 18, Issue 8 / pp.525-535 / 2018 / pISSN 1598-4877 / eISSN 2508-6723

The Korea Contents Association (한국콘텐츠학회)

효과적인 기업부도 예측모형을 위한 ROSE 표본추출기법의 적용

Cheolhwi Ahn (Graduate School of Business IT, Kookmin University);
Hyunchul Ahn (Graduate School of Business IT, Kookmin University)

Received: 2018.07.09
Accepted: 2018.08.21
Published: 2018.08.28

Abstract

When one class occurs far more often than the others in a classification problem, the resulting data imbalance can distort machine learning. Corporate bankruptcy prediction is a typical case: the default rate of companies that deal with financial institutions is generally very low, so solvent cases vastly outnumber insolvent ones. Mitigating this problem requires an appropriate sampling technique. Until now, the most widely used remedy has been oversampling, which balances the class distribution by sampling the minority class with replacement. However, conventional oversampling can increase the risk of overfitting. Against this background, this study proposes applying ROSE (Random Over Sampling Examples), a technique introduced by Menardi and Torelli in 2014, as the sampling technique for training effective corporate bankruptcy prediction models. ROSE repeatedly synthesizes new training examples (synthetic generation), so it can help improve the classification accuracy of classifiers while avoiding the overfitting problem. Specifically, we combine ROSE with the support vector machine (SVM), which is known as the best-performing binary classifier, apply the combination to a real-world bankruptcy prediction case from a major Korean bank, and compare its performance with that of other sampling techniques. The experimental results show that ROSE improves the prediction accuracy of SVM in bankruptcy prediction over the other techniques at a statistically significant level. These results suggest that ROSE can also be a good alternative for resolving data imbalance in prediction problems in other areas of social science beyond bankruptcy prediction.
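The contrast the abstract draws between conventional oversampling and ROSE can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common description of ROSE as a smoothed bootstrap: pick a class with equal probability, pick one of its rows at random, then draw a new point from a Gaussian kernel centred on that row. The diagonal normal-reference bandwidth used here, and the names `random_oversample`, `rose_sample` and `shrink`, are illustrative assumptions.

```python
import random
import statistics


def random_oversample(minority, target_size, rng):
    """Conventional oversampling: redraw existing minority rows with replacement."""
    return [rng.choice(minority) for _ in range(target_size)]


def rose_sample(X, y, n_new, rng, shrink=1.0):
    """ROSE-style smoothed bootstrap (illustrative sketch, not the paper's code).

    Each synthetic row is drawn from a Gaussian centred on a randomly chosen
    real row of a randomly chosen class, so rows are not exact duplicates.
    """
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    labels = sorted(by_class)
    d = len(X[0])

    # Assumed bandwidth: per class and per feature, a normal-reference rule
    # scaled by the class-conditional standard deviation.
    bandwidth = {}
    for label, rows in by_class.items():
        n = len(rows)
        c = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidth[label] = [
            shrink * c * statistics.stdev([r[j] for r in rows]) for j in range(d)
        ]

    new_X, new_y = [], []
    for _ in range(n_new):
        label = rng.choice(labels)          # each class is equally likely
        seed = rng.choice(by_class[label])  # a real row used as kernel centre
        new_X.append([rng.gauss(seed[j], bandwidth[label][j]) for j in range(d)])
        new_y.append(label)
    return new_X, new_y


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical toy data: label 1 is the rare (bankrupt) class.
    X = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [1.2, 2.1], [5.0, 8.0], [5.5, 7.5]]
    y = [0, 0, 0, 0, 1, 1]
    minority = [row for row, label in zip(X, y) if label == 1]
    print(random_oversample(minority, 4, rng))  # only copies of the two real rows
    synth_X, synth_y = rose_sample(X, y, 8, rng)
    print(synth_y.count(0), synth_y.count(1))   # roughly balanced classes
```

In this sketch the duplicated rows from `random_oversample` add no new information, which is the overfitting risk the abstract mentions, whereas `rose_sample` produces a roughly balanced training set of perturbed points that could then be passed to a classifier such as an SVM.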

References

G. Menardi and N. Torelli, "Training and assessing classification rules with imbalanced data," Data Mining and Knowledge Discovery, Vol.28, No.1, pp.92-122, 2014. https://doi.org/10.1007/s10618-012-0295-5
W. H. Beaver, "Financial ratios as predictors of failure," Journal of Accounting Research, Vol.4, pp.71-111, 1966. https://doi.org/10.2307/2490171
E. I. Altman, "Financial ratios, discriminant analysis and the prediction of corporate bankruptcy," The Journal of Finance, Vol.23, No.4, pp.589-609, 1968. https://doi.org/10.1111/j.1540-6261.1968.tb00843.x
J. A. Ohlson, "Financial ratios and the probabilistic prediction of bankruptcy," Journal of Accounting Research, Vol.18, No.1, pp.109-131, 1980. https://doi.org/10.2307/2490395
M. E. Zmijewski, "Methodological issues related to the estimation of financial distress prediction models," Journal of Accounting Research, Vol.22, pp.59-82, 1984. https://doi.org/10.2307/2490859
R. O. Edmister, "An empirical test of financial ratio analysis for small business failure prediction," Journal of Financial and Quantitative Analysis, Vol.7, No.2, pp.1477-1493, 1972. https://doi.org/10.2307/2329929
M. D. Odom and R. Sharda, "A neural network model for bankruptcy prediction," In Proceedings of the International Joint Conference on Neural Networks, Vol.2, pp.163-168, 1990.
K. Y. Tam and M. Y. Kiang, "Managerial applications of neural networks: the case of bank failure predictions," Management Science, Vol.38, No.7, pp.926-947, 1992. https://doi.org/10.1287/mnsc.38.7.926
C. Serrano-Cinca, "Self-organizing neural networks for financial diagnosis," Decision Support Systems, Vol.17, No.3, pp.227-238, 1996. https://doi.org/10.1016/0167-9236(95)00033-X
J. Yang and V. Honavar, "Feature subset selection using a genetic algorithm," IEEE Intelligent Systems and their Applications, Vol.13, No.2, pp.44-49, 1998. https://doi.org/10.1109/5254.671091
김경재, 한인구, "퍼지 신경망을 이용한 기업부도예측," 지능정보연구, 제7권, 제1호, pp.135-146, 2001.
이영찬, "인공신경망과 Support Vector Machine의 기업부도예측 성과 비교," 한국지능정보시스템학회 춘계학술대회논문집, pp.211-218, 2004.
강필성, 조성준, "데이터 불균형 해결을 위한 Under-Sampling 기반 앙상블 SVMs," 대한산업공학회 춘계공동학술대회 논문집, pp.291-298, 2006.
이재동, 이지형, "데이터 불균형 문제 해결을 위한 K-means Clustering 기반 SVM앙상블 기법," 한국정보과학회 한국컴퓨터종합학술대회 논문집, pp.297-799, 2014.
김태훈, 안현철, "A Hybrid Under-sampling Approach for Better Bankruptcy Prediction," 지능정보연구, 제21권, 제2호, pp.173-190, 2015. https://doi.org/10.13088/jiis.2015.21.2.173
N. Japkowicz, "The Class Imbalance Problem: Significance and Strategies," In Proceedings of the International Conference on Artificial Intelligence, pp.111-114, 2000.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, Vol.16, pp.321-357, 2002. https://doi.org/10.1613/jair.953
이재동, 이지형, "데이터 불균형의 효과적인 학습을 위한 딥러닝 기법," 한국지능시스템학회 춘계학술대회 학술발표논문집, 제25권, 제1호, pp.113-114, 2015.
G. E. Batista, R. C. Prati, and M. C. Monard, "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," ACM SIGKDD Explorations Newsletter, Vol.6, No.1, pp.20-29, 2004. https://doi.org/10.1145/1007730.1007735
M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," Proceedings of the Fourteenth International Conference on Machine Learning, pp.179-186, 1997.
N. Lunardon, G. Menardi, and N. Torelli, ROSE: A Package for Binary Imbalanced Learning, r-project.org, 2014.
B. Efron and R. Tibshirani, An introduction to the bootstrap, Chapman and Hall, 1993.
F. E. J. Tay and L. J. Cao, "Modified support vector machines in financial time series forecasting," Neurocomputing, Vol.48, pp.847-861, 2002. https://doi.org/10.1016/S0925-2312(01)00676-2