GIR-based canonical forest: An ensemble method for imbalanced big data

Undersampling canonical forest based on the generalized imbalance ratio (GIR) for improving the classification performance of imbalanced data (GC-Forest)

  • Solji Han (Department of Statistics and Data Science, Yonsei University) ;
  • Jaesung Myung (AI advanced technology, SK hynix) ;
  • Hyunjoong Kim (Department of Statistics and Data Science, Yonsei University)
  • Received: 2024.08.01
  • Accepted: 2024.08.25
  • Published: 2024.10.31

Abstract

In the field of big data mining, the imbalanced classification problem has been actively researched for decades. Although imbalanced data issues manifest in various forms, past research focused mainly on resolving the sample size imbalance between classes. Recent studies, however, have revealed that classification performance degrades far more severely when sample size imbalance is combined with class overlap than when the size imbalance occurs alone. In response, this study introduces GC-Forest (GIR-based canonical forest), an ensemble classification method that uses a weighted resampling technique reflecting the degree of overlap between classes. At each stage of the ensemble, the method measures the imbalance ratio in terms of class overlap rather than sample counts and balances the classes by increasing the representativeness of the minority class. To further improve overall classification performance, GC-Forest adopts canonical forest as its ensemble classifier, a method designed to enhance both the accuracy and the diversity of individual classifiers. The performance of the proposed method was compared and verified through experiments on 14 real imbalanced datasets, where GC-Forest showed highly competitive classification performance in terms of AUC, PR-AUC, G-mean, and F1-score against 7 other ensemble methods.

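To make the mechanism above concrete, the base-R sketch below illustrates one plausible reading of the overlap-aware resampling step: an "effective" class size is computed from estimated posterior probabilities, the ratio of effective sizes plays the role of the GIR, and the majority class is undersampled accordingly. The exact GIR formula is given in Tang and He (2017) and the canonical forest base learner in Chen et al. (2013); neither is reproduced here, and every function name and weighting choice below is an illustrative assumption, not the authors' implementation.

```r
# Illustrative sketch only: the true GIR (Tang and He, 2017) and canonical
# forest (Chen et al., 2013) definitions are not reproduced here.

# "Effective" size of a class: the sum of estimated posterior probabilities
# that its members belong to it, so points buried in the overlap region
# count for less than cleanly separated ones.
effective_size <- function(prob, y, class) {
  sum(prob[y == class, class])
}

# One resampling round: compute an overlap-aware imbalance ratio and
# undersample the majority class. Weighting by the own-class posterior
# (one plausible choice) favors keeping well-separated majority points.
gir_undersample <- function(x, y, prob) {
  classes <- levels(y)
  eff <- sapply(classes, function(k) effective_size(prob, y, k))
  maj <- classes[which.max(eff)]
  gir <- max(eff) / min(eff)                 # overlap-aware imbalance ratio
  maj_idx <- which(y == maj)
  w <- prob[maj_idx, maj]
  n_keep <- ceiling(length(maj_idx) / gir)
  keep <- sample(maj_idx, size = n_keep, prob = w / sum(w))
  idx <- sort(c(keep, which(y != maj)))
  list(x = x[idx, , drop = FALSE], y = y[idx], gir = gir)
}

# Mechanics demo on a deliberately imbalanced slice of iris (50 vs. 10),
# with placeholder posterior estimates standing in for the predictions of
# an earlier ensemble stage.
set.seed(1)
x <- iris[1:60, 1:4]
y <- droplevels(iris$Species[1:60])
prob <- matrix(runif(120), 60, 2, dimnames = list(NULL, levels(y)))
prob <- prob / rowSums(prob)
bal <- gir_undersample(x, y, prob)
table(bal$y)
```

In a full GC-Forest-style loop, each iteration would refresh prob from the ensemble built so far, draw a balanced sample as above, and fit the next canonical forest base tree on that sample.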


Acknowledgement

Hyunjoong Kim's research was supported by the combined bachelor's-master's ICT Core Talent Training Program (IITP-2023-00259934) funded by the Ministry of Science and ICT through the Institute of Information & Communications Technology Planning & Evaluation (IITP), and by a National Research Foundation of Korea (NRF) grant (No. 2016R1D1A1B02011696).

References

  1. Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, and Herrera F (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Journal of Multiple-Valued Logic and Soft Computing, 17, 255-287.
  2. Alfaro E, Gámez M, and García N (2013). adabag: An R package for classification with boosting and bagging, Journal of Statistical Software, 54, 1-35.
  3. Anand R, Mehrotra K, Mohan C, and Ranka S (1993). An improved algorithm for neural network classification of imbalanced training sets, IEEE Transactions on Neural Networks, 4, 962-969. 
  4. Boyd K, Eng KH, and Page CD (2013). Area under the precision-recall curve: Point estimates and confidence intervals. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III (pp. 451-466), Springer, Berlin.
  5. Buda M, Maki A, and Mazurowski MA (2018). A systematic study of the class imbalance problem in convolutional neural networks, Neural Networks, 106, 249-259. 
  6. Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357.
  7. Chen YC, Ha H, Kim H, and Ahn H (2013). Canonical forest, Computational Statistics, 29, 849-867. 
  8. Cheng F, Zhang J, Wen C, Liu Z, and Li Z (2017). Large cost-sensitive margin distribution machine for imbalanced data classification, Neurocomputing, 224, 45-57. 
  9. Fan W, Stolfo S, Zhang J, and Chan P (1999). AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML'99), San Francisco, CA, USA, 97-105.
  10. Fernández A, García S, Herrera F, and Chawla NV (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary, Journal of Artificial Intelligence Research, 61, 863-905.
  11. García V, Sánchez J, and Mollineda R (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In Rueda L, Mery D, and Kittler J (Eds), Progress in Pattern Recognition, Image Analysis and Applications (pp. 397-406), Springer, Berlin, Heidelberg.
  12. Gong J and Kim H (2017). RHSBoost: Improving classification performance in imbalance data, Computational Statistics and Data Analysis, 111, 1-13.
  13. He H, Bai Y, Garcia EA, and Li S (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1322-1328.
  14. Huang J and Ling CX (2005). Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering, 17, 299-310.
  15. Japkowicz N (2003). Class imbalances: Are we focusing on the right issue, Workshop on Learning from Imbalanced Data Sets II, 1723, 63. 
  16. Jo T and Japkowicz N (2004). Class imbalances versus small disjuncts, SIGKDD Explorations Newsletter, 6, 40-49. 
  17. Liaw A and Wiener M (2002). Classification and regression by randomForest, R News, 2, 18-22.
  18. Lichman M (2013). UCI machine learning repository, Available from: http://archive.ics.uci.edu/ml 
  19. Lunardon N, Menardi G, and Torelli N (2014). ROSE: A package for binary imbalanced learning, R Journal, 6, 79-89.
  20. López V, Fernández A, García S, Palade V, and Herrera F (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences, 250, 113-141.
  21. Manning CD, Raghavan P, and Schütze H (2008). Introduction to Information Retrieval, Cambridge University Press, Cambridge, England.
  22. Menardi G and Torelli N (2014). Training and assessing classification rules with imbalanced data, Data Mining and Knowledge Discovery, 28, 92-122. 
  23. R Core Team (2022). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. 
  24. Rayhan F, Ahmed S, Mahbub A, Jani R, Shatabda S, and Farid DM (2017). CUSBoost: Cluster-based under-sampling with boosting for imbalanced classification. In Proceedings of 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), Bengaluru, 1-5.
  25. Ridgeway G and GBM Developers (2024). gbm: Generalized Boosted Regression Models, R package version 2.1.9, Available from: https://CRAN.R-project.org/package=gbm
  26. Seiffert C, Khoshgoftaar TM, Van Hulse J, and Napolitano A (2010). RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 40, 185-197.
  27. Sun Y, Wong AKC, and Kamel MS (2009). Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence, 23, 687-719. 
  28. Tang B and He H (2017). GIR-based ensemble sampling approaches for imbalanced learning, Pattern Recognition, 71, 306-319.
  29. Therneau T, Atkinson B, and Ripley B (2015). rpart: Recursive partitioning and regression trees, R package version 4.1-15, Available from: https://cran.r-project.org/web/packages/rpart/rpart.pdf
  30. Vuttipittayamongkol P, Elyan E, and Petrovski A (2021). On the class overlap problem in imbalanced data classification, Knowledge-Based Systems, 212, 106631.