클래스 불균형 문제에서 베이지안 알고리즘의 학습 행위 분석

Learning Behavior Analysis of Bayesian Algorithm Under Class Imbalance Problems

  • Doo-Sung Hwang (Department of Computer Science, Dankook University)
  • Published: 2008.11.25

Abstract

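This paper analyzes how the Bayesian algorithm behaves when learning from imbalanced data and compares performance evaluation methods. Assuming a prior data distribution, Bayesian learning was carried out on classification problems generated according to the imbalanced data ratio and the classification complexity. The experimental results were analyzed by imbalanced data ratio and classification complexity, using the AUC (Area Under the Curve) computed for the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) evaluation methods. In the comparative analysis, the imbalance ratio affected Bayesian learning, consistent with previously reported studies, and the data overlap that arises from high classification complexity was confirmed as a factor that hinders learning performance. Under high classification complexity and a high imbalanced data ratio, the AUC of the PR evaluation showed larger differences in learning performance than the AUC of the ROC evaluation. For problems with low classification complexity and a low imbalanced data ratio, however, the difference in learning performance between the two measures was negligible, or the two were similar. These results suggest that the AUC of the PR evaluation can help in designing learning models for class imbalance problems and in selecting the optimal learner when misclassification costs are considered.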

In this paper we analyse the behavior of the Bayesian algorithm in learning class imbalance problems and compare performance evaluation methods. The learning performance of the Bayesian algorithm is evaluated over class imbalance problems generated from an assumed prior data distribution, the imbalanced data rate, and the discrimination complexity. The experimental results are computed as the AUC (Area Under the Curve) values of both the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) evaluation measures and compared according to the imbalanced data rate and the discrimination complexity. In the comparison and analysis, the Bayesian algorithm suffers from the imbalance rate, in agreement with previously reported research, and the data overlap caused by discrimination complexity is another factor that hampers learning performance. As the discrimination complexity and class imbalance rate of the problems increase, the learning performance measured by the AUC of the PR measure varies much more than that measured by the AUC of the ROC measure. When the discrimination complexity and class imbalance rate are low, however, the two measures report similar performance. The experimental results show that the AUC of the PR measure is more appropriate for evaluating learning on class imbalance problems and, furthermore, is useful in designing the optimal learning model when a misclassification cost is considered.
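The comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's experimental setup: the one-dimensional Gaussian classes, the sample sizes, and the use of the gap between class means as a stand-in for discrimination complexity are all assumptions made for illustration, as are the function names. The sketch fits a Gaussian Bayes classifier, scores test points by the posterior probability of the minority class, and reports the ROC AUC (rank-based) and the PR AUC (estimated by average precision) for several imbalance ratios and overlaps.

```python
import math
import random


def make_problem(n_total, pos_fraction, mean_gap, rng):
    """Assumed toy problem: negatives ~ N(0, 1), positives ~ N(mean_gap, 1)."""
    n_pos = max(2, int(n_total * pos_fraction))
    n_neg = n_total - n_pos
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    xs += [rng.gauss(mean_gap, 1.0) for _ in range(n_pos)]
    ys = [0] * n_neg + [1] * n_pos
    return xs, ys


def fit_gaussian_bayes(xs, ys):
    """Estimate the prior, mean, and variance of each class."""
    model = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)
        model[label] = (len(vals) / len(xs), mu, var)
    return model


def posterior_positive(model, x):
    """P(y = 1 | x) by Bayes' rule, computed in log space for stability."""
    log_joint = {}
    for label, (prior, mu, var) in model.items():
        log_joint[label] = (math.log(prior)
                            - 0.5 * math.log(2 * math.pi * var)
                            - (x - mu) ** 2 / (2 * var))
    m = max(log_joint.values())
    den = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / den


def roc_auc(scores, ys):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_auc(scores, ys):
    """Average precision, a step-wise estimate of the area under the PR curve.

    Tied scores are ordered arbitrarily by the sort.
    """
    ranked = sorted(zip(scores, ys), key=lambda t: -t[0])
    n_pos = sum(ys)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank  # precision at this positive's rank
    return total / n_pos


def evaluate(pos_fraction, mean_gap, seed=0, n_train=2000, n_test=2000):
    """Train on one sample, score an independent sample, return both AUCs."""
    rng = random.Random(seed)
    train_x, train_y = make_problem(n_train, pos_fraction, mean_gap, rng)
    test_x, test_y = make_problem(n_test, pos_fraction, mean_gap, rng)
    model = fit_gaussian_bayes(train_x, train_y)
    scores = [posterior_positive(model, x) for x in test_x]
    return roc_auc(scores, test_y), pr_auc(scores, test_y)


if __name__ == "__main__":
    for pos_fraction in (0.5, 0.1, 0.01):      # minority-class share
        for mean_gap in (3.0, 1.0):            # smaller gap = more overlap
            roc, pr = evaluate(pos_fraction, mean_gap)
            print(f"pos={pos_fraction:<5} gap={mean_gap}: "
                  f"ROC AUC={roc:.3f}  PR AUC={pr:.3f}")
```

Running the script prints one line per combination, so the two measures can be read side by side as the minority share shrinks and the class overlap grows. A single seed gives only one draw; averaging over several seeds would give a steadier picture.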
