Accelerating the EM Algorithm through Selective Sampling for Naive Bayes Text Classifier

A Selective-Sampling-Based EM Acceleration Algorithm for Naive Bayes Document Classification Systems

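  • Jae-Young Chang (Department of Computer Engineering, Hansung University)
  • Han-Joon Kim (School of Electrical and Computer Engineering, University of Seoul)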
  • Published : 2006.06.01

Abstract

This paper presents a new method that significantly improves a conventional Bayesian statistical text classifier by incorporating an accelerated EM (Expectation Maximization) algorithm. The EM algorithm suffers from slow convergence and degraded performance during its iterative process, especially when real online text documents do not follow EM's assumptions. In this study, we propose a new accelerated EM algorithm with uncertainty-based selective sampling, which is simple yet converges quickly and allows a more accurate classification model to be estimated for a Naive Bayes text classifier. Experiments on the popular Reuters-21578 document collection showed that the proposed algorithm effectively improves classification accuracy.

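This paper proposes a method that incorporates an accelerated EM (Expectation Maximization) algorithm to improve the classification performance of a conventional Bayesian statistical text classification system in an online electronic document environment. One of the important problems for machine-learning-based text classification systems is securing high-quality training documents. The EM algorithm is used to improve the performance of the Bayesian text classification algorithm when only a small set of training documents is available. However, the EM algorithm exhibits slow convergence and performance degradation during optimization, particularly in online electronic document environments that do not follow its basic assumptions. The main idea of the proposed technique is to apply uncertainty-based selective sampling to improve the conventional EM algorithm. For performance evaluation, we use the Reuters-21578 document collection to show that the proposed algorithm converges quickly and improves the classification accuracy of the conventional Bayesian algorithm.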
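
The abstract names the ingredients (a Naive Bayes classifier, EM over unlabeled documents, and uncertainty-based selective sampling) but not the exact procedure, so the following is only a minimal sketch under stated assumptions and not the authors' algorithm. It assumes a multinomial Naive Bayes model with Laplace smoothing, measures uncertainty as the entropy of the class posterior, and lets only unlabeled documents whose entropy is at or below a threshold take part in each M-step. The function names, the `max_entropy` threshold, and the toy data are all hypothetical.

```python
import math

def train_nb(docs, weights, n_classes, vocab_size, alpha=1.0):
    """M-step: fit multinomial Naive Bayes from (possibly soft) class weights.
    docs: list of {word_id: count}; weights: one class-probability list per doc."""
    class_mass = [0.0] * n_classes
    word_mass = [[0.0] * vocab_size for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for word, count in doc.items():
                word_mass[c][word] += w[c] * count
    total = sum(class_mass)
    log_prior = [math.log((m + alpha) / (total + alpha * n_classes))
                 for m in class_mass]
    log_word = []
    for c in range(n_classes):
        denom = sum(word_mass[c]) + alpha * vocab_size
        log_word.append([math.log((m + alpha) / denom) for m in word_mass[c]])
    return log_prior, log_word

def posterior(doc, model):
    """E-step for one document: class posterior P(c | doc)."""
    log_prior, log_word = model
    scores = [lp + sum(count * lw[word] for word, count in doc.items())
              for lp, lw in zip(log_prior, log_word)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy (nats) of a posterior; higher means more uncertain."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def selective_em(labeled, labels, unlabeled, n_classes, vocab_size,
                 max_entropy=0.5, n_iter=10):
    """EM in which only low-uncertainty unlabeled documents enter the M-step.
    The selection rule and threshold are assumptions made for illustration."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab_size)
    for _ in range(n_iter):
        posts = [posterior(d, model) for d in unlabeled]           # E-step
        keep = [i for i, p in enumerate(posts) if entropy(p) <= max_entropy]
        docs = labeled + [unlabeled[i] for i in keep]
        weights = hard + [posts[i] for i in keep]
        model = train_nb(docs, weights, n_classes, vocab_size)     # M-step
    return model

# Toy corpus (hypothetical): words 0-1 suggest class 0, words 2-3 class 1.
labeled = [{0: 2, 1: 1}, {2: 2, 3: 1}]
labels = [0, 1]
unlabeled = [{0: 1, 1: 2}, {2: 1, 3: 2}, {1: 1, 2: 1}]
model = selective_em(labeled, labels, unlabeled, n_classes=2, vocab_size=4)
```

In this toy run the third unlabeled document mixes words from both classes, so its posterior stays near uniform and the entropy filter keeps it out of the parameter estimates. Other uncertainty measures or selection rules could be substituted in `selective_em` without changing the rest of the sketch.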
