http://dx.doi.org/10.3745/KIPSTD.2006.13D.3.369

Accelerating the EM Algorithm through Selective Sampling for Naive Bayes Text Classifier  

Chang Jae-Young (Department of Computer Engineering, Hansung University)
Kim Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
Abstract
This paper presents a new method that significantly improves a conventional Bayesian statistical text classifier by incorporating an accelerated EM (Expectation-Maximization) algorithm. The EM algorithm suffers from slow convergence and degraded performance in its iterative process, especially when real online textual documents do not follow EM's assumptions. In this study, we propose a new accelerated EM algorithm with uncertainty-based selective sampling, which is simple yet converges quickly and allows a more accurate classification model to be estimated for the Naive Bayes text classifier. Experiments using the popular Reuters-21578 document collection showed that the proposed algorithm effectively improves classification accuracy.
Keywords
Text Classification; Machine Learning; EM Algorithm; Naive Bayes; Uncertainty; Selective Sampling
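
The abstract names the ingredients (a Naive Bayes text classifier, EM over unlabeled documents, and uncertainty-based selective sampling) but does not state the uncertainty measure or the selection rule. The following is only a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a multinomial Naive Bayes model with Laplace smoothing, measures uncertainty as the entropy of the class posterior, and lets only unlabeled documents at or below a threshold contribute to each M-step. The function names, the `max_entropy` parameter, and its default value are all hypothetical.

```python
import math


def train_nb(docs, weights, classes, vocab):
    """M-step: weighted multinomial Naive Bayes with Laplace smoothing.

    docs is a list of token lists; weights is a parallel list of
    {class: weight} dicts (hard labels use weight 1.0).
    """
    prior = {c: 1.0 for c in classes}
    word = {c: {w: 1.0 for w in vocab} for c in classes}
    for tokens, wts in zip(docs, weights):
        for c, p in wts.items():
            prior[c] += p
            for t in tokens:
                word[c][t] += p
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        z = sum(word[c].values())
        log_word[c] = {w: math.log(n / z) for w, n in word[c].items()}
    return log_prior, log_word


def posterior(tokens, model, classes):
    """E-step for one document: P(class | tokens) under the current model."""
    log_prior, log_word = model
    scores = {
        c: log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a class posterior."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def selective_em(labeled, unlabeled, classes, max_entropy=0.65, iterations=5):
    """EM in which only low-uncertainty unlabeled documents feed the M-step.

    ASSUMPTION: entropy as the uncertainty measure and a fixed threshold
    are illustrative choices; the abstract does not specify either.
    """
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    lab_wts = [{y: 1.0} for _, y in labeled]
    model = train_nb(lab_docs, lab_wts, classes, vocab)
    for _ in range(iterations):
        posts = [posterior(d, model, classes) for d in unlabeled]
        keep = [(d, p) for d, p in zip(unlabeled, posts) if entropy(p) <= max_entropy]
        model = train_nb(
            lab_docs + [d for d, _ in keep],
            lab_wts + [p for _, p in keep],
            classes,
            vocab,
        )
    return model


def classify(tokens, model, classes):
    """Return the most probable class for a tokenized document."""
    post = posterior(tokens, model, classes)
    return max(post, key=post.get)
```

In this sketch the threshold is expressed in nats, so for a two-class problem it must lie below ln 2 (about 0.693) to exclude anything; whether the selected documents should be the least or the most uncertain ones is a design choice the abstract leaves open.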