[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KIPSTD.2006.13D.3.369

Accelerating the EM Algorithm through Selective Sampling for Naive Bayes Text Classifier

Chang Jae-Young (Department of Computer Engineering, Hansung University)
Kim Han-Joon (School of Electrical and Computer Engineering, University of Seoul)

Publication Information

The KIPS Transactions: Part D / v.13D, no.3, 2006, pp. 369-376

Abstract

This paper presents a new method that significantly improves the conventional Bayesian statistical text classifier by incorporating an accelerated EM (Expectation-Maximization) algorithm. The EM algorithm suffers from slow convergence and degraded performance during its iterative process, especially when real online textual documents do not follow EM's assumptions. In this study, we propose a new accelerated EM algorithm with uncertainty-based selective sampling, which is simple yet converges quickly and allows a more accurate classification model to be estimated for the Naive Bayes text classifier. Experiments on the popular Reuters-21578 document collection showed that the proposed algorithm effectively improves classification accuracy.
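The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: semi-supervised Naive Bayes trained with EM, where each iteration feeds only a selected subset of unlabeled documents into the M-step. It is not the authors' method. The uncertainty measure (entropy of the class posterior), the fixed threshold `max_entropy`, the choice to keep low-uncertainty documents, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: multinomial Naive Bayes from soft-labeled documents.
    docs: list of token lists; weights: list of {class: probability}."""
    prior = {c: alpha for c in classes}           # Laplace-smoothed class mass
    counts = {c: Counter() for c in classes}      # fractional word counts
    for tokens, w in zip(docs, weights):
        for c in classes:
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_word[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_word

def posterior(tokens, model, classes):
    """E-step for one document: P(class | tokens), computed in log space."""
    log_prior, log_word = model
    scores = {c: log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
              for c in classes}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

def entropy(p):
    """Shannon entropy (nats) of a class posterior; used here as uncertainty."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def selective_em(labeled, unlabeled, classes, max_entropy=0.5, iters=10):
    """EM in which only unlabeled documents whose posterior entropy is at most
    max_entropy (an assumed selection rule) contribute to the M-step."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    l_docs = [tokens for tokens, _ in labeled]
    l_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(l_docs, l_w, classes, vocab)  # initialize from labeled data only
    for _ in range(iters):
        post = [posterior(tokens, model, classes) for tokens in unlabeled]
        keep = [(d, p) for d, p in zip(unlabeled, post) if entropy(p) <= max_entropy]
        model = train_nb(l_docs + [d for d, _ in keep],
                         l_w + [p for _, p in keep], classes, vocab)
    return model

# Toy data, invented for this sketch.
classes = ["energy", "grain"]
labeled = [(["oil", "price"], "energy"), (["wheat", "crop"], "grain")]
unlabeled = [["oil", "barrel"], ["wheat", "harvest"], ["oil", "wheat"]]
model = selective_em(labeled, unlabeled, classes, max_entropy=0.65)
barrel_posterior = posterior(["barrel"], model, classes)
```

In this toy run the ambiguous document `["oil", "wheat"]` stays above the entropy threshold and is left out of every M-step, while the word "barrel", which appears only in unlabeled text, still ends up associated with the "energy" class. A real implementation would need a principled threshold or ranking scheme, which the abstract does not specify.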

Keywords

Text Classification; Machine Learning; EM Algorithm; Naive Bayes; Uncertainty; Selective Sampling

