An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems  

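김제욱 (Technology Research Institute, Daewoo Information Systems)
김한준 (School of Computer Science and Engineering, College of Engineering, Seoul National University)
이상구 (School of Computer Science and Engineering, College of Engineering, Seoul National University)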
Abstract
There are two important problems in improving text classification systems based on machine learning approaches. The first, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves both problems in the context of Naive Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance among the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields a perturbation effect that makes the boosting algorithm work properly. Through an empirical experiment using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves the Naive Bayes-based classification system more significantly than other conventional learning methods.
Keywords
composing training document set; Naive Bayes text classifier; boosting algorithm; uncertainty-based sampling algorithm; AdaBUS algorithm
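
As a rough illustration of the uncertainty-based sampling idea behind the selection problem, the sketch below trains a small multinomial Naive Bayes model and picks the unlabeled document whose top class posterior is lowest. It is not the AdaBUS algorithm, whose details are not given in this abstract. The function names (`train_nb`, `posterior`, `most_uncertain`), the Laplace smoothing, the "lowest top posterior" uncertainty measure, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def posterior(model, tokens):
    """Return P(class | tokens) for every class, using Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(n_c / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def most_uncertain(model, pool):
    """Index of the pool document whose highest class posterior is lowest."""
    return min(
        range(len(pool)),
        key=lambda i: max(posterior(model, pool[i]).values()),
    )


# Hypothetical toy data: two labeled documents and a pool of unlabeled ones.
labeled_docs = [["wheat", "grain", "export"], ["bank", "rate", "loan"]]
labels = ["grain", "money"]
pool = [
    ["wheat", "grain", "harvest"],
    ["bank", "loan", "interest"],
    ["grain", "bank"],
]

model = train_nb(labeled_docs, labels)
pick = most_uncertain(model, pool)
print(pick, posterior(model, pool[pick]))
```

In this toy pool the third document mixes one word from each class, so its posterior is split evenly and it is the one selected for labeling. In an active learning loop the selected document would be labeled, added to the training set, and the model retrained before the next selection.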