[KSCI] Korea Science Citation Index Service

An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems

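김제욱 (Technology Research Institute, Daewoo Information Systems)
김한준 (School of Computer Science and Engineering, College of Engineering, Seoul National University)
이상구 (School of Computer Science and Engineering, College of Engineering, Seoul National University)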

Publication Information

Journal of KIISE: Software and Applications / v.29, no.12, 2002, pp. 966-978

Abstract

There are two important problems in improving text classification systems based on machine learning. The first, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former problem is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves both problems in the context of Naive Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance among the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields a perturbation effect that makes the boosting algorithm work properly. Through an empirical experiment using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves the Naive Bayes-based classification system more significantly than other conventional learning methods.
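
The "selection problem" can be illustrated with a minimal sketch of uncertainty-based sampling on top of a small multinomial Naive Bayes classifier. This is not the AdaBUS algorithm from the paper, whose details the abstract does not give. The function names (`train_nb`, `posterior`, `most_uncertain`), the toy documents, and the choice of "lowest top posterior" as the uncertainty measure are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (hypothetical helper, Laplace smoothing applied later)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    return classes, vocab, prior, counts

def posterior(model, doc):
    """Return P(class | doc) for every class, computed in log space."""
    classes, vocab, prior, counts = model
    log_scores = {}
    for c in classes:
        total = sum(counts[c].values())
        s = math.log(prior[c])
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                s += math.log((counts[c][w] + 1) / (total + len(vocab)))
        log_scores[c] = s
    m = max(log_scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(v - m) for c, v in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def most_uncertain(model, pool):
    """Pick the unlabeled document whose most probable class has the lowest posterior."""
    return min(pool, key=lambda d: max(posterior(model, d).values()))

# Toy data (assumed for illustration only).
labeled = ["stock market profit", "football match goal"]
labels = ["biz", "sport"]
pool = ["stock market rally", "market match report", "football goal scored"]

model = train_nb(labeled, labels)
query = most_uncertain(model, pool)
print(query)  # the document to send to a human labeler next
```

In this toy run, "market match report" shares one known word with each class, so its posterior is split evenly and it is the document selected for labeling. An active learner would add it, with its true label, to the training set and retrain.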

Keywords

composing training document set; Naive Bayes text classifier; boosting algorithm; uncertainty-based sampling algorithm; AdaBUS algorithm

Citations & Related Records

Times Cited By KSCI: 3

Reference
Cited By KSCI

1	Tom M. Mitchell. Machine Learning. McGraw-Hill International Editions, chapter 6, 1997
2	R. Agrawal, R. Bayardo, and R. Srikant. Athena: Mining-based Interactive Management of Text Databases. In Proceedings of the 7th International Conference on Extending Database Technology, pages 365-379, 2000
3	Pedro Domingos and Michael Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proceedings of the 13th International Conference on Machine Learning, pages 105-112, 1996
4	김제욱, 김한준, 이상구. A Study on an Incremental Learning Model for Naive Bayes Document Classifiers (in Korean). Journal of Information Technology and Database, 8(1), pages 95-104, 2001
5	Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148-156, 1996
6	David D. Lewis and Jason Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the 11th International Conference on Machine Learning, pages 148-156, 1994
7	M. Tresch, N. Palmer, and A. Luniewski. Type Classification of Semi-structured Documents. In Proceedings of the 21st ACM SIGMOD International Conference on Management of Data, 1995
8	Yoav Freund and Robert E. Schapire. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), pages 119-139, 1997
9	David D. Lewis and William A. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994
10	Robert E. Schapire and Yoram Singer. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2), pages 135-168, 2000
11	J. R. Quinlan. Bagging, Boosting, and C4.5. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 725-730, 1996
12	Leo Breiman. Arcing Classifiers. The Annals of Statistics, 26(3), pages 801-849, 1998
13	Robert E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2), pages 197-227, 1990
14	Zijian Zheng. Naive Bayesian Classifier Committees. In Proceedings of European Conference on Machine Learning, pages 196-207, 1998
15	Robert E. Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3), pages 297-336, 1999
16	Ron Kohavi and David H. Wolpert. Bias Plus Variance Decomposition for Zero-One Loss Functions. In Proceedings of the 13th International Conference on Machine Learning, pages 275-283, 1996
17	Kai Ming Ting and Zijian Zheng. Improving the Performance of Boosting for Naive Bayesian Classification. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1999
18	Yiming Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412-420, 1997
19	Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1), pages 67-88, 1999

1	Reinforcement Method for Automated Text Classification using Post-processing and Training with Definition Criteria / Choi, Yun-Jeong; Park, Seung-Soo / The KIPS Transactions: Part B
2	A performance improvement methodology of web document clustering using FDC-TCT / Ko, Suc-Bum; Youn, Sung-Dae / The KIPS Transactions: Part D
3	Design and Implementation of Text Classification System based on ETOM+RPost / Choi, Yun-Jeong / Journal of the Korea Academia-Industrial cooperation Society


Korean title: 베이지언 문서분류시스템을 위한 능동적 학습 기반의 학습문서집합 구성방법