
Text Filtering using Iterative Boosting Algorithms  

Hahn, Sang-Youn (Department of Computer Science, University of Washington)
Zhang, Byoung-Tak (School of Computer Science and Engineering, Seoul National University)
Abstract
Text filtering is the task of deciding whether a document is relevant to a specified topic. As the Internet and the Web become widespread and the number of documents delivered by e-mail grows explosively, the importance of text filtering increases as well. The aim of this paper is to improve the accuracy of text filtering systems by using machine learning techniques. We apply AdaBoost algorithms to the filtering task. An AdaBoost algorithm generates and combines a series of simple hypotheses, each of which decides the relevance of a document to a topic on the basis of whether or not the document contains a certain word. We begin with an existing AdaBoost algorithm that uses weak hypotheses whose outputs are 1 or -1. We then extend the algorithm to use weak hypotheses with real-valued outputs, which were proposed recently to improve error reduction rates and final filtering performance. Next, we attempt to improve AdaBoost's performance further by setting the initial weights randomly according to the continuous Poisson distribution, running AdaBoost, repeating these steps several times, and then combining all the hypotheses learned. This mitigates the overfitting problem that may occur when learning from a small amount of data. Experiments were performed on the real document collections used in TREC-8, a well-established text retrieval contest; this dataset includes Financial Times articles from 1992 to 1994. The experimental results show that AdaBoost with real-valued hypotheses outperforms AdaBoost with binary-valued hypotheses, and that AdaBoost iterated with random weights further improves filtering accuracy. Comparative results for all participants in the TREC-8 filtering task are also provided.
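To make the weak-hypothesis and reweighting steps concrete, the following is a minimal Python sketch (not the authors' code) of confidence-rated AdaBoost over word-presence stumps, together with the iterated variant that restarts boosting from random initial weights. All function names are illustrative, and the Exp(1) draw is only an assumed stand-in for the paper's "continuous Poisson" weight initialization.

```python
import numpy as np

EPS = 1e-10  # smoothing term to avoid division by zero in the log-ratio

def train_stump(X, y, w):
    """Pick the word whose presence/absence split minimises the normalisation
    factor Z of confidence-rated AdaBoost (Schapire & Singer, 1999).
    X: (n_docs, n_words) binary term-presence matrix, y: labels in {-1, +1},
    w: normalised document weights. Returns (word index, c_absent, c_present)."""
    best = (None, 0.0, 0.0, np.inf)
    for j in range(X.shape[1]):
        preds, z = [], 0.0
        for present in (0, 1):
            block = X[:, j] == present
            w_pos = w[block & (y == 1)].sum()
            w_neg = w[block & (y == -1)].sum()
            preds.append(0.5 * np.log((w_pos + EPS) / (w_neg + EPS)))
            z += 2.0 * np.sqrt(w_pos * w_neg)
        if z < best[3]:
            best = (j, preds[0], preds[1], z)
    return best[:3]

def adaboost_real(X, y, rounds, init_w=None):
    """Confidence-rated AdaBoost over word-presence stumps.
    Returns a list of (word index, c_absent, c_present) weak hypotheses."""
    w = np.ones(len(y)) if init_w is None else init_w.astype(float).copy()
    hypotheses = []
    for _ in range(rounds):
        w = w / w.sum()                       # normalise weights to a distribution
        j, c0, c1 = train_stump(X, y, w)
        h = np.where(X[:, j] == 1, c1, c0)    # real-valued weak predictions
        w = w * np.exp(-y * h)                # up-weight misclassified documents
        hypotheses.append((j, c0, c1))
    return hypotheses

def predict(hypotheses, X):
    """Combined filter: a document is relevant iff the summed score is positive."""
    score = np.zeros(X.shape[0])
    for j, c0, c1 in hypotheses:
        score += np.where(X[:, j] == 1, c1, c0)
    return np.sign(score)

def iterated_adaboost(X, y, rounds, repeats, rng=None):
    """Iterated boosting: rerun AdaBoost from random initial weights and pool
    all weak hypotheses. Exp(1) is used here as one reading of the paper's
    'continuous Poisson' initialisation (an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = []
    for _ in range(repeats):
        init_w = rng.exponential(scale=1.0, size=len(y))
        pooled += adaboost_real(X, y, rounds, init_w=init_w)
    return pooled
```

As a usage sketch, with X a binary document-term matrix and y in {-1, +1}, `predict(iterated_adaboost(X, y, rounds=50, repeats=5), X_test)` would label test documents as relevant (+1) or not (-1); the round and repeat counts here are arbitrary, not the values used in the paper's experiments.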
Keywords
TREC; Text Filtering; AdaBoost; Iterative Boosting Algorithms
References
1 G. Salton, M. J. McGill, 'Introduction to Modern Information Retrieval,' McGraw-Hill, 1983
2 R. E. Schapire, Y. Freund, P. Bartlett, W. S. Lee, 'Boosting the margin: A new explanation for the effectiveness of voting methods,' The Annals of Statistics, Vol. 26, No. 5, pp. 1651-1686, 1998
3 R. E. Schapire, Y. Singer, and A. Singhal, 'Boosting and Rocchio applied to text filtering,' Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215-223, 1998
4 R. E. Schapire, Y. Singer, 'Improved boosting algorithms using confidence-rated predictions,' Machine Learning, Vol. 37, No. 3, pp. 297-336, 1999
5 L. Breiman, 'Bias, variance and arcing classifiers,' Technical Report 460, Department of Statistics, University of California, Berkeley, CA, 1996
6 R. E. Schapire, 'The strength of weak learnability,' Machine Learning, Vol. 5, No. 2, pp. 197-227, 1990
7 G. I. Webb, 'MultiBoosting: A technique for combining Boosting and Wagging,' Machine Learning, to appear, 2000
8 D. Hull, 'The TREC-7 filtering track: Description and analysis,' Proceedings of the 7th Text REtrieval Conference (TREC-7), pp. 33-56, 1998
9 E. Bauer and R. Kohavi, 'An empirical comparison of voting classification algorithms: bagging, boosting, and variants,' Machine Learning, Vol. 36, No. 1, pp. 105-139, 1999
10 L. Breiman, 'Bagging predictors,' Machine Learning, Vol. 24, No. 2, pp. 123-140, 1996
11 S. Haykin, 'Neural Networks,' Prentice-Hall, 1999
12 Y. Freund, R. E. Schapire, 'Experiments with a new boosting algorithm,' Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156, 1996