http://dx.doi.org/10.3745/KIPSTB.2004.11B.4.501

Performance Improvement by a Virtual Documents Technique in Text Categorization  

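Lee, Kyung-Soon (Division of Electronics and Information Engineering, Chonbuk National University)
An, Dong-Un (Division of Electronics and Information Engineering, Chonbuk National University)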
Abstract
This paper proposes a virtual relevant document technique for the learning phase of text categorization. The method applies a simple transformation to relevant documents: it creates virtual documents by combining pairs of documents in the training set. A virtual document produced this way has an enriched term vector, with greater weights for the terms that co-occur in the two relevant documents. The experimental results showed a significant improvement over the baseline, supporting the usefulness of the proposed method: a 71% improvement on the TREC-11 filtering test collection and an 11% improvement on the Reuters-21578 test set for topics with fewer than 100 relevant documents, measured in micro-averaged F1. Analysis of the results indicates that adding virtual relevant documents contributes to a steady improvement in performance.
Keywords
Virtual Relevant Document; Discourse Unit; Support Vector Machine; Text Categorization
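
The pairwise construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule or term weighting, so the sketch assumes that a pair is combined by summing raw term-frequency vectors; the names `term_vector` and `virtual_documents` are hypothetical and not taken from the paper. Under that assumption, a term that occurs in both documents of a pair receives a larger weight than a term that occurs in only one.

```python
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Raw term-frequency vector for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def virtual_documents(relevant_docs):
    """Build one virtual document per pair of relevant training documents.

    Assumption (not from the paper): a pair is combined by summing the two
    term-frequency vectors, so terms shared by both documents are weighted
    more heavily than terms found in only one of them.
    """
    vectors = [term_vector(doc) for doc in relevant_docs]
    return [a + b for a, b in combinations(vectors, 2)]


if __name__ == "__main__":
    relevant = ["oil price rises", "oil export falls", "crude oil price"]
    for vdoc in virtual_documents(relevant):
        print(dict(vdoc))
```

In a setting like the one the abstract describes, the resulting vectors would presumably be added to the training set as extra positive examples before the classifier is trained; that step is omitted here because the abstract does not specify it.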