Developing a Text Categorization System Based on Unsupervised Learning Using an Information Retrieval Technique  

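Noh, Dae-Wook (School of Information and Communication Engineering, Yonsei University)
Lee, Soo-Yong (School of Information and Communication Engineering, Yonsei University)
Ra, Dong-Yul (School of Information and Communication Engineering, Yonsei University)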
Abstract
Building a text classifier with supervised learning requires a large manually labeled corpus, which takes considerable time and human effort to produce. Recently, a research paradigm was proposed that uses a raw corpus and a small amount of seed information instead of a manually labeled corpus. In this paper we introduce an unsupervised learning method that achieves better performance than related work. The key characteristic of our approach is that average mutual information is used to learn representative words and their weights, and the weights are then updated using a technique inspired by work in information retrieval. By iterating this learning process, we show that a high-performance system can be developed.
Keywords
Text classification; unsupervised learning; representative words; mutual information; information retrieval;
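
The abstract names average mutual information as the measure used to learn representative words. As a rough illustration only, and not the authors' implementation, the sketch below computes the average mutual information between a word's presence in a document and a document's membership in a category, from a 2x2 table of document counts. The function name, the count-table layout, and the idea that category membership comes from provisional (seed-based) labels are assumptions made for this example; the weight-update step is not shown because the abstract does not specify it.

```python
import math


def average_mutual_information(n11, n10, n01, n00):
    """Return I(W; C) in bits from a 2x2 table of document counts.

    Hypothetical layout (an assumption for this sketch):
      n11: documents in the category that contain the word
      n10: documents outside the category that contain the word
      n01: documents in the category that lack the word
      n00: documents outside the category that lack the word
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        return 0.0

    # Joint counts keyed by (word present?, in category?).
    joint = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    word_marginal = {1: n11 + n10, 0: n01 + n00}
    cat_marginal = {1: n11 + n01, 0: n10 + n00}

    mi = 0.0
    for (w, c), count in joint.items():
        if count == 0:
            continue  # a zero cell contributes nothing (0 * log 0 := 0)
        p_wc = count / total
        p_w = word_marginal[w] / total
        p_c = cat_marginal[c] / total
        mi += p_wc * math.log2(p_wc / (p_w * p_c))
    return mi
```

Under these assumptions, a word that appears in exactly the documents of a category covering half the corpus scores 1 bit, while a word distributed independently of the category scores 0, so ranking candidate words by this value is one plausible way to select representative words per category.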