[KSCI] Korea Science Citation Index Service

Automatic Text Categorization based on Semi-Supervised Learning

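Ko, Young-Joong (Department of Computer Engineering, Dong-A University)
Seo, Jung-Yun (Department of Computer Science / Interdisciplinary Program of Integrated Biotechnology, Sogang University)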

Publication Information

Journal of KIISE: Software and Applications / v.35, no.5, 2008, pp. 325-334

Abstract

The goal of text categorization is to assign documents to a fixed set of predefined categories. Previous studies in this area have relied on supervised learning with a large number of labeled training documents. The difficulty is that such labeled documents are expensive to create: unlabeled documents are easy to collect, but categorizing them manually to build a training set is not. In this paper, we propose a new text categorization method based on semi-supervised learning. The proposed method uses only unlabeled documents and a set of keywords for each category, and automatically constructs training data from them. A text classifier is then trained on this data and used to classify documents. The proposed method achieves performance comparable to that of traditional supervised learning methods. It can therefore be used where low-cost text categorization is needed, and also to create labeled training documents.
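The pipeline outlined in the abstract (keywords plus unlabeled documents, then automatically built training data, then a trained classifier) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes a simple keyword-count rule for the automatic labeling step and a multinomial Naive Bayes classifier with add-one smoothing. The function names (bootstrap_labels, train_naive_bayes, classify), the keyword lists, and the sample documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bootstrap_labels(docs, keywords):
    """Label each document with the category whose keywords it mentions most.

    Documents with no keyword match, or with a tie between the top two
    categories, are left out of the constructed training set.
    """
    labeled = []
    for doc in docs:
        tokens = tokenize(doc)
        counts = Counter(tokens)
        scores = {cat: sum(counts[w] for w in words)
                  for cat, words in keywords.items()}
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_cat, best = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best > 0 and best > runner_up:
            labeled.append((tokens, best_cat))
    return labeled


def train_naive_bayes(labeled):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, cat in labeled:
        doc_counts[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the most probable category, using add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)
        for tok in tokens:
            score += math.log((word_counts[cat][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


# Hypothetical category keywords and unlabeled documents.
keywords = {"sports": ["soccer", "game"], "finance": ["stock", "market"]}
unlabeled = [
    "the soccer game ended with a late goal",
    "the stock market fell as investors sold shares",
    "a late goal decided the game",
    "investors bought shares when the market rose",
    "weather was mild today",
]

labeled = bootstrap_labels(unlabeled, keywords)
model = train_naive_bayes(labeled)
print(classify("investors sold shares", model))
```

In this sketch the test sentence contains none of the category keywords, so it can only be classified through words learned from the automatically labeled documents. That is the point of building training data from keywords rather than matching keywords directly at classification time.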

Keywords

Text Categorization; Semi-Supervised Learning; Bootstrapping Techniques

