Automatic Text Categorization using the Importance of Sentences

Ko, Young-Joong; Park, Jin-Woo; Seo, Jung-Yun

Journal of KIISE: Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 29, Issue 6, pp. 417-424, 2002. pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

Ko, Young-Joong (Dept. of Computer Science, Sogang University);
Park, Jin-Woo (Diquest Inc.);
Seo, Jung-Yun (Diquest Inc.)

Published: 2002.06.01

Abstract

Automatic text categorization is the task of assigning predefined categories to free-text documents based on their content. Classifying documents requires selecting features that represent them well and then expressing each document in terms of those features. Previous work commonly represents a document by the frequency of each feature over the whole document, without distinguishing between sentences. Within a single document, however, some sentences are more important than others, and this difference affects the importance of the features that occur in them. In this paper, we apply the important-sentence extraction techniques used in text summarization to text categorization: we compute the importance of each sentence in a document and, when representing the document, weight the features occurring in each sentence according to that importance. To evaluate the method, we constructed a Korean newsgroup data set and ran experiments on it. The method gave a significant improvement over a baseline system that does not use sentence importance.
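The idea in the abstract, weighting each feature by the importance of the sentence it occurs in, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how sentence importance is scored, so the sketch assumes a simple stand-in: the cosine similarity between a sentence and the whole document. The names `tokenize`, `sentence_importance`, `weighted_features`, and the `floor` parameter are hypothetical, and the ASCII tokenizer is a placeholder for the morphological analysis that Korean text would need.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase ASCII word tokenizer (placeholder for real morphological analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_importance(sentences):
    """Assumed scoring: similarity of each sentence to the whole document."""
    doc_tf = Counter(t for s in sentences for t in tokenize(s))
    return [cosine(Counter(tokenize(s)), doc_tf) for s in sentences]


def weighted_features(sentences, floor=0.1):
    """Accumulate term frequencies, scaled by the importance of their sentence."""
    scores = sentence_importance(sentences)
    features = Counter()
    for sentence, score in zip(sentences, scores):
        weight = floor + score  # floor keeps low-scoring sentences from vanishing
        for term, tf in Counter(tokenize(sentence)).items():
            features[term] += tf * weight
    return features


doc = [
    "The election results were announced today.",
    "Election officials counted the election votes.",
    "Weather was nice.",
]
features = weighted_features(doc)
```

With plain frequency counts, "results" and "nice" would both have weight 1 in this toy document. Under the weighted scheme, "results" ends up heavier because its sentence is closer to the document as a whole. The resulting weights could then be passed to any standard classifier in place of raw term frequencies.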
