Development and Evaluation of a Document Summarization System using Features and a Text Component Identification Method

Jang, Dong-Hyun;Myaeng, Sung-Hyon;

Journal of KIISE: Software and Applications (한국정보과학회논문지:소프트웨어및응용), Vol. 27, No. 6, pp. 678-689, 2000, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

텍스트 구성요소 판별 기법과 자질을 이용한 문서 요약 시스템의 개발 및 평가

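Jang, Dong-Hyun (Department of Computer Science, Chungnam National University);
Myaeng, Sung-Hyon (Department of Computer Science, Chungnam National University)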

Published: 2000-06-15

Abstract

This paper describes an automatic summarization approach that constructs a summary by extracting the sentences most likely to represent the main theme of a document. To select summary sentences, the system uses a model that takes into account lexical and statistical information obtained from a document corpus. The system consists of two parts: a training part and a summarization part. The former processes sentences that have been manually tagged as summary sentences and extracts various kinds of statistical information; the latter uses that information to calculate the likelihood that a given sentence should be included in the summary. The research has at least three distinctive aspects. First, the system uses a text component identification model to assign each sentence to one of the text components, which makes it possible to eliminate in advance the parts of the text that are unlikely to contain summary sentences. Second, although the statistically based model stems from an existing one developed for English texts, it applies the framework to each feature separately and computes the final score for each sentence by combining the per-feature evidence with the Dempster-Shafer combination rule. Third, new features not used in earlier systems were introduced, and every feature was tested experimentally for its effect on summarization performance.
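The abstract says that per-feature evidence is merged with the Dempster-Shafer combination rule but gives no formulas. The following is a minimal sketch of how such a combination could look, not the authors' implementation. It assumes a two-element frame (summary / non-summary), assumes each feature yields a probability that the sentence belongs in the summary, and adds a hypothetical `reliability` discount that moves part of the mass to "unknown". All names and numbers are illustrative.

```python
from functools import reduce
from itertools import product

# Frame of discernment (an assumption for this sketch).
SUMMARY = frozenset({"summary"})
NON_SUMMARY = frozenset({"non_summary"})
EITHER = SUMMARY | NON_SUMMARY  # whole frame: evidence that says "unknown"


def feature_mass(p_summary, reliability=1.0):
    """Turn one feature's probability into a mass function.

    `reliability` is a hypothetical discount factor: the share of mass
    the feature may commit; the remainder goes to the whole frame.
    """
    return {
        SUMMARY: reliability * p_summary,
        NON_SUMMARY: reliability * (1.0 - p_summary),
        EITHER: 1.0 - reliability,
    }


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass on contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def sentence_score(feature_probs):
    """Combined belief that a sentence is a summary sentence."""
    masses = [feature_mass(p) for p in feature_probs]
    return reduce(combine, masses)[SUMMARY]


if __name__ == "__main__":
    # Two illustrative feature probabilities for one sentence.
    print(round(sentence_score([0.6, 0.7]), 4))
```

With fully reliable features the rule reduces to a normalized product of the per-feature probabilities, so features that agree reinforce each other. A `reliability` below 1 lets a weak feature abstain in part instead of voting against the sentence.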
