• Title/Summary/Keyword: Document summarization

Search Result 114, Processing Time 0.024 seconds

Development and Evaluation of a Document Summarization System using Features and a Text Component Identification Method (텍스트 구성요소 판별 기법과 자질을 이용한 문서 요약 시스템의 개발 및 평가)

  • Jang, Dong-Hyun;Myaeng, Sung-Hyon
    • Journal of KIISE:Software and Applications
    • /
    • v.27 no.6
    • /
    • pp.678-689
    • /
    • 2000
  • This paper describes an automatic summarization approach that constructs a summary by extracting sentences that are likely to represent the main theme of a document. As a way of selecting summary sentences, the system uses a model that takes into account lexical and statistical information obtained from a document corpus. As such, the system consists of two parts: the training part and the summarization part. The former processes sentences that have been manually tagged for summary sentences and extracts necessary statistical information of various kinds, and the latter uses the information to calculate the likelihood that a given sentence is to be included in the summary. There are at least three unique aspects of this research. First of all, the system uses a text component identification model to categorize sentences into one of the text components. This allows us to eliminate parts of text that are not likely to contain summary sentences. Second, although our statistically-based model stems from an existing one developed for English texts, it applies the framework to individual features separately and computes the final score for each sentence by combining the pieces of evidence using the Dempster-Shafer combination rule. Third, not only were new features introduced but also all the features were tested for their effectiveness in the summarization framework.

  • PDF

Document Summarization Using Mutual Recommendation with LSA and Sense Analysis (LSA를 이용한 문장 상호 추천과 문장 성향 분석을 통한 문서 요약)

  • Lee, Dong-Wook;Baek, Seo-Hyeon;Park, Min-Ji;Park, Jin-Hee;Jung, Hye-Wuk;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.22 no.5
    • /
    • pp.656-662
    • /
    • 2012
  • In this paper, we describe a new summarizing method based on a graph-based and a sense-based analysis. In the graph-based analysis, we convert sentences in a document into word vectors and calculate the similarity between each sentence using LSA. We reflect this similarity of sentences and the rarity scores of words in sentences to define weights of edges in the graph. Meanwhile, in the sense-based analysis, in order to determine the sense of words, subjectivity or objectivity, we built a database which is extended from the golden standards using Wordnet. We calculate the subjectivity of sentences from the sense of words, and select more subjective sentences. Lastly, we combine the results of these two methods. We evaluate the performance of the proposed method using classification games, which are usually used to measure the performances of summarization methods. We compare our method with the MS-Word auto-summarization, and verify the effectiveness of ours.

Multi-Document Summarization Method of Reviews Using Word Embedding Clustering (워드 임베딩 클러스터링을 활용한 리뷰 다중문서 요약기법)

  • Lee, Pil Won;Hwang, Yun Young;Choi, Jong Seok;Shin, Young Tae
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.535-540
    • /
    • 2021
  • Multi-document refers to a document consisting of various topics, not a single topic, and a typical example is online reviews. There have been several attempts to summarize online reviews because of their vast amounts of information. However, collective summarization of reviews through existing summary models creates a problem of losing the various topics that make up the reviews. Therefore, in this paper, we present method to summarize the review with minimal loss of the topic. The proposed method classify reviews through processes such as preprocessing, importance evaluation, embedding substitution using BERT, and embedding clustering. Furthermore, the classified sentences generate the final summary using the trained Transformer summary model. The performance evaluation of the proposed model was compared by evaluating the existing summary model, seq2seq model, and the cosine similarity with the ROUGE score, and performed a high performance summary compared to the existing summary model.

End-to-end Document Summarization using Copy Mechanism and Input Feeding (Copy Mechanism과 Input Feeding을 이용한 End-to-End 한국어 문서요약)

  • Choi, Kyoungho;Lee, Changki
    • 한국어정보학회:학술대회논문집
    • /
    • 2016.10a
    • /
    • pp.56-61
    • /
    • 2016
  • 본 논문에서는 Sequence-to-sequence 모델을 생성요약의 방법으로 한국어 문서요약에 적용하였으며, copy mechanism과 input feeding을 적용한 RNN search 모델을 사용하여 시스템의 성능을 높였다. 인터넷 신문기사를 수집하여 구축한 한국어 문서요약 데이터 셋(train set 30291 문서, development set 3786 문서, test set 3705문서)으로 실험한 결과, input feeding과 copy mechanism을 포함한 모델이 형태소 기준으로 ROUGE-1 35.92, ROUGE-2 15.37, ROUGE-L 29.45로 가장 높은 성능을 보였다.

  • PDF

Information Retrieval System for Mobile Devices (모바일 기기를 위한 정보검색 시스템)

  • Kim, Jae-Hoon;Kim, Hyung-Chul
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.33 no.4
    • /
    • pp.569-577
    • /
    • 2009
  • Mobile information retrieval is an evolving branch of information retrieval that is centered on mobile and ubiquitous environments. In general, mobile devices are characterized by lightweight, low power, small memory, small display, limited input/output, low bandwidth, and so on. Some of these characteristics make it impossible to apply general information retrieval to mobile environments without any modification. In order to relieve this problem, we design and implement an information retrieval system for mobile devices like wireless phones, PDA and handheld devices. We use document summarization techniques to alleviate the limitation of small display and user profiles to retrieve the most proper documents for each individual user for personalized search. Futhermore we use meta-search to lighten some burdens visiting several portal sites. In this paper, we have implemented and demonstrated the proposed mobile information retrieval system on the domain of travel and received good evaluation from users subjectively.

Information Retrieval System : Condor (콘도르 정보 검색 시스템)

  • 박순철;안동언
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.8 no.4
    • /
    • pp.31-37
    • /
    • 2003
  • This paper is a review of the large-scale information retrieval system, CONDOR. This system was developed by the consortium that consists of Chonbuk National University, Searchline Co. and Carnegie Mellon University. This system is based on the probabilistic model of information retrieval systems. The multi-language query processing, online document summarization based on query and dynamic hierarchy clustering of this system make difference of other systems. We test this system with 30 million web documents successfully.

  • PDF

Building a Korean Text Summarization Dataset Using News Articles of Social Media (신문기사와 소셜 미디어를 활용한 한국어 문서요약 데이터 구축)

  • Lee, Gyoung Ho;Park, Yo-Han;Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.8
    • /
    • pp.251-258
    • /
    • 2020
  • A training dataset for text summarization consists of pairs of a document and its summary. As conventional approaches to building text summarization dataset are human labor intensive, it is not easy to construct large datasets for text summarization. A collection of news articles is one of the most popular resources for text summarization because it is easily accessible, large-scale and high-quality text. From social media news services, we can collect not only headlines and subheads of news articles but also summary descriptions that human editors write about the news articles. Approximately 425,000 pairs of news articles and their summaries are collected from social media. We implemented an automatic extractive summarizer and trained it on the dataset. The performance of the summarizer is compared with unsupervised models. The summarizer achieved better results than unsupervised models in terms of ROUGE score.

Generic Summarization Using Generic Important of Semantic Features (의미특징의 포괄적 중요도를 이용한 포괄적 문서 요약)

  • Park, Sun;Lee, Jong-Hoon
    • Journal of Advanced Navigation Technology
    • /
    • v.12 no.5
    • /
    • pp.502-508
    • /
    • 2008
  • With the increased use of the internet and the tremendous amount of data it transfers, it is more necessary to summarize documents. We propose a new method using the Non-negative Semantic Variable Matrix (NSVM) and the generic important of semantic features obtained by Non-negative Matrix Factorization (NMF) to extract the sentences for automatic generic summarization. The proposed method use non-negative constraints which is more similar to the human's cognition process. As a result, the proposed method selects more meaningful sentences for summarization than the unsupervised method used the Latent Semantic Analysis (LSA) or clustering methods. The experimental results show that the proposed method archives better performance than other methods.

  • PDF

An Efficient Machine Learning-based Text Summarization in the Malayalam Language

  • P Haroon, Rosna;Gafur M, Abdul;Nisha U, Barakkath
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.6
    • /
    • pp.1778-1799
    • /
    • 2022
  • Automatic text summarization is a procedure that packs enormous content into a more limited book that incorporates significant data. Malayalam is one of the toughest languages utilized in certain areas of India, most normally in Kerala and in Lakshadweep. Natural language processing in the Malayalam language is relatively low due to the complexity of the language as well as the scarcity of available resources. In this paper, a way is proposed to deal with the text summarization process in Malayalam documents by training a model based on the Support Vector Machine classification algorithm. Different features of the text are taken into account for training the machine so that the system can output the most important data from the input text. The classifier can classify the most important, important, average, and least significant sentences into separate classes and based on this, the machine will be able to create a summary of the input document. The user can select a compression ratio so that the system will output that much fraction of the summary. The model performance is measured by using different genres of Malayalam documents as well as documents from the same domain. The model is evaluated by considering content evaluation measures precision, recall, F score, and relative utility. Obtained precision and recall value shows that the model is trustable and found to be more relevant compared to the other summarizers.