Title/Summary/Keyword: Summarization Model

Training Techniques for Data Bias Problem on Deep Learning Text Summarization (딥러닝 텍스트 요약 모델의 데이터 편향 문제 해결을 위한 학습 기법)

  • Cho, Jun Hee; Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.7 / pp.949-955 / 2022
  • Deep learning-based text summarization models are heavily dependent on the datasets they are trained on. For example, a summarization model trained on a news summarization dataset performs poorly on other types of text, such as internet posts and academic papers. In this study, we define this phenomenon as the Data Bias Problem (DBP) and propose two training methods for solving it. The first is 'proper noun masking', which masks proper nouns; the second is 'length variation', which randomly inflates or deflates the length of the text. Experiments show that our methods are effective at mitigating DBP. In addition, we analyze the experimental results and present directions for future work. Our contributions are as follows: (1) we discovered DBP and defined it for the first time; (2) we proposed two effective training methods and validated them experimentally; (3) our methods can be applied to any summarization model and are easy to implement, making them highly practical.
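
A minimal sketch of the two techniques, under stated assumptions: the abstract gives no implementation details, so the masking here relies on a proper-noun list assumed to come from an upstream NER/POS tagger, and length variation is realized as random sentence duplication and deletion with illustrative probabilities.

```python
import random

def mask_proper_nouns(tokens, proper_nouns, mask_token="[MASK]"):
    """Replace tokens flagged as proper nouns with a mask token.
    `proper_nouns` is assumed to come from an upstream tagger."""
    return [mask_token if t in proper_nouns else t for t in tokens]

def vary_length(sentences, inflate_prob=0.15, deflate_prob=0.15, seed=None):
    """Randomly inflate (duplicate) or deflate (drop) sentences so the
    model sees training documents of varying length."""
    rng = random.Random(seed)
    out = []
    for s in sentences:
        r = rng.random()
        if r < deflate_prob:
            continue                  # drop the sentence (deflate)
        out.append(s)
        if r > 1.0 - inflate_prob:
            out.append(s)             # repeat the sentence (inflate)
    return out or sentences           # never return an empty document

doc = ["Samsung released a new phone.", "Critics praised its camera.",
       "Sales figures are not yet public."]
print(mask_proper_nouns("Samsung released a new phone .".split(), {"Samsung"}))
print(vary_length(doc, seed=0))
```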

Document Thematic words Extraction using Principal Component Analysis (주성분 분석을 이용한 문서 주제어 추출)

  • Lee, Chang-Beom; Kim, Min-Soo; Lee, Ki-Ho; Lee, Guee-Sang; Park, Hyuk-Ro
    • Journal of KIISE: Software and Applications / v.29 no.10 / pp.747-754 / 2002
  • In this paper, we propose a method for extracting document thematic words using principal component analysis (PCA), one of the multivariate statistical methods. The proposed PCA model captures the flow of words in a document through its eigenvalues and eigenvectors and extracts thematic words accordingly. The model is evaluated by applying it to document summarization. Experimental results on newspaper articles show that the proposed model is superior to models based on either word frequency or an information retrieval thesaurus. We expect that the proposed model can also be applied to information retrieval, information extraction, and document summarization.
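
The eigenvector-based ranking can be approximated with off-the-shelf tools. This sketch builds a term-by-sentence count matrix and scores each word by the magnitude of its projection onto the first principal component; it illustrates the idea but is not the paper's exact model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The model extracts thematic words from the document.",
    "Principal component analysis finds directions of maximal variance.",
    "Thematic words summarize the main flow of the document.",
]

# Term-by-sentence count matrix; rows are words, columns are sentences.
vec = CountVectorizer()
X = vec.fit_transform(sentences).toarray().T  # shape: (n_words, n_sentences)

# PCA over the sentence dimensions; each word gets a score on the
# first principal component, interpreted as thematic importance.
pca = PCA(n_components=1)
scores = np.abs(pca.fit_transform(X).ravel())

words = np.array(vec.get_feature_names_out())
print(words[np.argsort(scores)[::-1][:5]])  # top-5 candidate thematic words
```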

Fine-tuning of Attention-based BART Model for Text Summarization (텍스트 요약을 위한 어텐션 기반 BART 모델 미세조정)

  • Ahn, Young-Pill; Park, Hyun-Jun
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.12 / pp.1769-1776 / 2022
  • Automatically summarizing long texts is an important technique, and the BART model is one of the most widely used models for the summarization task. In general, to build a summarization model for a specific domain, a language model pretrained on a large dataset is fine-tuned to fit that domain. Fine-tuning is usually done by changing the number of nodes in the last fully connected layer. In this paper, however, we propose a fine-tuning method that adds an attention layer, a component that has recently been applied to various models with good results. To evaluate the proposed method, we conducted various experiments, such as stacking the added layers deeper and fine-tuning without skip connections. As a result, the BART model with two added attention layers and skip connections achieved the best score.
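
A sketch of the described fine-tuning head: two self-attention layers with skip (residual) connections stacked on top of the decoder states of a pretrained model such as BART. Layer sizes, normalization, and the output projection are assumptions, and random tensors stand in for the decoder output.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Two stacked self-attention layers with skip connections, appended
    on top of a pretrained encoder-decoder; a sketch of the fine-tuning
    architecture the abstract describes, not the paper's exact code."""

    def __init__(self, d_model=768, n_heads=8, vocab_size=50265):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, hidden):            # hidden: (batch, seq, d_model)
        a1, _ = self.attn1(hidden, hidden, hidden)
        hidden = self.norm1(hidden + a1)  # skip connection 1
        a2, _ = self.attn2(hidden, hidden, hidden)
        hidden = self.norm2(hidden + a2)  # skip connection 2
        return self.lm_head(hidden)

head = AttentionHead()
decoder_states = torch.randn(2, 16, 768)  # stand-in for BART decoder output
print(head(decoder_states).shape)         # torch.Size([2, 16, 50265])
```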

Document Summarization Considering Entailment Relation between Sentences (문장 수반 관계를 고려한 문서 요약)

  • Kwon, Youngdae; Kim, Noo-ri; Lee, Jee-Hyong
    • Journal of KIISE / v.44 no.2 / pp.179-185 / 2017
  • Document summarization aims to generate a summary that is consistent and contains the sentences most relevant to the document. In this study, we implemented a document summarization system that extracts highly related sentences from a whole document by considering both the similarities and the entailment relations between sentences. We propose a new algorithm, TextRank-NLI, which combines a recurrent neural network-based natural language inference model with the graph-based ranking algorithm used in single-document extractive summarization. To evaluate the new algorithm, we conducted experiments on the same datasets used for the TextRank algorithm. The results indicate that TextRank-NLI improves performance by 2.3% compared to TextRank.
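
A sketch of how entailment scores can be folded into the graph-based ranking. A lexical-overlap function stands in for the paper's RNN-based NLI model, and the mixing weight `alpha` is an assumption.

```python
import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def entailment_score(premise, hypothesis):
    """Stand-in for an RNN-based NLI model; the paper's actual model
    returns an entailment probability for the sentence pair."""
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return len(shared) / max(len(hypothesis.split()), 1)

def textrank_nli(sentences, alpha=0.5, top_k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    g = nx.Graph()
    for i, j in itertools.combinations(range(len(sentences)), 2):
        # Edge weight mixes surface similarity with entailment strength.
        w = alpha * sim[i, j] + (1 - alpha) * entailment_score(
            sentences[i], sentences[j])
        if w > 0:
            g.add_edge(i, j, weight=w)
    ranks = nx.pagerank(g, weight="weight")
    best = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]

docs = ["The cat sat on the mat.", "A cat was sitting on a mat.",
        "Stock prices fell sharply today.", "The cat is on the mat."]
print(textrank_nli(docs))
```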

Single Document Extractive Summarization Based on Deep Neural Networks Using Linguistic Analysis Features (언어 분석 자질을 활용한 인공신경망 기반의 단일 문서 추출 요약)

  • Lee, Gyoung Ho; Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.8 no.8 / pp.343-348 / 2019
  • In recent years, extractive summarization systems based on end-to-end deep learning models have become popular. These systems require no human-crafted features and adopt data-driven approaches. However, previous studies have shown that linguistic analysis features such as parts of speech, named entities, and word frequencies are useful for extracting the important sentences of a document to generate a summary. In this paper, we propose an extractive summarization system based on deep neural networks that uses conventional linguistic analysis features. To demonstrate the usefulness of these features, we compare models with and without them. The experimental results show that the model with the linguistic analysis features improves the ROUGE-2 F1 score by 0.5 points over the model without them.
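
A sketch of one common way to feed such features to a neural extractor: embed each linguistic feature and concatenate it with the word embedding before the sentence encoder. The feature inventory, dimensions, and BiLSTM encoder are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureAugmentedEncoder(nn.Module):
    """Concatenates word embeddings with embeddings of linguistic analysis
    features (POS tag, named-entity tag, frequency bucket) before a BiLSTM
    sentence encoder; vocabulary and feature sizes are illustrative."""

    def __init__(self, vocab=10000, n_pos=45, n_ner=9, n_freq=10,
                 d_word=100, d_feat=16, d_hidden=128):
        super().__init__()
        self.word = nn.Embedding(vocab, d_word)
        self.pos = nn.Embedding(n_pos, d_feat)
        self.ner = nn.Embedding(n_ner, d_feat)
        self.freq = nn.Embedding(n_freq, d_feat)
        self.lstm = nn.LSTM(d_word + 3 * d_feat, d_hidden,
                            batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * d_hidden, 1)  # sentence extraction score

    def forward(self, words, pos, ner, freq):
        x = torch.cat([self.word(words), self.pos(pos),
                       self.ner(ner), self.freq(freq)], dim=-1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.score(h.mean(dim=1)))  # P(extract)

enc = FeatureAugmentedEncoder()
b, t = 4, 20  # 4 sentences, 20 tokens each
ids = lambda hi: torch.randint(0, hi, (b, t))
print(enc(ids(10000), ids(45), ids(9), ids(10)).shape)  # torch.Size([4, 1])
```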

Empirical Study for Automatic Evaluation of Abstractive Summarization by Error-Types (오류 유형에 따른 생성요약 모델의 본문-요약문 간 요약 성능평가 비교)

  • Seungsoo Lee; Sangwoo Kang
    • Korean Journal of Cognitive Science / v.34 no.3 / pp.197-226 / 2023
  • Generative text summarization is a natural language processing task that generates a short summary while preserving the content of a long text. ROUGE, a lexical-overlap-based metric, is widely used to evaluate text summarization models in generative summarization benchmarks. Although models score highly on it, studies report that around 30% of generated summaries are still inconsistent with the source text. This paper proposes a methodology for evaluating the performance of a summarization model without using a gold summary. AggreFACT is a human-annotated dataset that classifies the types of errors made by neural text summarization models. Among all the test candidates, the cases in which errors occurred throughout the generated summary showed the highest correlation. We observed that the proposed evaluation score correlates highly with models fine-tuned from BART and PEGASUS, which are pretrained large-scale Transformer architectures.
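
To make the "lexical-overlap based" characterization concrete, here is a minimal ROUGE-N computed from n-gram overlap counts; the official metric adds stemming and other options this sketch omits.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Minimal ROUGE-N recall/precision/F1 from n-gram overlap counts,
    illustrating the purely lexical nature of the metric."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((c & r).values())          # clipped n-gram matches
    recall = overlap / max(sum(r.values()), 1)
    precision = overlap / max(sum(c.values()), 1)
    f1 = 2 * recall * precision / max(recall + precision, 1e-12)
    return recall, precision, f1

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
```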

Development and Evaluation of a Document Summarization System using Features and a Text Component Identification Method (텍스트 구성요소 판별 기법과 자질을 이용한 문서 요약 시스템의 개발 및 평가)

  • Jang, Dong-Hyun; Myaeng, Sung-Hyon
    • Journal of KIISE: Software and Applications / v.27 no.6 / pp.678-689 / 2000
  • This paper describes an automatic summarization approach that constructs a summary by extracting sentences that are likely to represent the main theme of a document. As a way of selecting summary sentences, the system uses a model that takes into account lexical and statistical information obtained from a document corpus. As such, the system consists of two parts: the training part and the summarization part. The former processes sentences that have been manually tagged for summary sentences and extracts necessary statistical information of various kinds, and the latter uses the information to calculate the likelihood that a given sentence is to be included in the summary. There are at least three unique aspects of this research. First of all, the system uses a text component identification model to categorize sentences into one of the text components. This allows us to eliminate parts of text that are not likely to contain summary sentences. Second, although our statistically-based model stems from an existing one developed for English texts, it applies the framework to individual features separately and computes the final score for each sentence by combining the pieces of evidence using the Dempster-Shafer combination rule. Third, not only were new features introduced but also all the features were tested for their effectiveness in the summarization framework.
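
The Dempster-Shafer combination rule mentioned above can be stated compactly: the combined mass of a hypothesis sums the products of masses whose intersection yields that hypothesis, normalized by one minus the conflicting mass. A minimal sketch over the frame {summary, not-summary}; the feature names and mass values are illustrative.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination over frozenset-keyed mass functions.
    Here it fuses per-feature evidence that a sentence belongs to the
    summary; the frame of discernment is {'S' (summary), 'N' (not)}."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb              # mass on disjoint hypotheses
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}

S, N = frozenset("S"), frozenset("N")
theta = S | N  # ignorance mass assigned to the whole frame

# Evidence from two hypothetical features (e.g. position and cue words).
m_position = {S: 0.6, N: 0.1, theta: 0.3}
m_cue_word = {S: 0.5, N: 0.2, theta: 0.3}
print(dempster_combine(m_position, m_cue_word))
```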

Semantic Pre-training Methodology for Improving Text Summarization Quality (텍스트 요약 품질 향상을 위한 의미적 사전학습 방법론)

  • Mingyu Jeon; Namgyu Kim
    • Smart Media Journal / v.12 no.5 / pp.17-27 / 2023
  • Automatic text summarization, which extracts only the information meaningful to users, has been studied steadily, with recent research mainly focused on Transformer-based neural models. Among various approaches, the GSG (Gap Sentences Generation) method, which trains a model by masking whole sentences, has received the most attention. However, traditional GSG selects the sentences to be masked based on the degree of token overlap rather than on sentence meaning. Therefore, to improve the quality of text summarization, we propose the SbGSG (Semantic-based GSG) methodology, which selects the sentences to be masked by considering their meaning. Experiments using 370,000 news articles and 21,600 summaries and reports confirm that the proposed SbGSG outperforms traditional GSG in terms of ROUGE and BERTScore.
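
A sketch of the semantic selection step: rank sentences by their embedding-space similarity to the document centroid and mask the most central ones, instead of ranking by token overlap as in the original GSG. A hash-seeded random encoder stands in for a real semantic sentence encoder, and the masking ratio is an assumption.

```python
import numpy as np

def embed(sentence):
    """Stand-in for a semantic sentence encoder (e.g. an SBERT-style
    model); the abstract does not specify the actual encoder."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(64)

def select_gap_sentences(sentences, ratio=0.3):
    """Semantic-based GSG: mask the sentences most similar (in embedding
    space) to the rest of the document."""
    vecs = np.stack([embed(s) for s in sentences])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = vecs @ centroid                   # semantic centrality score
    k = max(1, int(len(sentences) * ratio))
    masked = np.argsort(sims)[::-1][:k]      # most central sentences
    return sorted(masked.tolist())

doc = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
print(select_gap_sentences(doc))  # indices of sentences to mask
```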

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

  • Al-Sabahi, Kamal; Zuping, Zhang; Kang, Yang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.1 / pp.254-276 / 2019
  • Since the amount of information on the internet is growing rapidly, it is not easy for a user to find information relevant to a query. To tackle this issue, researchers are paying much attention to document summarization. The key to any successful document summarizer is a good document representation, and traditional approaches based on word overlap mostly fail to produce one. Word embeddings have shown good performance by allowing words to match on a semantic level, but naively concatenating word embeddings makes common words dominant, which in turn diminishes representation quality. In this paper, we employ word embeddings to improve the weighting schemes used to calculate the Latent Semantic Analysis input matrix. Two embedding-based weighting schemes are proposed and then combined to calculate the values of this matrix; they are modified versions of the augment weight and the entropy frequency that combine the strengths of traditional weighting schemes and word embeddings. The proposed approach is evaluated on three English datasets: DUC 2002, DUC 2004, and Multilingual 2015 Single-document Summarization. Experimental results on the three datasets show that the proposed model achieves performance competitive with the state of the art, leading to the conclusion that it provides a better document representation and, as a result, a better document summary.
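
A sketch of the pipeline around the LSA input matrix: weight a term-by-sentence count matrix, factor it with SVD, and rank sentences by the first right singular vector. A classical log-entropy weight stands in for the paper's embedding-modified augment weight and entropy frequency schemes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Word embeddings let words match on a semantic level.",
    "Latent semantic analysis decomposes a term-sentence matrix.",
    "The top singular vector ranks sentences for the summary.",
]

vec = CountVectorizer()
A = vec.fit_transform(sentences).toarray().T.astype(float)  # terms x sents

# Stand-in global weight: classical log-entropy over sentence occurrences.
# The paper instead modifies augment weight and entropy frequency with
# word embeddings; that exact scheme is not reproduced here.
p = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
entropy = 1 + (p * np.log(np.where(p > 0, p, 1))).sum(axis=1) / np.log(A.shape[1])
W = A * entropy[:, None]

# LSA: SVD of the weighted matrix; the first right singular vector
# scores the sentences for extraction.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
ranking = np.argsort(np.abs(Vt[0]))[::-1]
print("sentence ranking:", ranking.tolist())
```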

End-to-end Korean Document Summarization using Copy Mechanism and Input-feeding (복사 방법론과 입력 추가 구조를 이용한 End-to-End 한국어 문서요약)

  • Choi, Kyoung-Ho; Lee, Changki
    • Journal of KIISE / v.44 no.5 / pp.503-509 / 2017
  • In this paper, the copy mechanism and input feeding are applied to a recurrent neural network (RNN) search model for Korean document summarization in an end-to-end manner. In addition, summarization performance is compared across models and tokenization formats: syllable-unit, morpheme-unit, and hybrid-unit tokenization. For the experiments, internet newspaper articles were collected to construct a Korean document summarization dataset (train set: 30,291 documents; development set: 3,786 documents; test set: 3,705 documents). With morpheme-unit tokenization, the model with input feeding and the copy mechanism achieved the best performance: ROUGE-1 35.92, ROUGE-2 15.37, and ROUGE-L 29.45.
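
The copy mechanism admits a compact statement: the final output distribution mixes the generator's vocabulary distribution with the attention distribution scattered onto the source token ids. A sketch in the common pointer/copy formulation; the shapes and the gate `p_gen` follow that convention rather than the paper's exact equations.

```python
import torch

def copy_distribution(p_gen, vocab_dist, attn, src_ids, vocab_size):
    """Mix the generator's vocabulary distribution with the attention
    distribution scattered onto the source token ids (copying)."""
    final = p_gen * vocab_dist                         # generate from vocab
    copy = torch.zeros_like(vocab_dist)
    copy.scatter_add_(1, src_ids, (1 - p_gen) * attn)  # copy from source
    return final + copy

batch, src_len, vocab = 2, 5, 50
vocab_dist = torch.softmax(torch.randn(batch, vocab), dim=1)
attn = torch.softmax(torch.randn(batch, src_len), dim=1)
src_ids = torch.randint(0, vocab, (batch, src_len))
p_gen = torch.rand(batch, 1)                           # generation gate

dist = copy_distribution(p_gen, vocab_dist, attn, src_ids, vocab)
print(dist.sum(dim=1))  # each row sums to 1: a valid distribution
```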