Empirical Study for Automatic Evaluation of Abstractive Summarization by Error-Types

A Comparison of Source-to-Summary Performance Evaluation of Abstractive Summarization Models by Error Type

  • 이승수 (School of AI·Software, Gachon University) ;
  • 강상우 (School of AI·Software, Gachon University)
  • Received : 2023.04.17
  • Accepted : 2023.07.28
  • Published : 2023.09.30

Abstract

Abstractive text summarization is a natural language processing task that generates a short, condensed summary while preserving the content of a long text. ROUGE, a lexical-overlap metric, is widely used to evaluate summarization models on abstractive summarization benchmarks. Although models achieve very high ROUGE scores, studies report that about 30% of generated summaries are still inconsistent with the source text. This paper proposes a method for evaluating a summarization model using only the source text, without a reference summary. We analyze how well the proposed evaluation score correlates with the labels of AggreFACT, a human-annotated dataset that classifies the types of errors made by neural text summarization models. Among all the cases tested, two showed the highest correlation: summaries generated by models fine-tuned from BART and PEGASUS, which are Transformer models pretrained at large scale, and summaries in which errors occur throughout the text.

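Abstractive text summarization is a natural language processing task that generates a short, condensed summary while preserving the content of a long text. Given the nature of the task, it is essential that the summary preserve the key content of the source text. Existing approaches to abstractive summarization have measured content coverage and fluency based on lexical overlap with a reference summary. ROUGE is a lexical-overlap metric widely used to evaluate abstractive summarization models. Even though ROUGE scores on abstractive summarization benchmarks are very high, at around 49, about 30% of generated summaries are inconsistent with the source text. This study proposes a method for evaluating abstractive summarization models using only the source text, without the help of a reference summary. A correlation analysis between the proposed evaluation score and the AggreFACT labels showed the highest correlation in two cases. The first is when models such as BART and PEGASUS, which apply large-scale pretraining to a Transformer encoder-decoder architecture, are used as the baseline summarization model. The second is when errors occur throughout the entire summary.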
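
To make the limitation of lexical overlap concrete, the following is a minimal sketch. It is neither the official ROUGE implementation nor the evaluation method proposed in this paper. It computes a simplified ROUGE-1 F1 (whitespace tokenization, no stemming) against a reference summary. The function name `rouge1_f1` and the example sentences are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, chosen only for illustration.
reference = "the company reported a profit of 3 million dollars in 2022"
faithful = "in 2022 the company made a profit of 3 million dollars"
unfaithful = "the company reported a loss of 3 million dollars in 2022"

score_faithful = rouge1_f1(faithful, reference)
score_unfaithful = rouge1_f1(unfaithful, reference)
print(f"faithful:   {score_faithful:.3f}")
print(f"unfaithful: {score_unfaithful:.3f}")
```

Both candidates share 10 of their 11 unigrams with the reference, so each scores 10/11 (about 0.91), even though the second one reverses the meaning of the source. A lexical-overlap score alone cannot separate the two, which is the kind of gap a source-based consistency evaluation is meant to address.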

Acknowledgement

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) in 2023 (No. NRF-2022R1A2C1005316).
