Building a Korean Text Summarization Dataset Using News Articles of Social Media

Building a Korean Document Summarization Dataset Using Newspaper Articles and Social Media

  • Received : 2020.05.18
  • Accepted : 2020.06.10
  • Published : 2020.08.31

Abstract

A training dataset for text summarization consists of pairs of a document and its summary. Because conventional approaches to building text summarization datasets rely on intensive human labor, it is not easy to construct large datasets for text summarization. Collections of news articles are among the most popular resources for text summarization because they are easily accessible and provide large amounts of high-quality text. From social media news services, we can collect not only the headlines and subheads of news articles but also the summary descriptions that human editors write about them. Approximately 425,000 pairs of news articles and their summaries were collected from social media. We implemented an automatic extractive summarizer, trained it on this dataset, and compared its performance with that of unsupervised models. The trained summarizer achieved better results than the unsupervised models in terms of ROUGE score.
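
A rough sketch of how such a document-summary pair might be assembled is shown below: the editor's social media description is joined with the article's headline and subhead to form the reference summary, and the article body becomes the document. The field names (description, headline, subhead, body) and the SummaryPair structure are illustrative assumptions, not the authors' actual pipeline.

    from dataclasses import dataclass

    @dataclass
    class SummaryPair:
        document: str  # full article body
        summary: str   # editor description plus headline and subhead

    def build_pair(post: dict, article: dict) -> SummaryPair:
        # Join the editor-written social media description with the article's
        # headline and subhead to form the reference summary (hypothetical fields).
        parts = [post.get("description", ""),
                 article.get("headline", ""),
                 article.get("subhead", "")]
        summary = " ".join(p.strip() for p in parts if p.strip())
        return SummaryPair(document=article["body"], summary=summary)

    # Toy usage with invented content
    pair = build_pair(
        {"description": "One-line description written by the news outlet's editor."},
        {"headline": "Example headline", "subhead": "Example subhead",
         "body": "Full article text ..."},
    )
    print(pair.summary)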

A training dataset for document summarization consists of documents and their summaries. Because existing summarization datasets relied on manually written summaries, it was difficult to obtain data at scale. For this reason, online newspaper articles, which are easy to collect and of high textual quality, have been widely used in summarization research. In this study, we propose constructing a Korean document summarization dataset by using the descriptions, headlines, and subheads that news outlets post on social media as summaries of the article body. We were able to build approximately 425,000 pairs of newspaper articles and their summaries. To demonstrate the usefulness of the constructed data, we implemented an extractive summarization system and compared the performance of a supervised model trained on our dataset with that of unsupervised models. Experimental results show that the model trained on the proposed data achieved higher ROUGE scores than the unsupervised algorithms.
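
The ROUGE comparison described above can be illustrated with a minimal ROUGE-1 F1 computed from unigram overlap on whitespace tokens. This is a simplified stand-in for the paper's evaluation, and the example strings are invented; a supervised extractive summary and a weaker baseline are scored against the same reference.

    from collections import Counter

    def rouge1_f1(candidate: str, reference: str) -> float:
        # Unigram-overlap F1 between a candidate summary and a reference summary.
        cand, ref = Counter(candidate.split()), Counter(reference.split())
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    reference = "the government announced a new housing policy on monday"
    supervised_summary = "the government announced a new housing policy"
    lead_baseline = "officials met in the capital to discuss several issues"
    print(rouge1_f1(supervised_summary, reference))  # higher overlap with the reference
    print(rouge1_f1(lead_baseline, reference))       # lower overlap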
