http://dx.doi.org/10.3745/KTSDE.2020.9.8.251

Building a Korean Text Summarization Dataset Using News Articles of Social Media  

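Lee, Gyoung Ho (Drama & Company)
Park, Yo-Han (Department of Radio and Information Communications Engineering, Chungnam National University)
Lee, Kong Joo (Department of Radio and Information Communications Engineering, Chungnam National University)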
Publication Information
KIPS Transactions on Software and Data Engineering / v.9, no.8, 2020, pp. 251-258
Abstract
A training dataset for text summarization consists of pairs of a document and its summary. Because conventional approaches to building text summarization datasets are labor-intensive, it is not easy to construct large datasets. News articles are one of the most popular resources for text summarization because they are easily accessible, large in scale, and high in quality. From social media news services, we can collect not only the headlines and subheads of news articles but also the summary descriptions that human editors write about them. We collected approximately 425,000 pairs of news articles and their summaries from social media. We implemented an automatic extractive summarizer and trained it on the dataset, then compared its performance with that of unsupervised models. The summarizer achieved higher ROUGE scores than the unsupervised models.
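As an illustration of the ROUGE-based comparison described above, the following is a minimal sketch of ROUGE-N scoring for a single article and description pair. It is not the authors' evaluation code. The whitespace tokenization, the `lead_k` baseline, and the sample pair are assumptions made for this example; the abstract does not state which ROUGE variant, tokenizer, or unsupervised baselines were used, and Korean text would normally need morpheme-level tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 using clipped n-gram overlap.

    Assumption: tokens are split on whitespace, which is only a
    stand-in for a proper (e.g. morpheme-level) Korean tokenizer.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # per-n-gram minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def lead_k(sentences, k=3):
    """Hypothetical unsupervised baseline: the first k sentences."""
    return " ".join(sentences[:k])


# Hypothetical (article sentences, editor description) pair.
article = [
    "the city opened a new bridge on monday",
    "officials say it will cut travel time",
    "construction took four years",
]
description = "the city opened a new bridge that will cut travel time"

p, r, f = rouge_n(lead_k(article, k=1), description, n=1)
print(f"ROUGE-1 P={p:.3f} R={r:.3f} F1={f:.3f}")
```

A trained extractive summarizer would replace `lead_k` with a model that selects sentences from the article, and the scores would be averaged over a held-out portion of the dataset.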
Keywords
Korean Text Summarization Dataset; Description; Headline; Subhead; Automatic Extractive Summarization;