http://dx.doi.org/10.3837/tiis.2019.01.015

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

Al-Sabahi, Kamal (School of Information Science and Engineering, Central South University)
Zuping, Zhang (School of Information Science and Engineering, Central South University)
Kang, Yang (School of Information Science and Engineering, Central South University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.13, no.1, 2019, pp. 254-276

Abstract

Since the amount of information on the internet is growing rapidly, it is not easy for a user to find information relevant to a query. To tackle this issue, researchers are paying much attention to document summarization. The key to any successful document summarizer is a good document representation. Traditional approaches based on word overlap mostly fail to produce such a representation. Word embeddings have shown good performance by allowing words to match on a semantic level. However, naively concatenating word embeddings makes common words dominant, which in turn diminishes the representation quality. In this paper, we employ word embeddings to improve the weighting schemes used to calculate the Latent Semantic Analysis input matrix. Two embedding-based weighting schemes are proposed and then combined to calculate the values of this matrix. They are modified versions of the augment weight and the entropy frequency that combine the strengths of traditional weighting schemes and word embeddings. The proposed approach is evaluated on three English datasets: DUC 2002, DUC 2004, and Multilingual 2015 Single-document Summarization. Experimental results on the three datasets show that the proposed model achieves competitive performance compared to the state of the art, leading to the conclusion that it provides a better document representation and, as a result, a better document summary.
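To make the role of the weighting schemes concrete, the following is a minimal sketch of how a Latent Semantic Analysis input matrix can be built from the classical augment weight (local) and entropy frequency (global) schemes and then used to score sentences. It is an illustration only: the abstract does not give the formulas of the embedding-based modifications, so they are not reproduced here. The function names, the toy sentences, and the length-based sentence scoring over the top singular dimensions are assumptions chosen for the example, not the authors' implementation.

```python
import numpy as np

def term_sentence_counts(sentences):
    """Build a raw term-by-sentence count matrix from tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(sentences)))
    for j, sent in enumerate(sentences):
        for w in sent:
            tf[index[w], j] += 1
    return vocab, tf

def augmented_weight(tf):
    """Classical augment (local) weight: 0.5 + 0.5 * tf / max tf in the sentence."""
    max_tf = tf.max(axis=0, keepdims=True)
    max_tf[max_tf == 0] = 1.0  # guard against empty sentences
    return np.where(tf > 0, 0.5 + 0.5 * tf / max_tf, 0.0)

def entropy_weight(tf):
    """Classical entropy (global) weight: 1 + sum_j p_ij log p_ij / log n."""
    n = tf.shape[1]
    if n < 2:
        return np.ones(tf.shape[0])
    gf = tf.sum(axis=1, keepdims=True)
    gf[gf == 0] = 1.0
    p = tf / gf
    # Replace zeros by 1 inside the log so that 0 * log(0) is treated as 0.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log(n)

def lsa_sentence_scores(A, rank):
    """Score each sentence by its vector length in the top-`rank` latent space."""
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    r = min(rank, len(s))
    return np.sqrt(((s[:r, None] * vt[:r, :]) ** 2).sum(axis=0))

# Hypothetical toy document, already tokenized.
sentences = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
vocab, tf = term_sentence_counts(sentences)
G = entropy_weight(tf)
A = augmented_weight(tf) * G[:, None]   # LSA input matrix
scores = lsa_sentence_scores(A, rank=2)
ranking = np.argsort(-scores)           # sentence indices, best first
```

In this sketch a term that occurs in only one sentence gets the maximum global weight of 1, while a term spread evenly over all sentences gets 0, which is how the entropy scheme suppresses common words. An embedding-based variant would change how the local and global weights are computed, but the matrix construction and decomposition steps would stay the same.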

Keywords

Word Embedding; Augment Weight; Entropy Frequency; Word2Vec; Document Summarization; Latent Semantic Analysis
