http://dx.doi.org/10.4218/etrij.17.0116.0074

Sentence-Chain Based Seq2seq Model for Corpus Expansion

Chung, Euisok (SW & Contents Research Laboratory, ETRI)
Park, Jeon Gue (SW & Contents Research Laboratory, ETRI)

Publication Information

ETRI Journal / v.39, no.4, 2017, pp. 455-466

Abstract

This study focuses on sequential data augmentation as a way to alleviate data sparseness. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation: it can generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. To train the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are fed to the encoder of the seq2seq model, while the last sentence becomes the target sequence for the decoder. Using only internal resources, evaluation results show a relative perplexity improvement of approximately 7.6% over a baseline language model for Korean text. Additionally, in a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4 times the number of n-grams with superior performance for English text.
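
The triple layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how sentence chains are built or how triples are drawn from them, so the sliding window of three, the `<sep>` token joining the two encoder sentences, and the function names `chain_to_triples` and `triple_to_example` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical separator between the two encoder-side sentences (an assumption;
# the abstract does not say how the two sentences are combined).
SEP = "<sep>"


def chain_to_triples(chain: List[str]) -> List[Tuple[str, str, str]]:
    """Slide a window of three over an already-built sentence chain.

    Assumption: consecutive sentences in the chain form the triples.
    """
    return [(chain[i], chain[i + 1], chain[i + 2]) for i in range(len(chain) - 2)]


def triple_to_example(triple: Tuple[str, str, str]) -> Tuple[str, str]:
    """Map a triple to (encoder input, decoder target).

    The first two sentences go to the encoder; the third is the decoder target,
    as described in the abstract.
    """
    first, second, third = triple
    return f"{first} {SEP} {second}", third


# Toy chain; real chains would come from a chain-construction step not shown here.
chain = ["the cat sat", "the cat slept", "the dog barked", "the dog ran"]
triples = chain_to_triples(chain)
examples = [triple_to_example(t) for t in triples]

for encoder_input, decoder_target in examples:
    print(encoder_input, "->", decoder_target)
```

A chain of n sentences yields n - 2 training pairs under this windowing assumption, and chains shorter than three sentences yield none.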

Keywords

Sentence chain; Lexical chain; Seq2seq model; Corpus expansion
