http://dx.doi.org/10.4218/etrij.17.0116.0074

Sentence-Chain Based Seq2seq Model for Corpus Expansion  

Chung, Euisok (SW & Contents Research Laboratory, ETRI)
Park, Jeon Gue (SW & Contents Research Laboratory, ETRI)
Publication Information
ETRI Journal / v.39, no.4, 2017, pp. 455-466
Abstract
This study focuses on a method for sequential data augmentation that alleviates data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation; it can generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used as input to the encoder of the seq2seq model, while the last sentence becomes the target sequence for the decoder. Using only internal resources, the method achieves a relative perplexity improvement of approximately 7.6% over a baseline language model on Korean text. Additionally, in a comparison with a previous study on English text, the sentence-chain approach reduces the size of the training data by 38.4% while generating 1.4 times as many n-grams and achieving superior performance.
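The triple construction described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how sentence chains are built or how the two encoder sentences are combined, so the sliding window over an already-built chain, the function names `chain_to_triples` and `triple_to_training_pair`, and the `<sep>` separator token are all hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def chain_to_triples(chain: List[str]) -> List[Triple]:
    """Slide a window of three sentences over one sentence chain.

    Assumption: `chain` is an ordered list of related sentences that has
    already been extracted; how the chain is built is not shown here.
    Chains shorter than three sentences yield no triples.
    """
    return [
        (chain[i], chain[i + 1], chain[i + 2])
        for i in range(len(chain) - 2)
    ]


def triple_to_training_pair(triple: Triple, sep: str = "<sep>") -> Tuple[str, str]:
    """Turn a triple into an (encoder input, decoder target) pair.

    The first two sentences form the encoder input and the last sentence
    is the decoder target. Joining with a separator token is an
    assumption made for this sketch.
    """
    first, second, target = triple
    return f"{first} {sep} {second}", target


if __name__ == "__main__":
    # Toy chain of four sentences (made-up data for illustration only).
    chain = ["a b", "c d", "e f", "g h"]
    for triple in chain_to_triples(chain):
        print(triple_to_training_pair(triple))
```

A chain of n sentences gives n - 2 training pairs under this windowing, so longer chains contribute proportionally more seq2seq training examples.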
Keywords
Sentence chain; Lexical chain; Seq2seq model; Corpus expansion