Sentence-Chain Based Seq2seq Model for Corpus Expansion

  • Received : 2016.08.14
  • Accepted : 2017.05.22
  • Published : 2017.08.01

Abstract

This study focuses on a method of sequential data augmentation to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation: it can generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. To train the seq2seq model, sentence chains are used as triples: the first two sentences in a triple are fed to the encoder, while the last sentence becomes the target sequence for the decoder. Using only internal resources, the expanded corpus yields an approximately 7.6% relative improvement in perplexity over a baseline language model of Korean text. Additionally, compared with a previous study, the sentence-chain approach reduces the size of the training data by 38.4% while generating 1.4 times as many n-grams, with superior performance on English text.
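To make the triple construction concrete, the following sketch (not the authors' code) shows one way sentence-chain training pairs could be assembled before being fed to a seq2seq encoder-decoder. The word-overlap link test and all helper names are illustrative assumptions; the abstract does not specify the actual chaining criterion.

```python
# Minimal sketch (illustrative only): build sentence-chain triples for
# seq2seq training. The overlap-based link test below is an assumed
# placeholder for whatever chaining criterion the paper actually uses.
from typing import List, Set, Tuple

def content_words(sentence: str) -> Set[str]:
    """Crude content-word extraction: lowercase tokens longer than 3 characters."""
    tokens = (t.strip(".,!?").lower() for t in sentence.split())
    return {t for t in tokens if len(t) > 3}

def linked(s1: str, s2: str, min_overlap: int = 1) -> bool:
    """Assumed link test: two sentences chain if they share content words."""
    return len(content_words(s1) & content_words(s2)) >= min_overlap

def make_training_pairs(sentences: List[str]) -> List[Tuple[str, str]]:
    """Slide a window of three sentences over the corpus; for each window whose
    sentences chain together, the first two sentences become the encoder source
    and the third becomes the decoder target, as described in the abstract."""
    pairs = []
    for a, b, c in zip(sentences, sentences[1:], sentences[2:]):
        if linked(a, b) and linked(b, c):
            pairs.append((a + " " + b, c))
    return pairs

if __name__ == "__main__":
    corpus = [
        "The encoder reads two context sentences.",
        "From the context sentences it builds a representation.",
        "The decoder turns that representation into a new sentence.",
        "Generated sentences are then added to the language model corpus.",
    ]
    for source, target in make_training_pairs(corpus):
        print("ENCODER SOURCE:", source)
        print("DECODER TARGET:", target)
```

In a full pipeline, each (source, target) pair would then be used to train an encoder-decoder model, and the decoded outputs would be appended to the corpus before the n-gram language model is rebuilt and its perplexity measured.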


Cited by

  1. Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector, vol. 10, no. 12, 2020, https://doi.org/10.3390/app10124091