http://dx.doi.org/10.13064/KSSS.2018.10.1.039

An end-to-end synthesis method for Korean text-to-speech systems  

Choi, Yeunju (School of Electrical Engineering, KAIST)
Jung, Youngmoon (School of Electrical Engineering, KAIST)
Kim, Younggwan (School of Electrical Engineering, KAIST)
Suh, Youngjoo (School of Electrical Engineering, KAIST)
Kim, Hoirin (KAIST)
Publication Information
Phonetics and Speech Sciences, vol. 10, no. 1, 2018, pp. 39-48
Abstract
A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This modular structure causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in one module accumulate as they propagate through the subsequent modules. An end-to-end TTS system can avoid these problems by synthesizing speech signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, an end-to-end TTS system based on a sequence-to-sequence model with an attention mechanism. We trained the system on 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained a mean opinion score (MOS) of 2.98 and a degradation mean opinion score (DMOS) of 3.25. We discuss the factors that affected the training of the system. The experiments demonstrate that the post-processing network needs to be designed with the output language and input characters in mind, and that the maximum value of n for the n-grams modeled by the encoder should be kept small enough for the amount of training data.
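For illustration, the sketch below shows the core of such a sequence-to-sequence model with attention: a character encoder, a content-based (Bahdanau-style) attention module, and a decoder that predicts mel-spectrogram frames directly from the input characters. This is a minimal PyTorch sketch under our own assumptions, not the authors' or Google's implementation; all layer sizes, the character-level input, and the names CharEncoder, BahdanauAttention, and MelDecoder are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoder(nn.Module):
    """Embeds input characters and encodes them with a bidirectional GRU."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, chars):                       # chars: (B, T_in) int64 ids
        emb = self.embedding(chars)                 # (B, T_in, emb_dim)
        memory, _ = self.rnn(emb)                   # (B, T_in, 2 * hid_dim)
        return memory

class BahdanauAttention(nn.Module):
    """Content-based attention (Bahdanau et al., 2014): scores each encoder
    step against the decoder state and returns a weighted context vector."""
    def __init__(self, query_dim, memory_dim, attn_dim=128):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim)
        self.memory_layer = nn.Linear(memory_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory):               # query: (B, query_dim)
        scores = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)    # (B, 1, attn_dim)
            + self.memory_layer(memory)             # (B, T_in, attn_dim)
        )).squeeze(-1)                              # (B, T_in)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights                     # (B, memory_dim), (B, T_in)

class MelDecoder(nn.Module):
    """Autoregressively predicts mel-spectrogram frames, attending to the
    encoder memory at every step (fixed output length for simplicity)."""
    def __init__(self, memory_dim=256, n_mels=80, hid_dim=256):
        super().__init__()
        self.n_mels, self.hid_dim = n_mels, hid_dim
        self.attention = BahdanauAttention(hid_dim, memory_dim)
        self.rnn = nn.GRUCell(n_mels + memory_dim, hid_dim)
        self.frame_proj = nn.Linear(hid_dim, n_mels)

    def forward(self, memory, n_frames):
        B = memory.size(0)
        h = memory.new_zeros(B, self.hid_dim)       # initial decoder state
        frame = memory.new_zeros(B, self.n_mels)    # all-zero <GO> frame
        frames = []
        for _ in range(n_frames):
            context, _ = self.attention(h, memory)
            h = self.rnn(torch.cat([frame, context], dim=-1), h)
            frame = self.frame_proj(h)
            frames.append(frame)
        return torch.stack(frames, dim=1)           # (B, n_frames, n_mels)

# A batch of 2 dummy "texts" of 40 character ids from a vocabulary of 80.
encoder, decoder = CharEncoder(vocab_size=80), MelDecoder()
memory = encoder(torch.randint(0, 80, (2, 40)))
mels = decoder(memory, n_frames=100)                # (2, 100, 80)
```

In the full Tacotron pipeline, a post-processing network would further map these mel frames to a linear spectrogram, from which the Griffin-Lim algorithm (Griffin & Lim, 1984) reconstructs a waveform; the sketch stops at the mel frames.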
Keywords
attention mechanism; end-to-end; Korean text-to-speech system; sequence-to-sequence; Tacotron;
Citations & Related Records
  • References
1 Raffel, C., Luong, M.-T., Liu, P., Weiss, R., & Eck, D. (2017). Online and linear-time attention by enforcing monotonic alignments. Proceedings of the 34th International Conference on Machine Learning (pp. 2837-2846). Sydney, Australia. 6-11 August, 2017.
2 Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Retrieved from http://arxiv.org/abs/1712.05884 [Computing Research Repository] on March 1, 2018.
3 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
4 Srivastava, R., Greff, K., & Schmidhuber, J. (2015). Highway networks. Retrieved from http://arxiv.org/abs/1505.00387 [Computing Research Repository] on January 9, 2018.
5 Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (pp. 3104-3112). 8-13 December, 2014.
6 Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., & Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234-1252.
7 Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems 28 (pp. 2773-2781). 7-12 December, 2015.
8 Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from http://arxiv.org/abs/1703.10135 [Computing Research Repository] on January 9, 2018.
9 Arik, S., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017a). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning (pp. 195-204). Sydney, Australia. 6-11 August, 2017.
10 Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 218-223). Sunnyvale, CA. 13-15 September, 2016.
11 Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Retrieved from http://arxiv.org/abs/1406.1078 [Computing Research Repository] on January 9, 2018.
12 Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017b). Deep Voice 2: Multi-speaker neural text-to-speech. Advances in Neural Information Processing Systems 30 (pp. 2966-2974). Long Beach, CA. 4-9 December, 2017.
13 Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Retrieved from http://arxiv.org/abs/1409.0473 [Computing Research Repository] on January 9, 2018.
14 Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48). 14-18 June, 2009.
15 Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from http://arxiv.org/abs/1412.3555 [Computing Research Repository] on January 9, 2018.
16 Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2017). Capacity and trainability in recurrent neural networks. Proceedings of the 5th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BydARw9ex on January 9, 2018.
17 Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373-376). 7-10 May, 1996.
18 Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243.
19 He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). 26 June-1 July, 2016.
20 Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
21 Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (pp. 448-456). Lille, France. 6-11 July, 2015.
22 Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1303-1306). 21-24 April, 1997.
23 Lee, J., Cho, K., & Hofmann, T. (2016). Fully character-level neural machine translation without explicit segmentation. Retrieved from http://arxiv.org/abs/1610.03017 [Computing Research Repository] on January 9, 2018.
24 Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99-D(7), 1877-1884.
25 Rabiner, L., & Schafer, R. (2011). Theory and applications of digital speech processing. New Jersey: Pearson.
26 Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from http://arxiv.org/abs/1609.03499 [Computing Research Repository] on January 9, 2018.
27 Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2017). Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. Retrieved from http://arxiv.org/abs/1710.07654 [Computing Research Repository] on January 9, 2018.