http://dx.doi.org/10.13064/KSSS.2021.13.4.047

End-to-end non-autoregressive fast text-to-speech  

Kim, Wiback (Department of English Language and Literature, Korea University)
Nam, Hosung (Department of English Language and Literature, Korea University)
Publication Information
Phonetics and Speech Sciences / v.13, no.4, 2021, pp. 47-53
Abstract
Autoregressive Text-to-Speech (TTS) models suffer from inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t affects all subsequent predictions. Slow inference arises because the model structure requires the predicted samples from time steps 1 to t-1 in order to predict the sample at time step t, so samples must be generated one at a time. In this study, an end-to-end non-autoregressive fast text-to-speech model is proposed as a solution to these problems. The results show that the model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while its inference speed and stability are higher than those of Tacotron 2 - WaveNet. Further, this study aims to offer insight into the improvement of non-autoregressive models.
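The two problems named in the abstract can be illustrated with a toy sketch. This is not the model from the paper: the functions `autoregressive_generate`, `non_autoregressive_generate`, and `count_affected`, and the simple update rules passed to them, are hypothetical stand-ins chosen only to show the dependency structure. One bad prediction is injected at a single time step, and the number of output samples it changes is counted in each setting.

```python
# Toy contrast between autoregressive and non-autoregressive generation.
# Every function and update rule here is a hypothetical stand-in,
# not the architecture described in the paper.

def autoregressive_generate(n_steps, step_fn, first_sample, corrupt_at=None):
    """Generate samples sequentially; sample t is computed from sample t-1."""
    samples = [first_sample]
    for t in range(1, n_steps):
        nxt = step_fn(samples[-1])  # needs the previous output, so no parallelism
        if t == corrupt_at:
            nxt += 1.0  # inject one poorly predicted sample at step t
        samples.append(nxt)
    return samples

def non_autoregressive_generate(n_steps, frame_fn, corrupt_at=None):
    """Generate every sample from its position alone, independent of other outputs."""
    samples = [frame_fn(t) for t in range(n_steps)]  # each item could run in parallel
    if corrupt_at is not None:
        samples[corrupt_at] += 1.0  # the same single bad prediction
    return samples

def count_affected(clean, corrupted):
    """Count the time steps whose value differs between the two runs."""
    return sum(1 for a, b in zip(clean, corrupted) if a != b)

step = lambda prev: prev + 1.0   # assumed toy recurrence
frame = lambda t: float(t)       # assumed toy per-position predictor

clean_ar = autoregressive_generate(10, step, 0.0)
bad_ar = autoregressive_generate(10, step, 0.0, corrupt_at=3)
clean_nar = non_autoregressive_generate(10, frame)
bad_nar = non_autoregressive_generate(10, frame, corrupt_at=3)

print(count_affected(clean_ar, bad_ar))    # 7: steps 3 through 9 are all shifted
print(count_affected(clean_nar, bad_nar))  # 1: only step 3 is wrong
```

In the autoregressive run the error at step 3 carries into every later step, and the loop cannot start step t before step t-1 finishes. In the non-autoregressive run the error stays local and all positions can be computed at once, which is the property the proposed model relies on for speed and stability.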
Keywords
deep learning; neural network; speech synthesis; Text-to-Speech (TTS)