[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13064/KSSS.2021.13.4.047

End-to-end non-autoregressive fast text-to-speech

Kim, Wiback (Department of English Language and Literature, Korea University)
Nam, Hosung (Department of English Language and Literature, Korea University)

Publication Information

Phonetics and Speech Sciences / v.13, no.4, 2021, pp. 47-53

Abstract

Autoregressive Text-to-Speech (TTS) models suffer from inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t affects all subsequent predictions. Slow inference speed arises from a model structure that requires the samples predicted at time steps 1 to t-1 in order to predict the sample at time step t, so samples must be generated one at a time. In this study, an end-to-end non-autoregressive fast text-to-speech model is proposed as a solution to these problems. The results show that the model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while its inference speed and stability are higher than those of Tacotron 2 - WaveNet. Further, this study aims to offer insight into the improvement of non-autoregressive models.
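The two problems named in the abstract can be illustrated with a toy sketch. This is not the authors' model: the function names, the 0.5 and 2.0 coefficients, and the `glitch_at` parameter are all hypothetical, chosen only to show that an autoregressive loop needs the previous prediction at every step, so one bad sample propagates forward, whereas a non-autoregressive mapping computes each step from the input alone.

```python
from typing import List


def autoregressive_decode(inputs: List[float], glitch_at: int = -1) -> List[float]:
    """Toy autoregressive decoder: step t needs the prediction from step t-1."""
    outputs: List[float] = []
    prev = 0.0
    for t, x in enumerate(inputs):
        y = 0.5 * prev + x  # depends on the previous output, so steps run in order
        if t == glitch_at:
            y += 1.0  # simulate one poorly predicted sample
        outputs.append(y)
        prev = y
    return outputs


def non_autoregressive_decode(inputs: List[float], glitch_at: int = -1) -> List[float]:
    """Toy non-autoregressive decoder: every step depends on its input only."""
    outputs = [2.0 * x for x in inputs]  # steps are independent and could run in parallel
    if 0 <= glitch_at < len(outputs):
        outputs[glitch_at] += 1.0  # the same simulated bad sample
    return outputs


def affected_steps(clean: List[float], noisy: List[float]) -> List[int]:
    """Indices of the time steps whose output changed because of the glitch."""
    return [t for t, (a, b) in enumerate(zip(clean, noisy)) if abs(a - b) > 1e-12]


if __name__ == "__main__":
    inputs = [1.0] * 6
    print(affected_steps(autoregressive_decode(inputs),
                         autoregressive_decode(inputs, glitch_at=2)))
    print(affected_steps(non_autoregressive_decode(inputs),
                         non_autoregressive_decode(inputs, glitch_at=2)))
```

In this sketch a glitch at step 2 changes every later step of the autoregressive output, but only step 2 of the non-autoregressive output.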

Keywords

deep learning; neural network; speech synthesis; Text-to-Speech (TTS)

References

1	Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., ... Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. Retrieved from https://arxiv.org/abs/1702.07825
2	Cho, K. (2013). Boltzmann machines and denoising autoencoders for image denoising. Retrieved from https://arxiv.org/abs/1301.3468
3	Valle, R., Shih, K., Prenger, R., & Catanzaro, B. (2020). Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. Retrieved from https://arxiv.org/abs/2005.05957
4	Yamamoto, R., Song, E., & Kim, J. M. (2019). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Retrieved from https://arxiv.org/abs/1910.11480
5	Holmes, J., & Holmes, W. (2002). Speech synthesis and recognition. London, UK: CRC Press.
6	Griffin, D., & Lim, J. (1983, April). Signal estimation from modified short-time Fourier transform. Proceedings of the 8th International Conference on Acoustics, Speech, and Signal Processing (pp. 804-807). Boston, MA.
7	Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Retrieved from https://arxiv.org/abs/1712.05884
8	Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Retrieved from https://arxiv.org/abs/1409.3215
9	van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
10	Dvorak, J. L. (2011). Moving wearables into the mainstream: Taming the Borg. New York, NY: Springer.
11	Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brebisson, A., ... Courville, A. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Retrieved from https://arxiv.org/abs/1910.06711
12	Yarrington, D. (2007). Synthesizing speech for communication devices. In K. Greenebaum, & R. Barzel (Eds.), Audio anecdotes: Tools, tips and techniques for digital audio (Vol. 3, pp. 143-155). Wellesley, MA: AK Peters.
13	Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Retrieved from https://arxiv.org/abs/1905.09263
14	Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. Retrieved from https://arxiv.org/abs/1710.08969
15	Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762
16	Wang, Y., Skerry-Ryan, RJ., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from https://arxiv.org/abs/1703.10135
17	van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., ... Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. Retrieved from https://arxiv.org/abs/1711.10433
18	Wang, T., Liu, X., Tao, J., Yi, J., Fu, R., & Wen, Z. (2020, October). Non-autoregressive end-to-end TTS with coarse-to-fine decoding. Proceedings of the 21st Annual Conference of the International Speech Communication Association (pp. 3984-3988). Shanghai, China.

Korean title: End-to-end 비자기회귀식 가속 음성합성기