http://dx.doi.org/10.17703/JCCT.2021.7.1.640

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning  

Kwon, Chul Hong (Dept. of Information, Communication, and Electronics Engineering, Daejeon Univ.)
Publication Information
The Journal of the Convergence on Culture Technology, v.7, no.1, 2021, pp. 640-645
Abstract
The deep learning based end-to-end TTS system consists of a Text2Mel module that generates a mel spectrogram from text and a vocoder module that synthesizes speech signals from the spectrogram. Recently, applying deep learning technology to TTS has improved the intelligibility and naturalness of synthesized speech to the level of human vocalization. However, such systems have the disadvantage that their inference speed is very slow compared to conventional methods. The inference speed can be improved by applying non-autoregressive methods, which generate speech samples in parallel, independent of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder technologies, all of which apply the non-autoregressive method, and we implement them to verify whether they can synthesize speech in real time. Experimental results show that the measured real-time factors (RTFs) indicate all the presented methods are fully capable of real-time processing. Moreover, the size of each trained model, except WaveGlow, is on the order of tens to hundreds of megabytes, so these models can be applied in embedded environments where memory is limited.
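To make the two-stage pipeline and the real-time criterion concrete, below is a minimal Python sketch, not the paper's implementation: the text2mel and vocoder objects and their infer methods are hypothetical placeholders for any non-autoregressive Text2Mel model (e.g., FastSpeech) and vocoder (e.g., Parallel WaveGAN). It computes the real-time factor (RTF) as synthesis time divided by the duration of the generated audio, so an RTF below 1 means faster-than-real-time synthesis.

import time

SAMPLE_RATE = 22050  # assumed sampling rate of the synthesized waveform

def synthesize(text, text2mel, vocoder):
    # Stage 1 (Text2Mel): generate a mel spectrogram from the input text.
    mel = text2mel.infer(text)        # hypothetical model call
    # Stage 2 (vocoder): synthesize the waveform from the mel spectrogram.
    wav = vocoder.infer(mel)          # hypothetical model call; 1-D samples
    return wav

def real_time_factor(text, text2mel, vocoder):
    # RTF = time spent synthesizing / duration of the synthesized audio.
    start = time.perf_counter()
    wav = synthesize(text, text2mel, vocoder)
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / SAMPLE_RATE
    return elapsed / audio_seconds    # RTF < 1 => real-time capable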
Keywords
deep learning; Text-to-Speech (TTS); real-time; non-autoregressive method
References
1 A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 373-376, 1996
2 T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis", Proceedings of the Eurospeech 1999, pp. 2347-2350, 1999
3 Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis", arXiv preprint, https://arxiv.org/pdf/1703.10135.pdf, 2017 Apr.
4 J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions", arXiv preprint, https://arxiv.org/pdf/1712.05884.pdf, 2018 Feb.
5 N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, M. Zhou, "Neural speech synthesis with transformer network", arXiv preprint, https://arxiv.org/pdf/1809.08895.pdf, 2019 Jan.
6 Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T. Y. Liu, "FastSpeech: Fast, robust and controllable text to speech", arXiv preprint, https://arxiv.org/pdf/1905.09263.pdf, 2019 Nov.
7 A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, "WaveNet: A generative model for raw audio", arXiv preprint, https://arxiv.org/pdf/1609.03499.pdf, 2016 Sep.
8 C. H. Kwon, "Performance comparison of state-of-the-art vocoder technology based on deep learning in a Korean TTS system", The Journal of the Convergence on Culture Technology (JCCT), Vol. 6, No. 2, pp. 509-514, 2020
9 N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, K. Kavukcuoglu, "Efficient neural audio synthesis", arXiv preprint, https://arxiv.org/pdf/1802.08435.pdf, 2018 Feb.
10 Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T. Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech", arXiv preprint, https://arxiv.org/pdf/2006.04558.pdf, 2020 Oct.
11 A. Lancucki, "FastPitch: Parallel text-to-speech with pitch prediction", arXiv preprint, https://arxiv.org/pdf/2006.06873.pdf, 2020 Jun.
12 R. Yamamoto, E. W. Song, J. M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram", arXiv preprint, https://arxiv.org/pdf/1910.11480.pdf, 2020 Feb.
13 K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis", arXiv preprint, https://arxiv.org/pdf/1910.06711.pdf, 2019 Dec.
14 G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, L. Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech", arXiv preprint, https://arxiv.org/pdf/2005.05106.pdf, 2020 Nov.
15 R. Prenger, R. Valle, B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis", arXiv preprint, https://arxiv.org/pdf/1811.00002.pdf, 2018 Oct.