DOI: http://dx.doi.org/10.13064/KSSS.2021.13.3.071

Text-to-speech with linear spectrogram prediction for quality and speed improvement  

Yoon, Hyebin (Department of English Language and Literature, Korea University)
Publication Information
Phonetics and Speech Sciences, vol. 13, no. 3, 2021, pp. 71-78
Abstract
Most neural-network-based speech synthesis models use neural vocoders to convert mel-scaled spectrograms into high-quality, human-like voices. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable memory and time during training and suffer from slow inference when a GPU is not available. This problem does not arise in linear spectrogram prediction models, which need no neural vocoder because a linear magnitude spectrogram can be inverted to a waveform directly, for example with the Griffin-Lim algorithm; these models, however, suffer from low voice quality. As a solution, this paper proposes a Tacotron 2- and Transformer-based linear spectrogram prediction model that produces high-quality speech without a neural vocoder. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.
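
To make the vocoder-free path concrete, below is a minimal sketch of the kind of inversion the abstract alludes to, using librosa's Griffin-Lim implementation. The linear spectrogram here is computed from a bundled sample recording rather than predicted by the paper's model, and the STFT settings (n_fft=1024, hop_length=256, 60 iterations) are illustrative assumptions, not the paper's configuration.

    import numpy as np
    import librosa
    import soundfile as sf

    # Load a bundled example recording and compute its linear magnitude
    # spectrogram; this stands in for the linear spectrogram that the
    # paper's Tacotron 2/Transformer model would predict from text.
    y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)
    linear_spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

    # Griffin-Lim inverts a linear (not mel) magnitude spectrogram with
    # no neural network: it alternates STFT and inverse STFT, keeping the
    # given magnitudes while iteratively refining the phase estimate.
    wav = librosa.griffinlim(linear_spec, n_iter=60, n_fft=1024,
                             hop_length=256)

    sf.write("reconstructed.wav", wav, sr)

Because each Griffin-Lim iteration is essentially a pair of FFT passes, this inversion runs quickly on a CPU, which is the speed advantage the abstract contrasts with neural vocoders.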
Keywords
speech synthesis; machine learning; artificial intelligence; text-to-speech (TTS)