End-to-end non-autoregressive fast text-to-speech

  • Kim, Wiback (Department of English Language and Literature, Korea University)
  • Nam, Hosung (Department of English Language and Literature, Korea University)
  • Received : 2021.08.01
  • Accepted : 2021.10.04
  • Published : 2021.12.31

Abstract

Autoregressive text-to-speech (TTS) models suffer from two inherent problems: inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t corrupts all of the subsequent predictions. Slow inference speed arises because predicting the sample at time step t requires the predictions for time steps 1 through t-1 to be completed first. As a solution to these problems, this study proposes an end-to-end non-autoregressive fast text-to-speech model. The proposed model achieves a Mean Opinion Score (MOS) close to that of Tacotron 2 - WaveNet, while its inference speed and stability are higher than those of Tacotron 2 - WaveNet. Building on the proposed model, this study aims to offer insight into the improvement of non-autoregressive TTS models.
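To make the contrast in the abstract concrete, the following minimal Python sketch illustrates why autoregressive decoding is sequential and error-propagating while non-autoregressive decoding is not. It is an illustration only, not the paper's implementation; step_fn, parallel_fn, and the toy lambdas below are hypothetical stand-ins for a trained acoustic model.

import numpy as np

def autoregressive_decode(step_fn, length):
    # The sample at time step t is conditioned on the samples from time
    # steps 1..t-1, so one bad prediction propagates to every later step
    # and the loop requires `length` sequential model calls.
    samples = []
    for t in range(length):
        samples.append(step_fn(np.array(samples)))
    return np.array(samples)

def non_autoregressive_decode(parallel_fn, length):
    # Every time step is predicted in a single pass from the text-side
    # conditioning alone: one bad frame cannot corrupt the others, and
    # the whole sequence is produced in one model call.
    return parallel_fn(length)

# Hypothetical stand-ins for a trained model, for illustration only.
toy_step = lambda prev: 0.9 * prev.sum() + 1.0  # depends on past outputs
toy_parallel = lambda n: np.ones(n)             # ignores past outputs

print(autoregressive_decode(toy_step, 5))          # length sequential calls
print(non_autoregressive_decode(toy_parallel, 5))  # one parallel call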


References

  1. Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., ... Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. Retrieved from https://arxiv.org/abs/1702.07825
  2. Cho, K. (2013). Boltzmann machines and denoising autoencoders for image denoising. Retrieved from https://arxiv.org/abs/1301.3468
  3. Dvorak, J. L. (2011). Moving wearables into the mainstream: Taming the Borg. New York, NY: Springer.
  4. Griffin, D., & Lim, J. (1983, April). Signal estimation from modified short-time Fourier transform. Proceedings of the 8th International Conference on Acoustics, Speech, and Signal Processing (pp. 804-807). Boston, MA.
  5. Holmes, J., & Holmes, W. (2002). Speech synthesis and recognition. London, UK: CRC Press.
  6. Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brebisson, A., ... Courville, A. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Retrieved from https://arxiv.org/abs/1910.06711
  7. Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Retrieved from https://arxiv.org/abs/1905.09263
  8. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Retrieved from https://arxiv.org/abs/1712.05884
  9. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Retrieved from https://arxiv.org/abs/1409.3215
  10. Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. Retrieved from https://arxiv.org/abs/1710.08969
  11. Valle, R., Shih, K., Prenger, R., & Catanzaro, B. (2020). Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. Retrieved from https://arxiv.org/abs/2005.05957
  12. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762
  13. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
  14. van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., ... Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. Retrieved from https://arxiv.org/abs/1711.10433
  15. Wang, T., Liu, X., Tao, J., Yi, J., Fu, R., & Wen, Z. (2020, October). Non-autoregressive end-to-end TTS with coarse-to-fine decoding. Proceedings of the 21st Annual Conference of the International Speech Communication Association (pp. 3984-3988). Shanghai, China.
  16. Wang, Y., Skerry-Ryan, RJ., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from https://arxiv.org/abs/1703.10135
  17. Yamamoto, R., Song, E., & Kim, J. M. (2019). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Retrieved from https://arxiv.org/abs/1910.11480
  18. Yarrington, D. (2007). Synthesizing speech for communication devices. In K. Greenebaum, & R. Barzel (Eds.), Audio anecdotes: Tools, tips and techniques for digital audio (Vol. 3, pp. 143-155). Wellesley, MA: AK Peters.