Performance comparison of various deep neural network architectures using Merlin toolkit for a Korean TTS system

  • 홍준영 (세림티에스지(주))
  • 권철홍 (Department of Electronics and Information Communication Engineering, Daejeon University)
  • Received : 2019.01.31
  • Accepted : 2019.03.27
  • Published : 2019.06.30

Abstract

In this paper, we build a Korean text-to-speech (TTS) system using the Merlin toolkit, an open-source system for speech synthesis. HMM-based statistical parametric speech synthesis is widely used in TTS systems, but the quality of the synthesized speech is known to degrade because of limitations in the acoustic modeling scheme that incorporates context factors. We therefore propose acoustic modeling architectures based on deep neural network techniques, which have shown excellent performance in many fields. The architectures include a fully connected deep feedforward neural network (DNN), a recurrent neural network (RNN), a gated recurrent unit (GRU), a long short-term memory network (LSTM), and a bidirectional LSTM (BLSTM). Experimental results show that performance improves when sequence modeling is included in the architecture, with the LSTM- and BLSTM-based architectures performing best. Including delta and delta-delta components in the acoustic feature parameters is also found to improve performance.
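To make the compared topologies concrete, the sketch below shows an acoustic model of the general shape used in Merlin-style systems: a stack of fully connected layers mapping frame-level linguistic features to acoustic features, optionally topped with a sequence-modeling layer (RNN, GRU, LSTM, or BLSTM). It is a minimal PyTorch sketch for illustration only; the layer sizes, feature dimensions, and class names are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Illustrative acoustic model: feedforward layers plus an optional recurrent top layer.

    Hyper-parameters (layer sizes, feature dimensions) are assumptions for
    illustration, not the values used in the paper.
    """
    def __init__(self, in_dim=425, out_dim=187, hidden=1024,
                 recurrent="LSTM", bidirectional=False):
        super().__init__()
        # Stack of fully connected tanh layers, as in a plain DNN baseline.
        self.ff = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Optional sequence-modeling layer on top of the feedforward stack.
        rnn_cls = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}.get(recurrent)
        self.rnn = (rnn_cls(hidden, hidden, batch_first=True,
                            bidirectional=bidirectional)
                    if rnn_cls else None)
        rnn_out = hidden * (2 if (self.rnn is not None and bidirectional) else 1)
        # Linear output layer predicting acoustic features frame by frame.
        self.out = nn.Linear(rnn_out if self.rnn is not None else hidden, out_dim)

    def forward(self, x):            # x: (batch, frames, linguistic features)
        h = self.ff(x)
        if self.rnn is not None:
            h, _ = self.rnn(h)
        return self.out(h)

# Example: BLSTM variant (bidirectional LSTM as the sequence-modeling layer).
model = AcousticModel(recurrent="LSTM", bidirectional=True)
dummy = torch.randn(2, 100, 425)     # 2 utterances, 100 frames each
print(model(dummy).shape)            # torch.Size([2, 100, 187])
```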

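The abstract also reports that appending delta and delta-delta components to the static acoustic features helps. A minimal NumPy sketch of that step is given below; the regression windows are the ones commonly used in HTS/Merlin-style systems and are assumed here rather than quoted from the paper.

```python
import numpy as np

def append_deltas(static, delta_win=(-0.5, 0.0, 0.5),
                  delta2_win=(1.0, -2.0, 1.0)):
    """Append delta and delta-delta components to a (frames, dims) feature matrix.

    The regression windows are the ones commonly used in HTS/Merlin-style
    systems; they are stated here as an assumption, not taken from the paper.
    """
    def convolve_frames(x, win):
        # Apply the 3-point window along the time axis, repeating edge frames.
        padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
        return sum(w * padded[i:i + len(x)] for i, w in enumerate(win))

    delta = convolve_frames(static, delta_win)
    delta2 = convolve_frames(static, delta2_win)
    return np.hstack([static, delta, delta2])

# Example: 100 frames of 60-dimensional mel-cepstra -> 180-dimensional vectors.
mgc = np.random.randn(100, 60).astype(np.float32)
print(append_deltas(mgc).shape)      # (100, 180)
```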
