DOI: http://dx.doi.org/10.13064/KSSS.2019.11.2.057

Performance comparison of various deep neural network architectures using Merlin toolkit for a Korean TTS system  

Hong, Junyoung (Selim TSG Co.)
Kwon, Chulhong (Department of Electronics, Information & Communication Engineering, Daejeon University)
Publication Information
Phonetics and Speech Sciences, v.11, no.2, 2019, pp. 57-64
Abstract
In this paper, we construct a Korean text-to-speech (TTS) system using the Merlin toolkit, an open-source system for speech synthesis. HMM-based statistical parametric speech synthesis has been widely used in TTS systems, but the quality of its synthesized speech is known to degrade because of limitations in the acoustic modeling scheme, which relies on context factors. We therefore propose acoustic modeling architectures based on deep neural network techniques, which have shown excellent performance in many fields. The architectures include the fully connected deep feedforward neural network (DNN), the recurrent neural network (RNN), the gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (BLSTM). Experimental results show that performance improves when sequence modeling is included in the architecture, and that the architectures with LSTM or BLSTM perform best. We also found that including delta and delta-delta components in the acoustic feature parameters is advantageous for performance.
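
As a rough illustration of the comparison described in the abstract, the sketch below (not the authors' code and not part of Merlin) builds the five frame-level acoustic models with tf.keras: linguistic features in, vocoder parameters out, trained with frame-wise MSE as in statistical parametric synthesis. The feature dimensions and layer sizes are hypothetical placeholders; Merlin derives them from the question set and the vocoder configuration.

# Illustrative sketch only: the five acoustic-model architectures compared
# in the paper, expressed with tf.keras. All dimensions are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

LING_DIM = 425   # frame-level linguistic input features (assumed)
ACOU_DIM = 187   # vocoder parameters + deltas + delta-deltas (assumed)

def build_model(kind: str) -> tf.keras.Model:
    inp = layers.Input(shape=(None, LING_DIM))  # (frames, features)
    x = inp
    # Shared feedforward stack; Dense applies per frame (assumed sizes).
    for _ in range(4):
        x = layers.Dense(512, activation="tanh")(x)
    # Top layer selects the architecture variant being compared.
    if kind == "DNN":
        x = layers.Dense(512, activation="tanh")(x)
    elif kind == "RNN":
        x = layers.SimpleRNN(256, return_sequences=True)(x)
    elif kind == "GRU":
        x = layers.GRU(256, return_sequences=True)(x)
    elif kind == "LSTM":
        x = layers.LSTM(256, return_sequences=True)(x)
    elif kind == "BLSTM":
        x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    out = layers.Dense(ACOU_DIM, activation="linear")(x)  # regression head
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")  # frame-wise MSE
    return model

Under this framing, the delta and delta-delta components mentioned in the abstract simply widen the output dimension, with a parameter-generation step (e.g., MLPG with dynamic features, as in Tokuda et al., 1995) smoothing the final trajectories.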
Keywords
deep neural networks; Merlin toolkit; text-to-speech (TTS)
References
1 Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.
2 Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
3 Chung, J., Gulcehre, C., Cho, K. H., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from https://arxiv.org/abs/1412.3555
4 CSTR [The Center for Speech Technology Research]. (2014). Festival: The Festival speech synthesis system (version 2.4) [Computer program]. Retrieved from http://www.cstr.ed.ac.uk/projects/festival/
5 CSTR [The Center for Speech Technology Research]. (2018a). Ossian: A Python-based tool for automatically building speech synthesis front ends [Computer program]. Retrieved from https://github.com/CSTR-Edinburgh/Ossian/
6 CSTR [The Center for Speech Technology Research]. (2018b). The Merlin toolkit [Computer program]. Retrieved from https://github.com/CSTR-Edinburgh/merlin/tree/master/egs/build_your_own_voice/
7 Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
8 Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373-376).
9 Imai, S., & Kobayashi, T. (2017). SPTK: Speech signal processing toolkit (version 3.11) [Computer program]. Retrieved from http://sp-tk.sourceforge.net/
10 Kawahara, H., Masuda-Katsuse, I., & de Cheveigne, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3-4), 187-207.
11 Kubichek, R. (1993). Mel-cepstral distance measure for objective speech quality assessment. Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (pp. 125-128). Victoria, BC, Canada.
12 Ling, Z. H., Kang, S. Y., Zen, H., Senior, A., Schuster, M., Qian, X. J., Meng, H. M., & Deng, L. (2015). Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3), 35-52.
13 Luo, Z., Takiguchi, T., & Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. Proceedings of the IEEE/ACIS 15th International Conference on Computer and Information Science (pp. 1-5). Okayama, Japan.
14 Merritt, T., Latorre, J., & King, S. (2015). Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 4220-4224). Brisbane, Australia.
15 Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D(7), 1877-1884.
16 Najafabadi, M., Villanustre, F., Khoshgoftaar, T., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1-21.
17 Nitech [Nagoya Institute of Technology]. (2015). HTS: HMM/DNN-based speech synthesis system (version 2.3) [Computer program]. Retrieved from http://hts.sp.nitech.ac.jp/
18 Riedi, M. (1995). A neural-network-based model of segmental duration for speech synthesis. Proceedings of Eurospeech 1995 (pp. 599-602).
19 Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.
20 Tokuda, K., Kobayashi, T., & Imai, S. (1995). Speech parameter generation from HMM using dynamic features. Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 660-663). Detroit, MI.
21 Weijters, T., & Thole, J. (1993). Speech synthesis with artificial neural networks. Proceedings of the International Conference on Neural Networks (pp. 1764-1769). San Diego, CA.
22 Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications (pp. 433-486). Hillsdale, NJ: Lawrence Erlbaum Associates.
23 Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 202-207).
24 Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proceedings of Eurospeech 1999 (pp. 2347-2350).
25 Yu, K., Zen, H., Mairesse, F., & Young, S. (2011). Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Communication, 53(6), 914-923.
26 Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 7962-7966). Vancouver, BC.
27 Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064.