http://dx.doi.org/10.9708/jksci/2012.17.9.075

Improvement of Naturalness for a HMM-based Korean TTS using the prosodic boundary information  

Lim, Gi-Jeong (School of Electrical Engineering, University of Ulsan)
Lee, Jung-Chul (School of Electrical Engineering, University of Ulsan)
Abstract
HMM-based text-to-speech (TTS) systems generally use context-dependent tri-phone units from a large speech corpus to enhance the quality of synthetic speech. To reduce the size of such a corpus, acoustically similar tri-phone units are clustered with a decision tree that uses context-dependent information. This information includes the phoneme sequence as well as prosodic information, because the naturalness of synthetic speech depends heavily on prosody such as pauses, intonation patterns, and segmental durations. However, if the prosodic information is too complex, many context-dependent phonemes have no examples in the training data, and clustering yields smoothed features that produce unnatural synthetic speech. In this paper, instead of complicated prosodic information, we propose three simple prosodic boundary types (rising tone, falling tone, and monotonic tone) and decision tree questions based on them to improve naturalness. Experimental results show that the proposed method improves the naturalness of an HMM-based Korean TTS and achieves a high mean opinion score (MOS) in the perception test.
Keywords
HTS; HMM; tri-phone; decision tree-based clustering
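The idea in the abstract, asking decision tree questions about a small set of prosodic boundary types, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Unit` record, the boundary tags `"R"`, `"F"`, and `"M"`, the question names, and the toy F0-slope values. The split criterion is a plain reduction in squared error, used as a simplified stand-in for the likelihood-based criteria of HTK/HTS-style clustering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One tri-phone example with a hypothetical prosodic boundary tag."""
    left: str
    center: str
    right: str
    boundary: str    # assumed tags: "R" rising, "F" falling, "M" monotonic
    f0_slope: float  # toy acoustic feature (made-up values)


# Hypothetical yes/no questions; three of them test the boundary type.
QUESTIONS = {
    "Boundary==Rising": lambda u: u.boundary == "R",
    "Boundary==Falling": lambda u: u.boundary == "F",
    "Boundary==Monotonic": lambda u: u.boundary == "M",
    "Center==a": lambda u: u.center == "a",
}


def sse(units):
    """Sum of squared deviations of f0_slope from the cluster mean."""
    if not units:
        return 0.0
    values = [u.f0_slope for u in units]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_question(units, questions=QUESTIONS):
    """Return (name, gain) of the question giving the largest SSE reduction.

    Returns None when no question splits the units into two non-empty sets.
    """
    parent = sse(units)
    best = None
    for name, ask in questions.items():
        yes = [u for u in units if ask(u)]
        no = [u for u in units if not ask(u)]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        gain = parent - sse(yes) - sse(no)
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Toy data: the feature depends on the boundary tag, not on the phoneme.
UNITS = [
    Unit("k", "a", "n", "R", 30.0),
    Unit("k", "n", "a", "R", 28.0),
    Unit("m", "a", "l", "F", -30.0),
    Unit("m", "n", "a", "F", -29.0),
    Unit("s", "a", "t", "M", 0.0),
    Unit("s", "n", "a", "M", 1.0),
]

if __name__ == "__main__":
    name, gain = best_question(UNITS)
    print(f"best first split: {name} (gain {gain:.2f})")
```

On this toy data a boundary question wins the first split because the feature varies with the boundary tag rather than with the phoneme identity. With only three boundary tags, each answer still covers several training examples, which is the data-sparsity argument made in the abstract.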