Unit Generation Based on Phrase Break Strength and Pruning for Corpus-Based Text-to-Speech

  • Kim, Sang-Hun (Speech Information Technology Research Center, ETRI)
  • Lee, Young-Jik (Speech Information Technology Research Center, ETRI)
  • Hirose, Keikichi (Department of Electrical, Electronic, Information and Communication Engineering, Graduate School of Engineering, University of Tokyo)
  • Received: 2001.05.16
  • Published: 2001.12.30

Abstract

This paper addresses two issues in corpus-based speech synthesis: generating synthesis units based on phrase break strength information, and pruning redundant synthesis unit instances. First, a new sentence set for recording was designed to build an efficient synthesis database that reflects the characteristics of the Korean language. To obtain prosodic-context-sensitive units, we graded major prosodic phrases into five distinct levels according to pause length and then used these levels to discriminate intra-word triphones. Synthetic speech was generated from the synthesis units carrying phrase break strength information and evaluated subjectively. Second, a new pruning method based on weighted vector quantization (WVQ) was proposed to eliminate redundant synthesis unit instances from the synthesis database. WVQ takes the relative importance of each instance into account when clustering similar instances with the vector quantization (VQ) technique. The proposed method was compared, through objective and subjective evaluations of synthetic speech quality, with two conventional pruning methods: one that simply limits the maximum number of instances, and one based on ordinary VQ clustering. At the same instance reduction rate, the proposed method performed best. Synthetic speech produced at a 45% reduction rate showed almost no perceptible degradation compared with synthetic speech produced without instance reduction.
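
The pruning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how instance importance is measured or how WVQ is trained, so the code assumes a plain Lloyd-style (k-means) codebook in which each centroid is the weighted mean of its members, and it keeps the one instance nearest each centroid. The function name `weighted_vq_prune`, the feature vectors, and the weights are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v, centroids):
    """Index of the centroid closest to vector v."""
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))


def weighted_vq_prune(vectors, weights, k, iters=20, seed=0):
    """Cluster instances into k cells and keep one instance per cell.

    Assumption: `weights` holds a non-negative importance score per
    instance (for example, how often it was selected in synthesis).
    Returns the sorted indices of the instances that survive pruning.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initialise the codebook from k distinct instances.
    centroids = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]

    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                # Weighted centroid: important instances pull harder.
                centroids[c] = [
                    sum(weights[i] * vectors[i][d] for i in members) / total
                    for d in range(dim)
                ]

    # Final assignment, then keep the instance closest to each centroid.
    assign = [nearest(v, centroids) for v in vectors]
    kept = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            kept.append(min(members, key=lambda i: sq_dist(vectors[i], centroids[c])))
    return sorted(kept)


# Toy data: two well-separated groups of hypothetical 2-D feature vectors.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
weights = [1, 1, 5, 1, 5, 1]
kept = weighted_vq_prune(vectors, weights, k=2)
print(kept)
```

Setting every weight to the same value reduces the sketch to ordinary VQ clustering, which corresponds to the second baseline mentioned in the abstract. With unequal weights, each centroid shifts toward the heavily weighted instances, so those are the ones most likely to survive pruning.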
