Prosodic Contour Generation for Korean Text-To-Speech System Using Artificial Neural Networks

  • Published: 2009.06.30

Abstract

To generate more natural synthetic speech with a Korean TTS (Text-To-Speech) system, we have to know all the possible prosodic rules of spoken Korean. These rules should be derived from linguistic and phonetic information or from real speech. In general, all of these rules are integrated into the prosody-generation algorithm of a TTS system. However, such an algorithm cannot cover all the possible prosodic rules of a language and is therefore not perfect, so the naturalness of the synthesized speech falls short of what we expect. ANNs (Artificial Neural Networks), on the other hand, can be trained to learn the prosodic rules of spoken Korean. To train and test the ANNs, we need the prosodic patterns of all the phonemic segments in a prosodic corpus. Such a corpus should consist of meaningful sentences that represent all the possible prosodic rules. The sentences in our corpus were constructed by selecting series of words from a list of PB (Phonetically Balanced) isolated words; they were then read by speakers, recorded, and collected into a speech database. By analyzing the recorded speech, we extract the prosodic pattern of each phoneme and assign these patterns as target and test patterns for the ANNs. The ANNs learn the prosody from natural speech and, when the phoneme string of a sentence is presented as the input stimulus, generate the prosodic pattern of the central phonemic segment of that string as the output response.
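The sketch below is not the authors' implementation; it only illustrates the kind of mapping the abstract describes, under several assumptions: a single-hidden-layer network, a one-hot coding of a fixed window of context phonemes as the input stimulus, and a small vector of prosodic targets (normalized duration plus a few F0 samples) for the central phoneme as the output response. All sizes, encodings, and the training pair are hypothetical placeholders.

```python
# Minimal sketch (assumed architecture, not the paper's): a one-hidden-layer network
# mapping a window of phoneme identities to the prosodic pattern of the central phoneme.
import numpy as np

rng = np.random.default_rng(0)

N_PHONEMES = 50   # assumed phoneme-inventory size used for one-hot coding
WINDOW = 5        # assumed context window: 2 left + central + 2 right phonemes
N_HIDDEN = 32     # assumed hidden-layer size
N_OUTPUT = 4      # assumed prosodic targets: duration + 3 F0 samples
IN_DIM = N_PHONEMES * WINDOW

# Randomly initialized weights for the input->hidden and hidden->output layers.
W1 = rng.normal(0.0, 0.1, (IN_DIM, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUTPUT)); b2 = np.zeros(N_OUTPUT)

def encode_window(phoneme_ids):
    """One-hot encode a window of phoneme indices and concatenate them."""
    x = np.zeros((WINDOW, N_PHONEMES))
    x[np.arange(WINDOW), phoneme_ids] = 1.0
    return x.ravel()

def forward(x):
    """Forward pass: sigmoid hidden layer, linear output (prosodic targets)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    return h, h @ W2 + b2

def train_step(x, target, lr=0.05):
    """One backpropagation step on a single (phoneme window, prosody) pair."""
    global W1, b1, W2, b2
    h, y = forward(x)
    err = y - target                      # squared-error gradient at the output
    dW2, db2 = np.outer(h, err), err
    dh = (err @ W2.T) * h * (1.0 - h)     # backpropagate through the sigmoid
    dW1, db1 = np.outer(x, dh), dh
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return float(0.5 * np.sum(err ** 2))

# Hypothetical training pair: phoneme-ID window and the prosodic pattern
# (normalized duration and F0 samples) measured from the recorded corpus.
window = [12, 3, 27, 8, 41]
target = np.array([0.4, 0.6, 0.65, 0.55])

x = encode_window(window)
for _ in range(200):
    loss = train_step(x, target)
print("final loss:", loss, "prediction:", forward(x)[1])
```

In practice every phoneme position in every corpus sentence would supply one such (window, prosody) training pair, and the trained network would then be run over the phoneme string of an unseen sentence to generate its prosodic contour.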
