
Prosodic Contour Generation for Korean Text-To-Speech System Using Artificial Neural Networks  

Lim, Un-Cheon (Dept. of Electronics Eng., Hoseo Univ.)
Abstract
To generate more natural synthetic speech with a Korean TTS (Text-To-Speech) system, we need to know the prosodic rules of spoken Korean. These rules can be derived from linguistic and phonetic knowledge or extracted from real speech, and in general they are integrated into a prosody-generation algorithm inside the TTS system. Such an algorithm, however, cannot cover every prosodic rule of the language, so the naturalness of the synthesized speech falls short of what we expect. ANNs (Artificial Neural Networks) can instead be trained to learn the prosodic rules of spoken Korean. To train and test the ANNs, we prepare the prosodic patterns of all phonemic segments in a prosodic corpus. The corpus contains meaningful sentences designed to represent the possible prosodic rules; these sentences were constructed by selecting series of words from a list of PB (Phonetically Balanced) isolated words, and were then read by speakers, recorded, and collected as a speech database. By analyzing the recorded speech, we extract the prosodic pattern of each phoneme and assign these patterns as target and test patterns for the ANNs. The ANNs thus learn prosody from natural speech: when the phoneme string of a sentence is presented as the input stimulus, they generate the prosodic pattern of the central phonemic segment of the string as their output response.
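The sketch below illustrates the general scheme described in the abstract: a small feed-forward network takes a window of phonemes around a central phonemic segment as input and predicts that segment's prosodic pattern. It is only an illustration under assumed details; the phoneme inventory, window size, number of F0 points per phoneme, toy training targets, and the use of scikit-learn's MLPRegressor are not taken from the paper.

# Minimal sketch (not the paper's implementation): map a one-hot encoded
# phoneme window to the central phoneme's prosodic pattern (F0 points + duration).
import numpy as np
from sklearn.neural_network import MLPRegressor

PHONEMES = ["a", "e", "i", "o", "u", "k", "n", "t", "m", "s", "#"]  # "#" = boundary (assumed inventory)
WINDOW = 5          # central phoneme plus two neighbors on each side (assumed)
N_F0_POINTS = 4     # F0 samples per phoneme in the target contour (assumed)

def encode_window(phones, center):
    """One-hot encode the phoneme window centered on index `center`."""
    vec = []
    for offset in range(-(WINDOW // 2), WINDOW // 2 + 1):
        one_hot = np.zeros(len(PHONEMES))
        idx = center + offset
        symbol = phones[idx] if 0 <= idx < len(phones) else "#"
        one_hot[PHONEMES.index(symbol)] = 1.0
        vec.append(one_hot)
    return np.concatenate(vec)

# Toy data standing in for patterns extracted from the recorded corpus:
# each target is [f0_1..f0_4 (Hz), duration (ms)] for the central phoneme.
rng = np.random.default_rng(0)
sentences = [["#", "a", "n", "i", "o", "#"], ["#", "m", "a", "s", "e", "#"]]
X, y = [], []
for phones in sentences:
    for center in range(1, len(phones) - 1):
        X.append(encode_window(phones, center))
        y.append(np.concatenate([120 + 30 * rng.random(N_F0_POINTS), [60 + 80 * rng.random()]]))
X, y = np.array(X), np.array(y)

# Train the network on (input window, prosodic pattern) pairs.
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X, y)

# Generate the prosodic pattern of a central phoneme from an unseen phoneme string.
test = encode_window(["#", "m", "i", "n", "o", "#"], center=2)
print(net.predict(test.reshape(1, -1)))   # predicted [f0_1..f0_4, duration]

In a real system the targets would come from pitch and duration analysis of the recorded corpus rather than random placeholders, and the predicted contours would drive the synthesizer's prosody module.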
Keywords
Korean TTS; Prosody; ANNs; Prosodic Corpus