Designing a large recording script for open-domain English speech synthesis

Kim, Sunhee (Department of French Language Education, Seoul National University)
Kim, Hojeong (Department of Foreign Language Education, Seoul National University)
Lee, Yooseop (Department of French Language Education, Seoul National University)
Kim, Boryoung (Department of French Language Education, Seoul National University)
Won, Yongkook (Center for Educational Research, Seoul National University)
Kim, Bongwan (Kakao Enterprise Corp.)
1. Santen, J. V., & Buchsbaum, A. (1997, September). Methods for optimal text selection. Proceedings of the 5th European Conference on Speech Communication and Technology (pp. 553-556). Rhodes, Greece.
2. Torres, H. M., Gurlekian, J. A., Evin, D. A., & Mercado, C. G. C. (2019). Emilia: A speech corpus for Argentine Spanish text to speech synthesis. Language Resources and Evaluation, 53(3), 419-447.
3. Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017, August). Tacotron: Towards end-to-end speech synthesis. Proceedings of the Interspeech 2017 (pp. 4006-4010). Stockholm, Sweden.
4. Dong, M., Cen, L., Chan, P., & Li, H. (2009). Readability consideration in speech synthesis recording script selection. International Journal on Asian Language Processing, 19(2), 45-54.
5. Gallegos, P. O., Williams, J., Rownicka, J., & King, S. (2020, October). An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets. Proceedings of the Interspeech 2020 (pp. 1758-1762). Shanghai, China.
6. Honnet, P. E., Lazaridis, A., Garner, P. N., & Yamagishi, J. (2017). The SIWIS French speech synthesis database: Design and recording of a high quality French database for speech synthesis. Retrieved from https://infoscience.epfl.ch/record/225946
7. Klare, G. R. (1974-1975). Assessing readability. Reading Research Quarterly, 10(1), 62-102.
8. Kawai, H., Yamamoto, S., Higuchi, N., & Shimizu, T. (2000, October). A design method of speech corpus for text-to-speech synthesis taking account of prosody. Proceedings of the 6th International Conference on Spoken Language Processing (pp. 420-425). Beijing, China.
9. Chevelu, J., & Lolive, D. (2015, September). Do not build your TTS training corpus randomly. Proceedings of the 23rd European Signal Processing Conference (EUSIPCO) (pp. 350-354). Nice, France.
10. King, S. (2014). Measuring a decade of progress in text-to-speech. Loquens, 1(1), e006.
11. Tao, J., Liu, F., Zhang, M., & Jia, H. (2008, October). Design of speech corpus for Mandarin text to speech. Proceedings of the Blizzard Challenge 2008 Workshop (pp. 1-4). Brisbane, Australia.
12. Kominek, J., & Black, A. W. (2004, June). The CMU Arctic speech databases. Proceedings of the 5th ISCA ITRW Speech Synthesis (pp. 223-224). Pittsburgh, PA.
13. Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., & Wu, Y. (2019, September). LibriTTS: A corpus derived from LibriSpeech for text-to-speech. Proceedings of the Interspeech 2019 (pp. 1526-1530). Graz, Austria.
14. Zhu, W., Zhang, W., Shi, Q., Chen, F., Li, H., Ma, X., & Shen, L. (2002, September). Corpus building for data-driven TTS systems. Proceedings of the 2002 IEEE Workshop on Speech Synthesis (pp. 199-202). Santa Monica, CA.
15. Kim, S., Kim, J., Kim, S., & Kim, H. (2013, November). Recording script design for speech corpus of English news reading TTS. Proceedings of the 2013 Autumn Conference of Acoustical Society of Korea (pp. 49-52). Jeju, Korea.
16. Möbius, B. (2000). Corpus-based speech synthesis: Methods and challenges. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung, 6(4), 87-116.
17. Kuo, F. Y., Ouyang, I. C., Aryal, S., & Lanchantin, P. (2019, September). Selection and training schemes for improving TTS voice built on found data. Proceedings of the Interspeech 2019 (pp. 1516-1520). Graz, Austria.
18. Matoušek, J., Psutka, J., & Kruta, J. (2001, September). Design of speech corpus for text-to-speech synthesis. Proceedings of the Eurospeech 2001 (pp. 2047-2050). Aalborg, Denmark.
19. Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
20. Prahallad, K., & Black, A. W. (2011). Segmentation of monologues in audio books for building synthetic voices. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1444-1449.
21. Watts, O., Stan, A., Clark, R., Mamiya, Y., Giurgiu, M., Yamagishi, J., & King, S. (2013, September). Unsupervised and lightly supervised learning for rapid construction of TTS systems in multiple languages from 'found' data: Evaluation and analysis. Proceedings of the 8th ISCA Speech Synthesis Workshop (pp. 101-106). Barcelona, Spain.
22. Bozkurt, B., Ozturk, O., & Dutoit, T. (2003, September). Text design for TTS speech corpus building using a modified greedy selection. Proceedings of the Eurospeech 2003 (pp. 277-280). Geneva, Switzerland.
23. Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., ... Shoeybi, M. (2017, August). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning, PMLR 70 (pp. 195-204). Sydney, Australia.
24. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Beijing, China: O'Reilly Media.
25. Bonafonte, A., Höge, H., Tropf, H. S., Moreno, A., van den Heuvel, H., Sündermann, D., ... Kiss, I. (2005). TTS baselines and specifications (Report No. FP6-506738). Retrieved from https://docsbay.net/tc-star-projectdeliverable-no-d8title-tts-baselines-specifications
26. Park, K., & Kim, J. (2019). g2pE: A simple Python module for English grapheme to phoneme conversion. Retrieved from https://github.com/Kyubyong/g2p
27. Kominek, J., & Black, A. W. (2003). CMU Arctic database for speech synthesis (Report No. CMU-LTI-03-177). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=6699C4E348169581A2EED5E3041C1C81?doi=10.1.1.64.8827&rep=rep1&type=pdf
28. Nation, P. (n.d.). Vocabulary lists. Retrieved from https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists
29. News articles [dataset]. (2018, May). Retrieved from https://www.kaggle.com/harishcscode/all-news-articles-from-home-page-media-house/version/1
30. Park, K., & Mulc, T. (2019, September). CSS10: A collection of single speaker speech datasets for 10 languages. Proceedings of the Interspeech 2019 (pp. 1566-1570). Graz, Austria.
31. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S. Y., & Sainath, T. (2019). Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2), 206-219.