Designing a large recording script for open-domain English speech synthesis

Kim, Sunhee (Department of French Language Education, Seoul National University)
Kim, Hojeong (Department of Foreign Language Education, Seoul National University)
Lee, Yooseop (Department of French Language Education, Seoul National University)
Kim, Boryoung (Department of French Language Education, Seoul National University)
Won, Yongkook (Center for Educational Research, Seoul National University)
Kim, Bongwan (Kakao Enterprise Corp.)
1. Santen, J. V., & Buchsbaum, A. (1997, September). Methods for optimal text selection. Proceedings of the 5th European Conference on Speech Communication and Technology (pp. 553-556). Rhodes, Greece.
2. Torres, H. M., Gurlekian, J. A., Evin, D. A., & Mercado, C. G. C. (2019). Emilia: A speech corpus for Argentine Spanish text to speech synthesis. Language Resources and Evaluation, 53(3), 419-447.
3. Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017, August). Tacotron: Towards end-to-end speech synthesis. Proceedings of the Interspeech 2017 (pp. 4006-4010). Stockholm, Sweden.
4. Dong, M., Cen, L., Chan, P., & Li, H. (2009). Readability consideration in speech synthesis recording script selection. International Journal on Asian Language Processing, 19(2), 45-54.
5. Gallegos, P. O., Williams, J., Rownicka, J., & King, S. (2020, October). An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets. Proceedings of the Interspeech 2020 (pp. 1758-1762). Shanghai, China.
6. Honnet, P. E., Lazaridis, A., Garner, P. N., & Yamagishi, J. (2017). The SIWIS French speech synthesis database: Design and recording of a high quality French database for speech synthesis. Retrieved from https://infoscience.epfl.ch/record/225946
7. Klare, G. R. (1974-1975). Assessing readability. Reading Research Quarterly, 10(1), 62-102.
8. Kawai, H., Yamamoto, S., Higuchi, N., & Shimizu, T. (2000, October). A design method of speech corpus for text-to-speech synthesis taking account of prosody. Proceedings of the 6th International Conference on Spoken Language Processing (pp. 420-425). Beijing, China.
9. Chevelu, J., & Lolive, D. (2015, September). Do not build your TTS training corpus randomly. Proceedings of the 23rd European Signal Processing Conference (EUSIPCO) (pp. 350-354). Nice, France.
10. King, S. (2014). Measuring a decade of progress in text-to-speech. Loquens, 1(1), e006.
11. Tao, J., Liu, F., Zhang, M., & Jia, H. (2008, October). Design of speech corpus for Mandarin text to speech. Proceedings of the Blizzard Challenge 2008 Workshop (pp. 1-4). Brisbane, Australia.
12. Kominek, J., & Black, A. W. (2004, June). The CMU Arctic speech databases. Proceedings of the 5th ISCA ITRW Speech Synthesis (pp. 223-224). Pittsburgh, PA.
13. Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., & Wu, Y. (2019, September). LibriTTS: A corpus derived from LibriSpeech for text-to-speech. Proceedings of the Interspeech 2019 (pp. 1526-1530). Graz, Austria.
14. Zhu, W., Zhang, W., Shi, Q., Chen, F., Li, H., Ma, X., & Shen, L. (2002, September). Corpus building for data-driven TTS systems. Proceedings of the 2002 IEEE Workshop on Speech Synthesis (pp. 199-202). Santa Monica, CA.
15. Kim, S., Kim, J., Kim, S., & Kim, H. (2013, November). Recording script design for speech corpus of English news reading TTS. Proceedings of the 2013 Autumn Conference of Acoustical Society of Korea (pp. 49-52). Jeju, Korea.
16. Möbius, B. (2000). Corpus-based speech synthesis: Methods and challenges. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung, 6(4), 87-116.
17. Kuo, F. Y., Ouyang, I. C., Aryal, S., & Lanchantin, P. (2019, September). Selection and training schemes for improving TTS voice built on found data. Proceedings of the Interspeech 2019 (pp. 1516-1520). Graz, Austria.
18. Matoušek, J., Psutka, J., & Kruta, J. (2001, September). Design of speech corpus for text-to-speech synthesis. Proceedings of the Eurospeech 2001 (pp. 2047-2050). Aalborg, Denmark.
19. Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
20. Prahallad, K., & Black, A. W. (2011). Segmentation of monologues in audio books for building synthetic voices. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1444-1449.
21. Watts, O., Stan, A., Clark, R., Mamiya, Y., Giurgiu, M., Yamagishi, J., & King, S. (2013, September). Unsupervised and lightly supervised learning for rapid construction of TTS systems in multiple languages from 'found' data: Evaluation and analysis. Proceedings of the 8th ISCA Speech Synthesis Workshop (pp. 101-106). Barcelona, Spain.
22. Bozkurt, B., Ozturk, O., & Dutoit, T. (2003, September). Text design for TTS speech corpus building using a modified greedy selection. Proceedings of the Eurospeech 2003 (pp. 277-280). Geneva, Switzerland.
23. Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., ... Shoeybi, M. (2017, August). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning, PMLR 70 (pp. 195-204). Sydney, Australia.
24. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Beijing, China: O'Reilly Media.
25. Bonafonte, A., Höge, H., Tropf, H. S., Moreno, A., van den Heuvel, H., Sündermann, D., ... Kiss, I. (2005). TTS baselines and specifications (Report No. FP6-506738). Retrieved from https://docsbay.net/tc-star-projectdeliverable-no-d8title-tts-baselines-specifications
26. Park, K., & Kim, J. (2019). g2pE: A simple Python module for English grapheme to phoneme conversion. Retrieved from https://github.com/Kyubyong/g2p
27. Kominek, J., & Black, A. W. (2003). CMU Arctic database for speech synthesis (Report No. CMU-LTI-03-177). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=6699C4E348169581A2EED5E3041C1C81?doi=10.1.1.64.8827&rep=rep1&type=pdf
28. Nation, P. (n.d.). Vocabulary lists. Retrieved from https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists
29. News articles [dataset]. (2018, May). Retrieved from https://www.kaggle.com/harishcscode/all-news-articles-from-home-page-media-house/version/1
30. Park, K., & Mulc, T. (2019, September). CSS10: A collection of single speaker speech datasets for 10 languages. Proceedings of the Interspeech 2019 (pp. 1566-1570). Graz, Austria.
31. Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S. Y., & Sainath, T. (2019). Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2), 206-219.