Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) under the Regional University Outstanding Scientist Support Program (NRF-2020R1I1A3052136).