Funding Information
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2021R1A2C1014044).
References
- Cooper, E., Lai, C. I., Yasuda, Y., Fang, F., Wang, X., Chen, N., & Yamagishi, J. (2020, May). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6184-6188). Barcelona, Spain.
- Choi, S., Han, S., Kim, D., & Ha, S. (2020, October). Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding. Proceedings of the Interspeech 2020. Shanghai, China.
- Casanova, E., Shulby, C., Gölge, E., Müller, N. M., de Oliveira, F. S., Candido Junior, A., Soares, A. S., ... & Ponti, M. A. (2021, August-September). SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. Proceedings of the Interspeech 2021 (pp. 3645-3649). Brno, Czechia.
- Casanova, E., Weber, J., Shulby, C., Candido Junior, A., Gölge, E., & Ponti, M. A. (2022, June). YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. Proceedings of the 39th International Conference on Machine Learning (pp. 2709-2720). Baltimore, MD.
- Chung, J. S., Nagrani, A., & Zisserman, A. (2018, September). VoxCeleb2: Deep speaker recognition. Proceedings of the Interspeech (pp. 1086-1090). Hyderabad, India.
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020, October). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. Proceedings of the Interspeech (pp. 3830-3834). Shanghai, China.
- Heo, H. S., Lee, B. J., Huh, J., & Chung, J. S. (2020, October). Clova baseline system for the VoxCeleb speaker recognition challenge 2020. Proceedings of the Interspeech. Shanghai, China.
- Hu, J., Shen, L., & Sun, G. (2018, June). Squeeze-and-excitation networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7132-7141). Salt Lake City, UT.
- Hsu, W. N., Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Wang, Y., Cao, Y., ... Pang, R. (2018, April-May). Hierarchical generative modeling for controllable speech synthesis. Proceedings of the International Conference on Learning Representations. Vancouver, BC.
- Jung, J., Kim, Y., Heo, H. S., Lee, B. J., Kwon, Y., & Chung, J. S. (2022, September). Pushing the limits of raw waveform speaker recognition. Proceedings of the Interspeech 2022 (pp. 2228-2232). Incheon, Korea.
- Jung, J., Kim, S., Shim, H., Kim, J., & Yu, H. (2020, October). Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms. Proceedings of the Interspeech 2020 (pp. 1496-1500). Shanghai, China.
- Kwon, Y., Heo, H. S., Lee, B. J., & Chung, J. S. (2021, June). The ins and outs of speaker recognition: Lessons from VoxSRC 2020. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5809-5813). Toronto, ON.
- Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033.
- Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019, July). Neural speech synthesis with transformer network. Proceedings of the AAAI Conference on Artificial Intelligence (pp. 6706-6713). Honolulu, HI.
- Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
- Moss, H. B., Aggarwal, V., Prateek, N., Gonzalez, J., & Barra-Chicote, R. (2020, May). Boffin TTS: Few-shot speaker adaptation by Bayesian optimization. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7639-7643). Barcelona, Spain.
- Nagrani, A., Chung, J. S., & Zisserman, A. (2017, August). VoxCeleb: A large-scale speaker identification dataset. Proceedings of the Interspeech 2017 (pp. 2616-2620). Stockholm, Sweden.
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). Vancouver, BC.
- Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2020, June). FastSpeech 2: Fast and high-quality end-to-end text to speech. Retrieved from arXiv:2006.04558v8
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach, CA.
- Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2018, April). Generalized end-to-end loss for speaker verification. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4879-4883). Calgary, AB.
- Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., ... & Wu, Y. (2019, September). LibriTTS: A corpus derived from LibriSpeech for text-to-speech. Proceedings of the Interspeech 2019. Graz, Austria.
- Zhao, B., Zhang, X., Wang, J., Cheng, N., & Xiao, J. (2022, May). nnSpeech: Speaker-guided conditional variational autoencoder for zero-shot multi-speaker text-to-speech. Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4293-4297). Singapore.