References
- Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, "Trainable frontend for robust and far-field keyword spotting," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5670-5674, 2017.
- W. Fan, X. Xu, B. Cai, and X. Xing, "ISNet: Individual standardization network for speech emotion recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.30, pp.1803-1814, 2022. https://doi.org/10.1109/TASLP.2022.3171965
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3156-3164, 2015.
- P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, "Hierarchical recurrent neural encoder for video representation with application to captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1029-1038, 2016.
- D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, pp.173-182, 2016.
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, Vol.33, pp.12449-12460, 2020.
- J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4779-4783, 2018.
- Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "FastSpeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, Vol.32, 2019.
- J. Yeung and G. Bae, Forever young, beautiful and scandal-free: The rise of South Korea's virtual influencers [Internet], https://edition.cnn.com/style/article/south-korea-virtual-influencers-beauty-social-media-intl-hnk-dst/index.html
- J. Zong, C. Lee, A. Lundgard, J. W. Jang, D. Hajas, and A. Satyanarayan, "Rich screen reader experiences for accessible data visualization," Computer Graphics Forum, Vol.41, No.3, 2022.
- K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in International Conference on Machine Learning, pp.5210-5219, 2019.
- S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3893-3896, 2009.
- B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.29, pp.132-157, 2020. https://doi.org/10.1109/TASLP.2020.3038524
- D. A. Reynolds, "Gaussian mixture models," Encyclopedia of Biometrics, Vol.741, pp.659-663, 2009. https://doi.org/10.1007/978-0-387-73003-5_196
- S. Mobin and J. Bruna, "Voice conversion using convolutional neural networks," arXiv preprint, 2016. [Internet], https://arxiv.org/abs/1610.08927
- J. Lai, B. Chen, T. Tan, S. Tong, and K. Yu, "Phone-aware LSTM-RNN for voice conversion," in 2016 IEEE 13th International Conference on Signal Processing, pp.177-182, 2016.
- A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Processing Magazine, Vol.35, No.1, pp.53-65, 2018. https://doi.org/10.1109/MSP.2017.2765202
- M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint, 2014. [Internet], https://arxiv.org/abs/1411.1784
- J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp.2223-2232, 2017.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," Advances in Neural Information Processing Systems, Vol.30, 2017.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," Advances in Neural Information Processing Systems, Vol.30, 2017.
- H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in 2018 IEEE Spoken Language Technology Workshop (SLT), pp.266-273, 2018.
- Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.8789-8797, 2018.
- M. S. Al-Radhi, T. G. Csapó, and G. Németh, "Parallel voice conversion based on a continuous sinusoidal model," in 2019 International Conference on Speech Technology and Human-Computer Dialogue, pp.1-6, 2019.
- Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, "Nonparallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5274-5278, 2018.
- W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, "Pretraining techniques for sequence-to-sequence voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.29, pp.745-755, 2021. https://doi.org/10.1109/TASLP.2021.3049336
- S. Lee, B. Ko, K. Lee, I. C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.6279-6283, 2020.
- J. W. Jung, Y. J. Kim, H. S. Heo, B. J. Lee, Y. Kwon, and J. S. Chung, "Pushing the limits of raw waveform speaker recognition," in Proceedings of Interspeech, pp.2228-2232, 2022.
- M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, Vol.E99-D, No.7, pp.1877-1884, 2016. https://doi.org/10.1587/transinf.2015EDP7457
- E. O. Brigham and R. E. Morrow, "The fast Fourier transform," IEEE Spectrum, Vol.4, No.12, pp.63-70, 1967. https://doi.org/10.1109/MSPEC.1967.5217220
- M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, Vol.70, pp.214-223, 2017.
- S. H. Gao, M. M. Cheng, K. Zhao, and X. W. Hu, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.43, No.2, pp.652-662, 2019. https://doi.org/10.1109/TPAMI.2019.2938758
- J. W. Jung, S. B. Kim, H. J. Shim, J. H. Kim, and H. J. Yu, "Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms," in Proceedings of Interspeech, pp.1496-1500, 2020.
- B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proceedings of Interspeech, pp.3830-3834, 2020.
- J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.7132-7141, 2018.
- J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2018), pp.195-202, 2018.
- K. Zhou, B. Sisman, R. Liu, and H. Li, "Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.920-924, 2021.