References
- Biadsy, F., Weiss, R. J., Moreno, P. J., Kanevsky, D., & Jia, Y. (2019). Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. arXiv. Retrieved from: https://arxiv.org/abs/1904.04169
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. Retrieved from: https://arxiv.org/abs/1412.3555
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. Retrieved from: https://arxiv.org/abs/1810.04805
- Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243. https://doi.org/10.1109/TASSP.1984.1164317
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Huang, W. C., Hayashi, T., Wu, Y. C., Kameoka, H., & Toda, T. (2019). Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. arXiv. Retrieved from: https://arxiv.org/abs/1912.06813
- Jia, Y., Weiss, R. J., Biadsy, F., Macherey, W., Johnson, M., Chen, Z., & Wu, Y. (2019). Direct speech-to-speech translation with a sequence-to-sequence model. arXiv. Retrieved from: https://arxiv.org/abs/1904.06037
- Kim, J. W., Jung, H. Y., & Lee, M. (2020). Vocoder-free end-to-end voice conversion with transformer network. arXiv. Retrieved from: https://arxiv.org/abs/2002.03808
- Kim, Y. (2014, October). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1746-1751). Doha, Qatar.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. arXiv. Retrieved from: https://arxiv.org/abs/1412.6980
- Kwon, S., Kim, S. J., & Choeh, J. Y. (2016). Preprocessing for elderly speech recognition of smart devices. Computer Speech & Language, 36, 110-121. https://doi.org/10.1016/j.csl.2015.09.002
- Lee, J., Cho, K., & Hofmann, T. (2017). Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5, 365-378. https://doi.org/10.1162/tacl_a_00067
- Leonard, R. G., & Doddington, G. R. (1993). TIDIGITS speech corpus. Philadelphia, PA: Linguistic Data Consortium.
- Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019, January). Neural speech synthesis with transformer network. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (Vol. 33, pp. 6706-6713). Honolulu, HI.
- Liu, R., Chen, X., & Wen, X. (2020, May). Voice conversion with transformer network. Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7759-7763). Barcelona, Spain.
- McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017, August). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Proceedings of Interspeech 2017 (pp. 498-502). Stockholm, Sweden.
- Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99-D(7), 1877-1884. https://doi.org/10.1587/transinf.2015EDP7457
- Potamianos, A., Narayanan, S., & Lee, S. (1997, September). Automatic speech recognition for children. Proceedings of the 5th European Conference on Speech Communication and Technology (pp. 2371-2374). Rhodes, Greece.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved from https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681. https://doi.org/10.1109/78.650093
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Saurous, R. A. (2018, April). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4779-4783). Calgary, AB.
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems 27 (NIPS 2014) (pp. 3104-3112). Montreal, QC.
- Tanaka, K., Kameoka, H., Kaneko, T., & Hojo, N. (2019, May). AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms. Proceedings of the ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6805-6809). Brighton, UK.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems 30 (NIPS 2017) (pp. 5998-6008). Long Beach, CA.