http://dx.doi.org/10.13064/KSSS.2020.12.3.055

Voice-to-voice conversion using transformer network  

Kim, June-Woo (Department of Artificial Intelligence, Kyungpook National University)
Jung, Ho-Young (Department of Artificial Intelligence, Kyungpook National University)
Publication Information
Phonetics and Speech Sciences, vol. 12, no. 3, 2020, pp. 55-63
Abstract
Voice conversion can be applied to various voice-processing applications and can also play an important role in data augmentation for speech recognition. Conventional methods borrow the architecture of speech synthesis for voice conversion, with the Mel filter bank as the main parameter. The Mel filter bank is well suited to fast neural-network computation, but it cannot be converted into a high-quality waveform without the aid of a vocoder, and it is not effective for obtaining data for speech recognition. In this paper, we focus on performing voice-to-voice conversion using only the raw spectrum. We propose a deep learning model based on the transformer network, which quickly learns voice-conversion properties through an attention mechanism between source and target spectral components. Experiments were performed on the TIDIGITS corpus, a set of digit sequences spoken by English speakers. The converted voices were evaluated for naturalness and similarity using mean opinion scores (MOS) obtained from 30 participants. Our final results were 3.52±0.22 for naturalness and 3.89±0.19 for similarity.
Keywords
voice conversion; transformer network; signal-to-signal conversion
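
As a rough illustration of the approach the abstract describes (not the authors' implementation), the sketch below builds a transformer encoder-decoder that maps source spectrogram frames to target spectrogram frames, so that cross-attention links source and target spectral components. All sizes, hyperparameters, and the training step are illustrative assumptions; positional encodings and decoder masking are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpecToSpecTransformer(nn.Module):
    """Minimal spectrum-to-spectrum transformer sketch.

    Maps source magnitude-spectrogram frames to target frames; the
    decoder's cross-attention relates source and target spectral
    components. Sizes are illustrative, not the paper's settings.
    (Positional encodings and causal masks omitted for brevity.)
    """
    def __init__(self, n_freq=513, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.in_proj = nn.Linear(n_freq, d_model)   # frame -> model dim
        self.out_proj = nn.Linear(d_model, n_freq)  # model dim -> frame
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)

    def forward(self, src_spec, tgt_spec):
        # src_spec, tgt_spec: (batch, time, n_freq) magnitude spectrograms
        h = self.transformer(self.in_proj(src_spec), self.in_proj(tgt_spec))
        return self.out_proj(h)

# Toy training step: L1 loss between predicted and target frames.
model = SpecToSpecTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.rand(2, 100, 513)        # source-speaker spectrogram
tgt = torch.rand(2, 120, 513)        # target-speaker spectrogram
pred = model(src, tgt[:, :-1])       # teacher forcing: shifted target input
loss = nn.functional.l1_loss(pred, tgt[:, 1:])
opt.zero_grad()
loss.backward()
opt.step()
```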
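
Because the model operates on the raw (linear) spectrum rather than Mel filter bank features, a waveform can be recovered with classical Griffin-Lim phase reconstruction instead of a neural vocoder. A minimal librosa-based sketch, assuming the usual STFT parameters (n_fft, hop length, and iteration count are illustrative):

```python
import numpy as np
import librosa

def spectrum_to_wave(mag_spec, n_fft=1024, hop_length=256, n_iter=60):
    """Griffin-Lim phase reconstruction: magnitude spectrogram -> waveform,
    with no neural vocoder. mag_spec: (n_fft // 2 + 1, frames)."""
    return librosa.griffinlim(mag_spec, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)

# Round trip on a synthetic tone: STFT magnitude -> Griffin-Lim -> waveform.
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)  # 1 s, 220 Hz
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # (513, frames)
y_hat = spectrum_to_wave(mag)  # phase estimated iteratively from magnitude
```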