Voice-to-voice conversion using transformer network

Transformer 네트워크를 이용한 음성신호 변환

  • Kim, June-Woo (Department of Artificial Intelligence, Kyungpook National University) ;
  • Jung, Ho-Young (Department of Artificial Intelligence, Kyungpook National University)
  • 김준우 (경북대학교 인공지능학과) ;
  • 정호영 (경북대학교 인공지능학과)
  • Received : 2020.07.30
  • Accepted : 2020.09.17
  • Published : 2020.09.30


Voice conversion can be applied to various voice processing applications. It can also play an important role in data augmentation for speech recognition. The conventional method uses the architecture of voice conversion with speech synthesis, with Mel filter bank as the main parameter. Mel filter bank is well-suited for quick computation of neural networks but cannot be converted into a high-quality waveform without the aid of a vocoder. Further, it is not effective in terms of obtaining data for speech recognition. In this paper, we focus on performing voice-to-voice conversion using only the raw spectrum. We propose a deep learning model based on the transformer network, which quickly learns the voice conversion properties using an attention mechanism between source and target spectral components. The experiments were performed on TIDIGITS data, a series of numbers spoken by an English speaker. The conversion voices were evaluated for naturalness and similarity using mean opinion score (MOS) obtained from 30 participants. Our final results yielded 3.52±0.22 for naturalness and 3.89±0.19 for similarity.

음성 변환은 다양한 음성 처리 응용에 적용될 수 있으며, 음성 인식을 위한 학습 데이터 증강에도 중요한 역할을 할 수 있다. 기존의 방법은 음성 합성을 이용하여 음성 변환을 수행하는 구조를 사용하여 멜 필터뱅크가 중요한 파라미터로 활용된다. 멜 필터뱅크는 뉴럴 네트워크 학습의 편리성 및 빠른 연산 속도를 제공하지만, 자연스러운 음성파형을 생성하기 위해서는 보코더를 필요로 한다. 또한, 이 방법은 음성 인식을 위한 다양한 데이터를 얻는데 효과적이지 않다. 이 문제를 해결하기 위해 본 논문은 원형 스펙트럼을 사용하여 음성 신호 자체의 변환을 시도하였고, 어텐션 메커니즘으로 스펙트럼 성분 사이의 관계를 효율적으로 찾아내어 변환을 위한 자질을 학습할 수 있는 transformer 네트워크 기반 딥러닝 구조를 제안하였다. 영어 숫자로 구성된 TIDIGITS 데이터를 사용하여 개별 숫자 변환 모델을 학습하였고, 연속 숫자 음성 변환 디코더를 통한 결과를 평가하였다. 30명의 청취 평가자를 모집하여 변환된 음성의 자연성과 유사성에 대해 평가를 진행하였고, 자연성 3.52±0.22 및 유사성 3.89±0.19 품질의 성능을 얻었다.



  1. Biadsy, F., Weiss, R. J., Moreno, P. J., Kanvesky, D., & Jia, Y. (2019). Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. arXiv. Retrieved from:
  2. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. Retrieved from:
  3. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. Retrieved from:
  4. Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243.
  5. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
  6. Huang, W. C., Hayashi, T., Wu, Y. C., Kameoka, H., & Toda, T. (2019). Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. arXiv. Retrieved from:
  7. Jia, Y., Weiss, R. J., Biadsy, F., Macherey, W., Johnson, M., Chen, Z., & Wu, Y. (2019). Direct speech-to-speech translation with a sequence-to-sequence model. arXiv. Retrieved from:
  8. Kim, J. W., Jung, H. Y., & Lee, M. (2020). Vocoder-free end-to-end voice conversion with transformer Network. arXiv. Retrieved from:
  9. Kim, Y. (2014, October). Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1746-1751). Doha, Qatar.
  10. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. arXiv. Retrieved from: abs/1412.6980
  11. Kwon, S., Kim, S. J., & Choeh, J. Y. (2016). Preprocessing for elderly speech recognition of smart devices. Computer Speech & Language, 36, 110-121.
  12. Lee, J., Cho, K., & Hofmann, T. (2017). Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5, 365-378.
  13. Leonard, R. G., & Doddington, G. R. (1993). Tidigits speech corpus. Philadelphia, PA: Texas Instruments.
  14. Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019, July). Neural speech synthesis with transformer network. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (Vol. 33, pp. 6706-6713). Hawaii, HI.
  15. Liu, R., Chen, X., & Wen, X. (2020, May). Voice conversion with transformer Network. Proceedings of the ICASSP 2020−2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7759-7759). Barcelona, Spain.
  16. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017, August). Montreal forced aligner: Trainable text-speech alignment using Kaldi. In Interspeech (Vol. 2017, pp. 498-502). Stockholm, Sweden.
  17. Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
  18. Potamianos, A., Narayanan, S., & Lee, S. (1997, September). Automatic speech recognition for children. Proceedings of the 5th European Conference on Speech Communication and Technology (pp. 2371-2374). Rhodes, Greece.
  19. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Retrieved from https://s3-us-west-2. amazonaws. com/openaiassets/researchcovers/languageunsupervised/language understandingpaper.pdf
  20. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.
  21. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Saurous, R. A. (2018, April). Natural tts synthesis by conditioning wavenet on MEL spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4779-4783). Calgary, AB.
  22. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems 27 (NIPS 2014) (pp. 3104-3112). San Mateo, CA.
  23. Tanaka, K., Kameoka, H., Kaneko, T., & Hojo, N. (2019, May). AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms. Proceedings of the ICASSP 2019−2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6805-6809). Brighton, UK.
  24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008). San Mateo, CA.