Many-to-many voice conversion experiments using a Korean speech corpus

  • Dongsuk Yook (Artificial Intelligence Laboratory, Department of Computer Science, Korea University) ;
  • Hyungjin Seo (Artificial Intelligence Laboratory, Department of Computer Science, Korea University) ;
  • Bonggu Ko (Artificial Intelligence Laboratory, Department of Computer Science, Korea University) ;
  • In-Chul Yoo (Artificial Intelligence Laboratory, Department of Computer Science, Korea University)
  • Received : 2022.03.16
  • Accepted : 2022.05.13
  • Published : 2022.05.31

Abstract

Generative Adversarial Networks (GAN) and Variational AutoEncoders (VAE), two kinds of deep generative models, have recently been applied to voice conversion with non-parallel training data. In particular, Conditional Cycle-Consistent Generative Adversarial Networks (CC-GAN) and Cycle-Consistent Variational AutoEncoders (CycleVAE) show promising results for many-to-many voice conversion among multiple speakers. However, previous studies of CC-GAN and CycleVAE have considered relatively small numbers of speakers. In this paper, we extend the number of speakers to 100 using a Korean speech corpus and experimentally analyze the performance and scalability of these many-to-many voice conversion methods. The experiments show that CC-GAN achieves 4.5 % lower Mel-Cepstral Distortion (MCD) for a small number of speakers, whereas CycleVAE achieves 12.7 % lower MCD within a limited training time for a large number of speakers.
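For context, the MCD figures above are computed from mel-cepstral coefficients of the converted and target speech. The following is a minimal sketch of the standard frame-level MCD calculation, not the authors' evaluation code; it assumes the two feature sequences have already been time-aligned (e.g., by dynamic time warping) and that the 0th (energy) coefficient is excluded, as is common practice [23].

```python
import numpy as np

def mel_cepstral_distortion(target_mcep, converted_mcep):
    """Mean Mel-Cepstral Distortion (MCD) in dB.

    Both inputs are (frames, dims) mel-cepstral matrices that are
    assumed to be time-aligned frame by frame; column 0 is the
    energy coefficient and is excluded from the distortion.
    """
    diff = target_mcep[:, 1:] - converted_mcep[:, 1:]
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    mcd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(mcd))
```

A relative improvement such as "4.5 % less MCD" then simply compares the mean MCD of two systems over the same evaluation pairs.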

Acknowledgements

This work was carried out as a basic research project supported by the National Research Foundation of Korea (NRF) with funding from the government (Ministry of Science and ICT) in 2017 (No. NRF-2017R1E1A1A01078157).

References

  1. B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Korean voice conversion experiments using CC-GAN and VAW-GAN" (in Korean), Proc. Speech Communication and Signal Processing, 36, 39 (2019).
  2. B. Jang, H. Seo, I.-C. Yoo, and D. Yook, "CycleVAE based many-to-many voice conversion experiments using Korean speech corpus" (in Korean), J. Acoust. Soc. Kr. Suppl. 2(s) 40, 79 (2021).
  3. I.-C. Yoo, K. Lee, S.-G. Leem, H. Oh, B. Ko, and D. Yook, "Speaker anonymization for personal information protection using voice conversion techniques," IEEE Access, 8, 198637-198645 (2020). https://doi.org/10.1109/access.2020.3035416
  4. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proc. NIPS, 2672-2680 (2014).
  5. D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).
  6. J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," Proc. IEEE Int. Conf. Computer Vision, 2242-2251 (2017).
  7. T. Kaneko and H. Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," Proc. EUSIPCO, 2114-2118 (2018).
  8. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," Proc. IEEE ICASSP, 6820-6824 (2019).
  9. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC3: Examining and improving CycleGAN-VCs for Mel-spectrogram conversion," Proc. Interspeech, 2017-2021 (2020).
  10. D. Yook, I.-C. Yoo, and S. Yoo, "Voice conversion using conditional CycleGAN," Proc. Int. Conf. CSCI, 1460-1461 (2018).
  11. S. Lee, B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," Proc. IEEE ICASSP, 6279-6283 (2020).
  12. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," Proc. IEEE Workshop on SLT, 266-273 (2018).
  13. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," Proc. Interspeech, 679-683 (2019).
  14. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from non-parallel corpora using variational autoencoder," Proc. APSIPA, 1-6 (2016).
  15. A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," Proc. NIPS, 6309-6318 (2017).
  16. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," Proc. Interspeech, 3364-3368 (2017).
  17. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 27, 1432-1443 (2019).
  18. P. Tobing, Y. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "Non-parallel voice conversion with cyclic variational autoencoder," Proc. Interspeech, 674-678 (2019).
  19. D. Yook, S.-G. Leem, K. Lee, and I.-C. Yoo, "Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders," Proc. Odyssey: The Speaker and Language Recognition Workshop, 215-221 (2020).
  20. B. Ko, Many-to-many voice conversion using cycle-consistency for Korean speech (in Korean), (Master's thesis, Korea University, 2020).
  21. M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, E99-D, 1877-1884 (2016).
  22. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," Proc. ICLR, 1-13 (2015).
  23. T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. on Audio, Speech, and Lang. Process. 15, 2222-2235 (2007). https://doi.org/10.1109/TASL.2007.907344
  24. S. Takamichi, T. Toda, A. Black, G. Neubig, S. Sakti, and S. Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 24, 755-767 (2016). https://doi.org/10.1109/TASLP.2016.2522655