References
- A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp.1097-1105, 2012.
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [Internet], http://www.image-net.org/challenges/LSVRC/
- Detection and Classification of Acoustic Scenes and Events (DCASE) [Internet], http://dcase.community/
- TensorFlow Speech Recognition Challenge [Internet], https://www.kaggle.com/c/tensorflow-speech-recognition-challenge
- I. Goodfellow, et al., "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp.2672-2680, 2014.
- A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," arXiv preprint arXiv:1809.11096, 2018.
- Y. Wu, J. Donahue, D. Balduzzi, K. Simonyan, and T. Lillicrap, "LOGAN: Latent optimisation for generative adversarial networks," arXiv preprint arXiv:1912.00953, 2019.
- D. Nie, et al., "Medical image synthesis with context-aware generative adversarial networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.417-425, 2017.
- A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
- C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.
- A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, Vol.1, No.10, p.e3, 2016.
- J. Engel, K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial neural audio synthesis," arXiv preprint arXiv:1902.08710, 2019.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
- J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, "Neural audio synthesis of musical notes with WaveNet autoencoders," in International Conference on Machine Learning, pp.1068-1077, 2017.
- M. Tayyab, I. Ahmad, N. Sun, J. Zhou, and X. Dong, "Application of integrated artificial neural networks based on decomposition methods to predict streamflow at Upper Indus Basin, Pakistan," Atmosphere, Vol.9, No.12, pp.494, 2018. https://doi.org/10.3390/atmos9120494
- D. Fitzgerald, "Harmonic/percussive separation using median filtering," in Proceedings of the International Conference on Digital Audio Effects (DAFx-10), pp.217-220, 2010.
- P. Warden, "Speech commands: A public dataset for single-word speech recognition," Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz, 2017.
- A. Borji, "Pros and cons of GAN evaluation measures," Computer Vision and Image Understanding, Vol.179, pp.41-65, 2019. https://doi.org/10.1016/j.cviu.2018.10.009
- E. Richardson and Y. Weiss, "On GANs and GMMs," in Advances in Neural Information Processing Systems, pp.5847-5858, 2018.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," arXiv preprint arXiv:1606.03498, 2016.
- C. Szegedy, et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, pp.6626-6637, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.