http://dx.doi.org/10.3745/KTSDE.2021.10.1.9

Improving Fidelity of Synthesized Voices Generated by Using GANs  

Back, Moon-Ki (Division of Computer Convergence, Chungnam National University)
Yoon, Seung-Won (Division of Computer Convergence, Chungnam National University)
Lee, Sang-Baek (Division of Computer Convergence, Chungnam National University)
Lee, Kyu-Chul (Division of Computer Convergence, Chungnam National University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.10, no.1, 2021, pp.9-18
Abstract
Although Generative Adversarial Networks (GANs) have gained great popularity in computer vision and related fields, GAN-based generation of raw audio signals has received comparatively little attention. Unlike an image, an audio signal is a one-dimensional sequence of discrete samples, which makes it difficult to learn with the CNN architectures widely used in image generation. To overcome this difficulty, GAN researchers proposed applying time-frequency representations of audio to existing image-generating GANs. Following this strategy, we propose an improved method for increasing the fidelity of audio signals synthesized by GANs. Our method is demonstrated on a public speech dataset and evaluated with the Fréchet Inception Distance (FID). It achieves an FID of 10.504, compared with 11.973 for the existing state-of-the-art method (a lower FID indicates better fidelity).
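
The abstract rests on two technical ideas: treating a waveform's time-frequency representation as a 2-D "image" that an image-style GAN can learn, and scoring fidelity with the Fréchet Inception Distance, FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}), computed from the means and covariances of real and generated feature sets. The sketch below illustrates both; the library choices (librosa, NumPy, SciPy) and parameter values are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of (1) a time-frequency representation a GAN can consume and
# (2) the Frechet distance used as FID. Libraries and parameters are assumed
# for illustration, not taken from the paper.
import numpy as np
import librosa
from scipy import linalg

def waveform_to_log_spectrogram(wav, n_fft=512, hop_length=128):
    """STFT magnitude on a log scale: a 2-D 'image' of the 1-D signal."""
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length)
    return np.log1p(np.abs(stft))

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between two feature matrices of shape (n_samples, n_dims)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; keep the real part, since
    # numerical error can introduce a tiny imaginary component.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Usage: a lower distance means the generated feature distribution is closer
# to the real one, which is why a lower FID indicates better fidelity.
wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)
spec = waveform_to_log_spectrogram(wav)
print(spec.shape)  # (n_fft // 2 + 1, n_frames)
```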
Keywords
Generative Adversarial Networks; Fréchet Inception Distance; Fidelity Improvement; Synthesized Voice