[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3837/tiis.2022.02.018

Voice Frequency Synthesis using VAW-GAN based Amplitude Scaling for Emotion Transformation

Kwon, Hye-Jeong (Department of Computer Science, Kyonggi University)
Kim, Min-Jeong (Department of Computer Science, Kyonggi University)
Baek, Ji-Won (Department of Computer Science, Kyonggi University)
Chung, Kyungyong (Division of AI Computer Science and Engineering, Kyonggi University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.16, no.2, 2022, pp. 713-725

Abstract

Artificial intelligence systems mostly show no definite change in emotion, which makes it hard for them to demonstrate empathy when communicating with humans. Modifying the frequency of neutral speech, or adding the frequency of a different emotion to it, makes it possible to develop artificial intelligence that expresses emotion. This study proposes emotion conversion using voice frequency synthesis based on a Generative Adversarial Network (GAN). The proposed method extracts frequencies from the speech data of twenty-four actors and actresses; that is, it extracts the voice features of their different emotions, preserves the linguistic features, and converts only the emotion. It then generates a frequency with a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN) in order to produce prosody while preserving linguistic information, which makes it possible to learn speech features in parallel. Finally, it corrects the frequency by applying amplitude scaling: a logarithmic-scale spectral conversion yields a frequency that takes the characteristics of human hearing into account. Accordingly, the proposed technique converts the emotion of speech so that artificially generated voices can express emotion.
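The abstract does not give the formula used for amplitude scaling, so the following is only a minimal sketch of one way a logarithmic-scale correction could look, not the authors' method. It assumes the spectrum is a list of linear magnitudes, converts it to decibels, shifts it so the peak sits at a chosen level, and converts back. The function names, the `floor` guard, and the `target_peak_db` parameter are all hypothetical.

```python
import math


def to_decibels(magnitudes, floor=1e-10):
    """Convert linear spectral magnitudes to a logarithmic (dB) scale.

    `floor` is an assumed guard that avoids taking log10 of zero.
    """
    return [20.0 * math.log10(max(m, floor)) for m in magnitudes]


def from_decibels(levels_db):
    """Convert dB levels back to linear magnitudes."""
    return [10.0 ** (db / 20.0) for db in levels_db]


def scale_amplitude(magnitudes, target_peak_db=0.0):
    """Shift the spectrum in the log domain so its peak is at target_peak_db.

    Adding a constant offset in dB multiplies every linear magnitude by the
    same factor, so the relative shape of the spectrum is preserved.
    """
    levels = to_decibels(magnitudes)
    offset = target_peak_db - max(levels)
    return from_decibels([db + offset for db in levels])


# Hypothetical four-bin magnitude spectrum, for illustration only.
spectrum = [0.5, 2.0, 1.0, 0.25]
scaled = scale_amplitude(spectrum)
```

Working in decibels reflects the roughly logarithmic way human hearing perceives loudness, which is the motivation the abstract gives for the logarithmic-scale conversion. A constant offset is the simplest possible correction; the paper's actual scaling may differ.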

Keywords

Emotion Transformation; Generative Adversarial Network; Voice Frequency Synthesis; Voice Analysis

Citations & Related Records

Times Cited By KSCI: 2
