
Determination of representative emotional style of speech based on k-means algorithm


  • 오상신 (School of Electrical and Electronic Engineering, Yonsei University) ;
  • 엄세연 (School of Electrical and Electronic Engineering, Yonsei University) ;
  • 장인선 (Media Research Division, Electronics and Telecommunications Research Institute) ;
  • 안충현 (Media Research Division, Electronics and Telecommunications Research Institute) ;
  • 강홍구 (School of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2019.07.16
  • Accepted : 2019.09.04
  • Published : 2019.09.30

Abstract

In this paper, we propose a method to effectively determine the representative style embeddings of each emotion class in order to improve a global style token-based end-to-end speech synthesis system. The emotional expressiveness of the conventional approach is limited because it uses only a single style representative per emotion. We overcome this limitation by extracting multiple representatives per emotion with the k-means clustering algorithm. Listening tests show that the proposed method expresses each emotion clearly while distinguishing it from the others.

This paper proposes a method to effectively determine the style vectors of each emotion in order to improve the performance of an end-to-end emotional speech synthesis system that uses Global Style Tokens (GST). Because the conventional method uses only a single representative value to express each emotion, it is severely limited in the richness of its emotional expression. To address this, we propose a method that extracts multiple representative styles using the k-means algorithm. Listening evaluations show that the representative styles of each emotion extracted by the proposed method express emotion better than the conventional method and distinguish emotions from one another clearly.
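The core idea of the abstract can be sketched in code: given a set of GST-based style embeddings for each emotion class, run k-means within each class and keep the cluster centroids as that emotion's multiple representative styles. The following is a minimal illustration, not the paper's implementation; the array shapes, the number of clusters `k`, and the function names are assumptions for the sketch (a plain Lloyd's-algorithm k-means is used for self-containment).

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: return k cluster centroids of the rows of X."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned embeddings
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def representative_styles(embeddings_by_emotion, k=3):
    """For each emotion class, cluster its style embeddings and keep the
    k centroids as that emotion's representative style vectors (instead of
    a single per-emotion representative)."""
    return {emotion: kmeans(np.asarray(X), k)
            for emotion, X in embeddings_by_emotion.items()}
```

At synthesis time, any one of an emotion's centroids could then be fed to the GST-conditioned decoder as its style embedding, giving several distinct renderings of the same emotion.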
