
Determination of representative emotional style of speech based on k-means algorithm


  • 오상신 (School of Electrical and Electronic Engineering, Yonsei University) ;
  • 엄세연 (School of Electrical and Electronic Engineering, Yonsei University) ;
  • 장인선 (Media Research Division, Electronics and Telecommunications Research Institute) ;
  • 안충현 (Media Research Division, Electronics and Telecommunications Research Institute) ;
  • 강홍구 (School of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2019.07.16
  • Accepted : 2019.09.04
  • Published : 2019.09.30

Abstract

In this paper, we propose a method to effectively determine the representative style embeddings of each emotion class in order to improve a global style token-based end-to-end speech synthesis system. The emotional expressiveness of the conventional approach is limited because it uses only a single style representative per emotion. We overcome this limitation by extracting multiple representatives per emotion with the k-means clustering algorithm. Listening tests show that the proposed method expresses each emotion clearly while distinguishing it from the others.

This paper proposes a method to effectively determine the style vectors of each emotion in order to improve the performance of an end-to-end emotional speech synthesis system that uses Global Style Tokens (GST). Because the conventional method uses only a single representative value to express each emotion, it is severely limited in the richness of its emotional expression. To address this, we propose a method that extracts multiple representative styles using the k-means algorithm. Listening evaluations show that the representative styles of each emotion extracted by the proposed method express emotion better than the conventional method and distinguish emotions from one another clearly.
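The core idea of the abstract can be sketched in code: given a set of GST-based style embeddings for each emotion class, run k-means within each class and keep the cluster centroids as that emotion's multiple representative styles. The following is a minimal illustration, not the paper's implementation; the array shapes, the number of clusters `k`, and the function names are assumptions for the sketch (a plain Lloyd's-algorithm k-means is used for self-containment).

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: return k cluster centroids of the rows of X."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned embeddings
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def representative_styles(embeddings_by_emotion, k=3):
    """For each emotion class, cluster its style embeddings and keep the
    k centroids as that emotion's representative style vectors (instead of
    a single per-emotion representative)."""
    return {emotion: kmeans(np.asarray(X), k)
            for emotion, X in embeddings_by_emotion.items()}
```

At synthesis time, any one of an emotion's centroids could then be fed to the GST-conditioned decoder as its style embedding, giving several distinct renderings of the same emotion.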
