http://dx.doi.org/10.7746/jkros.2018.13.1.016

Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network  

Sakurai, Ryuhei (Ritsumeikan University)
Shimba, Taiki (Ritsumeikan University)
Yamazoe, Hirotake (Ritsumeikan University)
Lee, Joo-Ho (Ritsumeikan University)
Publication Information
The Journal of Korea Robotics Society, vol. 13, no. 1, 2018, pp. 16-25
Abstract
A talking head (TH) is an utterance face animation generated from text and voice input. In this paper, we propose a method for generating a TH with facial expression and intonation from speech input alone. Generating a TH from speech can be regarded as a regression problem from the acoustic feature sequence to the facial code sequence, a low-dimensional vector representation that can efficiently encode and decode a face image. This regression was modeled by a bidirectional RNN and trained on the SAVEE database, a database of frontal utterance face animations. The proposed method generates a TH with facial expression and intonation by using acoustic features such as MFCCs, the dynamic (delta) elements of MFCCs, energy, and F0. According to the experiments, a configuration with BLSTM layers as the first and second layers of the bidirectional RNN predicted the facial codes best. For evaluation, a questionnaire survey was conducted with 62 persons who watched TH animations generated by the proposed method and a previous method. As a result, 77% of the respondents answered that the TH generated by the proposed method matches the speech well.
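The dynamic (delta) elements of the MFCCs mentioned in the abstract are conventionally computed as a regression slope over neighboring frames (Furui's dynamic features, reference 3). A minimal sketch of that computation and of assembling a per-frame acoustic feature vector, assuming a frame-by-coefficient MFCC matrix; the window width `K` and the random placeholder data are illustrative, not the paper's exact settings:

```python
import numpy as np

def delta_features(mfcc, K=2):
    """Delta (dynamic) features: regression slope over +/-K neighboring frames.

    mfcc: (T, D) array of frame-level coefficients.
    Edge frames are handled by repeating the first/last frame.
    """
    T, D = mfcc.shape
    # Pad K copies of the first and last frames so every frame has K neighbors.
    padded = np.concatenate([np.repeat(mfcc[:1], K, axis=0),
                             mfcc,
                             np.repeat(mfcc[-1:], K, axis=0)], axis=0)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    delta = np.zeros_like(mfcc)
    for k in range(1, K + 1):
        # padded[K+k : K+k+T] is frame t+k; padded[K-k : K-k+T] is frame t-k.
        delta += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return delta / denom

# Per-frame acoustic feature vector: MFCCs plus their deltas
# (energy and F0 would be appended as extra columns in the same way).
mfcc = np.random.randn(100, 12)                   # 100 frames, 12 coefficients
features = np.hstack([mfcc, delta_features(mfcc)])
print(features.shape)                              # (100, 24)
```

The resulting feature sequence is what the bidirectional RNN regresses onto the facial code sequence, one feature vector per animation frame.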
Keywords
Talking heads; Recurrent neural network; Acoustic features; Facial features
References
1 O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion Recognition by speech signals," 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, pp. 125-128, 2003.
2 Y. Pan, P. Shen, and L. Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101-107, April, 2012.
3 S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 52-59, Feb., 1986.   DOI
4 S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK book, Microsoft Corporation, 1995.
5 S. Imai, T. Kobayashi, K. Tokuda, T. Masuko, K. Koishida, S. Sako, and H. Zen, Speech signal processing toolkit (SPTK), [Online], http://sp-tk.sourceforge.net/, Accessed: Feb. 14, 2018.
6 I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135-164, Nov., 2004.   DOI
7 J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou, "Menpo: a comprehensive platform for parametric image alignment and visual deformable models," 22nd ACM international conference on Multimedia, Orlando, Florida, USA, pp. 679-682, 2014.
8 D.W. Massaro, "Symbiotic value of an embodied agent in language learning," 37th Hawaii International Conference on System Sciences, Big Island, HI, USA, 2004, doi: 10.1109/ HICSS.2004.1265333.   DOI
9 B. Fan, L. Wang, F.K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," International Conference on Acoustics, Speech, and Signal Processing, Brisbane, QLD, Australia, 2015, doi: 10.1109/ICASSP.2015.7178899.   DOI
10 L. Wang, and F.K. Soong, "HMM trajectory-guided sample selection for photo-realistic talking head," Multimedia Tools and Applications, vol. 74, no. 22, pp. 9849-9869, Nov., 2014.   DOI
11 M. Schuster, and K.K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, Nov., 1997.
12 B.D. Lucas, and T. Kanade, "An iterative image registration technique with an application to stereo vision," 1981 DARPA Image Understanding Workshop, pp. 121-130, April 1981.
13 T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, pp. 26-30, 2012.
14 A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks," arXiv:1506.02078, 2015.
15 E. Cosatto and H.P. Graf, "Sample-based synthesis of photo realistic talking heads," Computer Animation 98, Philadelphia, PA, USA, pp. 103-110, 1998.
16 V. Wan, R. Blokland, N. Braunschweiler, L. Chen, B. Kolluru, J. Latorre, R. Maia, B. Stenger, K. Yanagisawa, Y. Stylianou, M. Akamine, M.J.F. Gales, and R. Cipolla, "Photo-Realistic Expressive Text to Talking Head Synthesis," 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2667-2669, 2013.
17 S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," A field guide to dynamical recurrent neural networks, IEEE Press, 2001.
18 A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, May 2009.
19 H. Sanaul and P.J.B. Jackson, "Multimodal Emotion Recognition," W. Wang ed., Machine Audition: Principles, Algorithms and Systems, Hershey, PA: IGI Global, 2011, pp. 398-423, doi: 10.4018/978-1-61520-919-4.ch017.   DOI
20 R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807-813, May, 2010.   DOI
21 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, Vol. 15, pp. 1929-1958, Jun., 2014.
22 W. Han, L. Wang, F. Soong, and B. Yuan, "Improved minimum converted trajectory error training for real-time speech-to-lips conversion," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, doi: 10.1109/ICASSP.2012.6288921.   DOI
23 D. W. Massaro, J. Beskow, M. M. Cohen, C. L. Fry, and T. Rodriguez, "Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks," Auditory-Visual Speech Processing, Santa Cruz, CA, USA, pp. 133-138, 1999.