http://dx.doi.org/10.3837/tiis.2020.08.018

Text-driven Speech Animation with Emotion Control  

Chae, Wonseok (Content Validation Research Section, Electronics and Telecommunications Research Institute)
Kim, Yejin (School of Games, Hongik University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS), vol. 14, no. 8, 2020, pp. 3473-3487
Abstract
In this paper, we present a new approach to creating speech animation with emotional expressions from a small set of example models. To generate realistic facial animation, two sets of example models, called key visemes and key expressions, are used for lip-synchronization and facial expressions, respectively. The key visemes represent the lip shapes of phonemes such as vowels and consonants, while the key expressions represent the basic emotions of a face. Our approach utilizes a text-to-speech (TTS) system to create a phonetic transcript for the speech animation. Based on this transcript, a speech animation sequence is synthesized by interpolating the corresponding sequence of key visemes. Given an input emotion parameter vector, the key expressions are blended by scattered data interpolation. During synthesis, an importance-based scheme combines lip-synchronization and facial expressions into a single animation sequence in real time (over 120 Hz). The proposed approach can be applied to diverse types of digital content and applications that use facial animation, achieving high accuracy (over 90%) in speech recognition tests.
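The two synthesis steps described in the abstract (blending key expressions by scattered data interpolation, then combining the result with the viseme shapes by a per-vertex importance mask) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the example key shapes, the 2-D emotion parameterization (in the spirit of Russell's circumplex model), the Gaussian kernel width, and the per-vertex importance mask are all assumptions made for the example.

```python
import numpy as np

# Hypothetical data: each key expression is a flat array of vertex offsets
# from the neutral face, indexed by a 2-D emotion parameter vector
# (e.g., valence/arousal, loosely following Russell's circumplex model).
key_params = np.array([[1.0, 0.5],    # "happy"
                       [-1.0, -0.5],  # "sad"
                       [-0.8, 0.9]])  # "angry"
key_offsets = np.random.default_rng(0).normal(size=(3, 9))  # 3 keys x 9 dofs

def rbf_weights(keys, query, sigma=1.0):
    """Gaussian RBF scattered-data interpolation weights.

    Builds the kernel matrix between the key parameter vectors, then solves
    for weights so that querying exactly at a key reproduces that key.
    """
    d2 = ((keys[:, None, :] - keys[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))       # kernel matrix between keys
    q2 = ((keys - query) ** 2).sum(-1)
    phi_q = np.exp(-q2 / (2.0 * sigma ** 2))     # kernel vector at the query
    return np.linalg.solve(phi, phi_q)           # interpolation weights

def blend_expression(query):
    """Blend the key expressions at an input emotion parameter vector."""
    w = rbf_weights(key_params, np.asarray(query, dtype=float))
    return w @ key_offsets                       # weighted sum of key offsets

def combine(viseme_offsets, expr_offsets, importance):
    """Importance-based combination: a per-vertex mask in [0, 1] that favors
    the viseme offsets around the mouth and the expression offsets elsewhere."""
    return importance * viseme_offsets + (1.0 - importance) * expr_offsets
```

Because the weights are solved against the kernel matrix, the blend interpolates the key expressions exactly at their own parameter vectors and varies smoothly in between, which is the usual motivation for scattered data interpolation over a simple weighted average.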
Keywords
Speech animation; lip-synchronization; emotional expressions; facial expression synthesis; example models
Citations & Related Records
  • Reference
1 L. Wang, H. Chen, S. Li, and H. M. Meng, "Phoneme-level articulatory animation in pronunciation training," Speech Communication, vol. 54, no. 7, pp. 845-856, Sep. 2012.
2 M. Kawai, T. Iwao, D. Mima, A. Maejima, and S. Morishima, "Data-driven speech animation synthesis focusing on realistic inside of the mouth," Journal of Information Processing, vol. 22, no. 2, pp. 401-409, 2014.
3 F. Kuhnke and J. Ostermann, "Visual speech synthesis from 3D mesh sequences driven by combined speech features," in Proc. of IEEE International Conference on Multimedia and Expo (ICME), pp. 1075-1080, 2017.
4 Y. Lee, D. Terzopoulos, and K. Waters, "Realistic Modeling for Facial Animation," in Proc. of ACM SIGGRAPH, pp. 55-62, 1995.
5 E. Sifakis, I. Neverov, and R. Fedkiw, "Automatic determination of facial muscle activations from sparse motion capture marker data," ACM Transactions on Graphics, vol. 24, no. 3, pp. 417-425, 2005.
6 M. Cong, M. Bao, J. L. E, K. S. Bhat, and R. Fedkiw, "Fully automatic generation of anatomical face simulation models," in Proc. of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 175-183, 2015.
7 A.-E. Ichim, P. Kadleček, L. Kavan, and M. Pauly, "Phace: Physics-based Face Modeling and Animation," ACM Transactions on Graphics, vol. 36, no. 4, pp. 1-14, 2017.
8 F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin, "Synthesizing Realistic Facial Expressions from Photographs," in Proc. of ACM SIGGRAPH, pp. 75-84, 1998.
9 B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin, "Making Faces," in Proc. of ACM SIGGRAPH, pp. 55-66, 1998.
10 E. Ju and J. Lee, "Expressive Facial Gestures from Motion Capture Data," Computer Graphics Forum, vol. 27, no. 2, pp. 381-388, 2008.
11 P. Ekman and E. L. Rosenberg, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System, Oxford University Press, 1997.
12 M. Lau, J. Chai, Y.-Q. Xu, and H.-Y. Shum, "Face poser: Interactive modeling of 3D facial expressions using facial priors," ACM Transactions on Graphics, vol. 29, no. 1, pp. 1-17, 2009.
13 V. Barrielle, N. Stoiber, and C. Cagniart, "BlendForces: A Dynamic Framework for Facial Animation," Computer Graphics Forum, vol. 35, no. 2, pp. 341-352, 2016.
14 J. Jia, S. Zhang, F. Meng, Y. Wang, and L. Cai, "Emotional audiovisual speech synthesis based on PAD," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 570-582, 2011.
15 V. Wan, R. Blokland, N. Braunschweiler, L. Chen, B. Kolluru, J. Latorre, R. Maia, B. Stenger, K. Yanagisawa, Y. Stylianou, M. Akamine, M. J. F. Gales, and R. Cipolla, "Photo-Realistic Expressive Text to Talking Head Synthesis," in Proc. of Annual Conference of the International Speech Communication Association, pp. 2667-2669, 2013.
16 A. Stef, K. Perera, H. P. H. Shum, and E. S. L. Ho, "Synthesizing Expressive Facial and Speech Animation by Text-to-IPA Translation with Emotion Control," in Proc. of International Conference on Software, Knowledge, Information Management & Applications, pp. 1-8, 2018.
17 VoiceText. Available online: http://www.voiceware.co.kr (accessed on 1 June 2019).
18 C. G. Fisher, "Confusions among visually perceived consonants," Journal of Speech and Hearing Research, vol. 11, pp. 796-804, 1968.
19 J. A. Russell, "A Circumplex Model of Affect," Journal of Personality and Social Psychology, vol. 39, pp. 1161-1178, 1980.
20 P.-P. Sloan, C. F. Rose, and M. F. Cohen, "Shape by example," in Proc. of Symposium on Interactive 3D Graphics, pp. 135-144, 2001.
21 M. Brand, "Voice Puppetry," in Proc. of ACM SIGGRAPH, pp. 21-28, 1999.
22 H. Pyun, Y. Kim, W. Chae, H. W. Kang, and S. Y. Shin, "An Example-Based Approach for Facial Expression Cloning," in Proc. of Eurographics/SIGGRAPH Symposium on Computer Animation, pp. 167-176, 2003.
23 C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proc. of ACM SIGGRAPH, pp. 353-360, 1997.
24 E. Cosatto and H. Graf, "Photo-Realistic Talking Heads from Image Samples," IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 152-163, 2000.
25 T. Ezzat and T. Poggio, "Visual Speech Synthesis by Morphing Visemes," International Journal of Computer Vision, vol. 38, pp. 45-57, 2000.
26 T. Ezzat, G. Geiger, and T. Poggio, "Trainable Videorealistic Speech Animation," ACM Transactions on Graphics, vol. 21, no. 3, 2002.
27 F. I. Parke, A Parametric Model of Human Faces, Ph.D. Thesis, University of Utah, 1974.
28 A. Pearce, B. Wyvill, G. Wyvill, and D. Hill, "Speech and Expression: A Computer Solution to Face Animation," in Proc. of Graphics Interface, pp. 136-140, 1986.
29 G. A. Kalberer and L. V. Gool, "Lip Animation Based on Observed 3D Speech Dynamics," in Proc. of Computer Animation 2001, vol. 4309, pp. 20-27, 2001.