Emotional-Controllable Talking Face Generation on Real-Time System

Van-Thien Phan; Hyung-Jeong Yang; Seung-Won Kim; Ji-Eun Shin; Soo-Hyung Kim

doi:10.3745/PKIPS.y2024m10a.523

Annual Conference of KIPS (한국정보처리학회:학술대회논문집)

2024.10a / Pages 523-526 / 2024 / 2005-0011 (pISSN) / 2671-7298 (eISSN)

Korea Information Processing Society (한국정보처리학회)

Van-Thien Phan (Dept. of Artificial Intelligence, Chonnam National University)
Hyung-Jeong Yang (Dept. of Artificial Intelligence, Chonnam National University)
Seung-Won Kim (Dept. of Artificial Intelligence, Chonnam National University)
Ji-Eun Shin (Dept. of Psychology, Chonnam National University)
Soo-Hyung Kim (Dept. of Artificial Intelligence, Chonnam National University)

Published : 2024.10.31

https://doi.org/10.3745/PKIPS.y2024m10a.523

Abstract

Recent progress in audio-driven talking face generation has focused on more realistic and emotionally expressive lip movements, improving the quality of virtual avatars and animated characters for applications in entertainment, education, healthcare, and other fields. Despite these advances, generating natural, emotionally nuanced lip synchronization efficiently and accurately remains challenging. To address this, we introduce a novel audio-driven lip-sync method that offers precise control over emotional expression and outperforms current techniques. Our method uses a Conditional Deep Variational Autoencoder to produce lifelike lip movements that align with the audio input while adjusting dynamically to different emotional states. Experimental results on the Crema-D dataset [1] show significant improvements in emotional accuracy and in the overall quality of the generated facial animations and video sequences.
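
The abstract does not state the training objective. As a rough point of reference only, a conditional VAE is typically trained by maximizing an evidence lower bound of the following form. The symbols are assumptions chosen for illustration, not the authors' notation: $x$ denotes the target face frames, $z$ the latent variable, and $c = (a, e)$ a condition that combines audio features $a$ with an emotion label $e$.

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
- D_{\mathrm{KL}}\big(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\big)
```

In a deep (hierarchical) VAE, such as the Very Deep VAE of Child listed in the references, $z$ is split into groups $z_1, \dots, z_L$ and the KL term becomes a sum over those groups. Whether the paper uses exactly this form is not stated on this page.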

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00219107). This work was also supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development grant (IITP-2023-RS-2023-00256629) funded by the Korea government (MSIT). This research was further supported by the MSIT (Ministry of Science and ICT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2023-RS-2022-00156287) supervised by the IITP.

References

[1] Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma. Crema-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377-390, 2014.
[2] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with GANs. CoRR, abs/1906.06337, 2019. URL http://arxiv.org/abs/1906.06337.
[3] Sefik Emre Eskimez, You Zhang, and Zhiyao Duan. Speech driven talking face generation from a single image and an emotion condition. IEEE Transactions on Multimedia, pages 1-1, 2021. doi: 10.1109/TMM.2021.3099900.
[4] Ian Magnusson, Aruna Sankaranarayanan, and Andrew Lippman. Invertable frowns: Video-to-video facial emotion translation. In Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection, pages 25-33, 2021.
[5] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.
[6] K. R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C. V. Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484-492, 2020.
[7] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[8] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694-711. Springer, 2016.
[9] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
