Individual Audio-Driven Talking Head Generation based on Sequence of Landmark

  • Son Thanh-Hoang Vo (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Quang-Vinh Nguyen (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Hyung-Jeong Yang (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Jieun Shin (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Seungwon Kim (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Soo-Hyung Kim (College of Artificial Intelligence Convergence, Chonnam National University)
  • Published: 2024.10.31

Abstract

Talking head generation is a highly practical task with a wide range of everyday applications, including photography, online communication, education, and medicine. In this paper, the authors propose a novel approach to individual audio-driven talking head generation that leverages a sequence of facial landmarks and employs a diffusion model for image reconstruction. Building on previous landmark-based methods and recent advances in generative models, they introduce an optimized noise-addition technique designed to improve the model's ability to learn temporal information from the input data. The proposed method outperforms recent approaches on metrics such as Landmark Distance (LD) and the Structural Similarity Index Measure (SSIM), demonstrating the effectiveness of diffusion models in this domain. Optimization challenges remain, however; the paper conducts ablation studies to identify these issues and outlines directions for future work.
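
For reference, the two metrics named in the abstract have widely used definitions: Landmark Distance (LD) is typically the mean Euclidean distance between predicted and ground-truth facial landmark coordinates, and SSIM is usually computed per frame and averaged over the generated clip. The sketch below illustrates these standard definitions in Python with NumPy and scikit-image; it is a minimal illustration, not the authors' evaluation code, and the array shapes and the grayscale [0, 255] pixel-range assumptions are ours.

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    def landmark_distance(pred_lm: np.ndarray, gt_lm: np.ndarray) -> float:
        # Mean Euclidean distance between predicted and ground-truth landmarks.
        # Assumed shape for both arrays: (num_frames, num_landmarks, 2).
        assert pred_lm.shape == gt_lm.shape
        per_point = np.linalg.norm(pred_lm - gt_lm, axis=-1)  # (frames, landmarks)
        return float(per_point.mean())

    def mean_ssim(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
        # Frame-wise SSIM averaged over the clip.
        # Assumed shape: (num_frames, H, W) grayscale frames with values in [0, 255].
        scores = [ssim(p, g, data_range=255) for p, g in zip(pred_frames, gt_frames)]
        return float(np.mean(scores))

Under these conventions, lower LD and higher SSIM indicate closer agreement with the ground-truth video.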

Keywords

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023-RS-2023-00256629) grant funded by the Korea government (MSIT), the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2024-00437718) supervised by IITP, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00219107).
