Individual Audio-Driven Talking Head Generation based on Sequence of Landmark

  • Son Thanh-Hoang Vo (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Quang-Vinh Nguyen (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Hyung-Jeong Yang (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Jieun Shin (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Seungwon Kim (College of Artificial Intelligence Convergence, Chonnam National University) ;
  • Soo-Hyung Kim (College of Artificial Intelligence Convergence, Chonnam National University)
  • Published: 2024.10.31

Abstract

Talking head generation is a highly practical task with a wide range of everyday applications, including photography, online communication, education, and medicine. In this paper, the authors propose a novel approach to individual audio-driven talking head generation that leverages a sequence of facial landmarks and employs a diffusion model for image reconstruction. Building on previous landmark-based methods and recent advances in generative models, they introduce an optimized noise-addition technique designed to improve the model's ability to learn temporal information from the input data. The proposed method outperforms recent approaches on metrics such as Landmark Distance (LD) and the Structural Similarity Index Measure (SSIM), demonstrating the effectiveness of diffusion models in this domain. Optimization challenges remain, however; the paper conducts ablation studies to identify these issues and outlines directions for future work.
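
For reference, the two metrics named in the abstract have widely used definitions: Landmark Distance (LD) is typically the mean Euclidean distance between predicted and ground-truth facial landmark coordinates, and SSIM is usually computed per frame and averaged over the generated clip. The sketch below illustrates these standard definitions in Python with NumPy and scikit-image; it is a minimal illustration, not the authors' evaluation code, and the array shapes and the grayscale [0, 255] pixel-range assumptions are ours.

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    def landmark_distance(pred_lm: np.ndarray, gt_lm: np.ndarray) -> float:
        # Mean Euclidean distance between predicted and ground-truth landmarks.
        # Assumed shape for both arrays: (num_frames, num_landmarks, 2).
        assert pred_lm.shape == gt_lm.shape
        per_point = np.linalg.norm(pred_lm - gt_lm, axis=-1)  # (frames, landmarks)
        return float(per_point.mean())

    def mean_ssim(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
        # Frame-wise SSIM averaged over the clip.
        # Assumed shape: (num_frames, H, W) grayscale frames with values in [0, 255].
        scores = [ssim(p, g, data_range=255) for p, g in zip(pred_frames, gt_frames)]
        return float(np.mean(scores))

Under these conventions, lower LD and higher SSIM indicate closer agreement with the ground-truth video.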

Keywords

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023-RS-2023-00256629) grant funded by the Korea government (MSIT), the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2024-00437718) supervised by IITP, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00219107).
