Acknowledgement
This paper presents results of the project "Development of a Facial Motion Synthesis Solution Synchronized with English/Korean Speech for Conversational Avatar Development (2021-0-01096)," carried out under the supervision of MediaZen Inc. with funding from the ICT R&D Innovation Voucher Program of the Institute of Information & Communications Technology Planning & Evaluation (IITP), financed by the Korean government (Ministry of Science and ICT) in 2021-2022. This work was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education in 2023 (No. 2022R1A6A1A03052954), and by an IITP grant funded by the Korean government (Ministry of Science and ICT) in 2023 (No. RS-2023-00231158, Integrated Remote-Control Monitoring Platform for Knitted-Fabric Inspection and Predictive Maintenance of Circular Knitting Machines Using Vision Technology).
References
- M. Jang, S. Jung, and J. Noh, "Speech animation synthesis based on a Korean co-articulation model," Journal of the Korea Computer Graphics Society, Vol.26, No.3, pp.49-59, 2020. https://doi.org/10.15701/kcgs.2020.26.3.49
- S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews, "Dynamic units of visual speech," in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Lausanne, Switzerland, pp.275-284, 2012.
- S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews, "A deep learning approach for generalized speech animation," ACM Transactions on Graphics, Vol.36, No.4, pp.1-11, 2017. https://doi.org/10.1145/3072959.3073699
- Y. Zhou, Z. Xu, C. Landreth, E. Kalogerakis, S. Maji, and K. Singh, "VisemeNet: Audio-driven animator-centric speech animation," ACM Transactions on Graphics, Vol.37, No.4, Article No.161, pp.1-10, 2018. https://doi.org/10.1145/3197517.3201292
- Y. Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, "MakeItTalk: Speaker-aware talking-head animation," ACM Transactions on Graphics, Vol.39, No.6, pp.1-15, 2020. https://doi.org/10.1145/3414685.3417774
- H. X. Pham, Y. Wang, and V. Pavlovic, "End-to-end learning for 3D facial animation from speech," in Proceedings of the ACM International Conference on Multimodal Interaction, New York, pp.361-365, 2018.
- D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. J. Black, "Capture, learning, and synthesis of 3D speaking styles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp.10093-10103, 2019.
- A. Nagendran, S. Compton, W. C. Follette, A. Golenchenko, A. Compton, and J. Grizou, "Avatar led interventions in the metaverse reveal that interpersonal effectiveness can be measured, predicted, and improved," Scientific Reports, Vol.12, Iss.1, Article No.21892, 2022.
- Speech Graphics, Clients [Internet], https://www.speech-graphics.com/
- NVIDIA, Omniverse Audio2Face [Internet], https://www.nvidia.com/en-us/omniverse/apps/audio2face/
- NEURAL SYNC, Wav2Lip [Internet], https://www.neuralsyncai.com
- T. Ezzat, G. Geiger, and T. Poggio, "Trainable videorealistic speech animation," ACM Transactions on Graphics, Vol.21, Iss.3, pp.388-398, 2002. https://doi.org/10.1145/566654.566594
- F. Shaw and B. Theobald, "Expressive modulation of neutral visual speech," IEEE MultiMedia, Vol.23, Iss.4, pp.68-78, 2016. https://doi.org/10.1109/MMUL.2016.63
- A. Richard, M. Zollhöfer, Y. Wen, F. de la Torre, and Y. Sheikh, "MeshTalk: 3D face animation from speech using cross-modality disentanglement," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp.1153-1162, 2021.
- L. Xie, L. Wang, and S. Yang, "Visual speech animation," in Handbook of Human Motion, Springer, Cham, pp.2115-2144, 2018.
- F. I. Parke, "A parametric model of human faces," PhD thesis, University of Utah, 1974.
- Face the FACS, Facial Expressions in Art, Science, and Technology [Internet], https://melindaozel.com
- S. W. Kim, H. Lee, K. H. Choi, and S. Y. Park, "A talking head system for Korean text," International Journal of Electrical and Computer Engineering, Vol.3, No.2, pp.167-170, 2009.
- T. E. Kim and Y. S. Park, "Facial animation generation by Korean text input," The Journal of The Korea Institute of Electronic Communication Sciences, Vol.4, No.2, pp.116-122, 2009.
- T. Kim, "A study on Korean lip-sync for animation characters - based on lip-sync technique in English-speaking animations," Cartoon and Animation Studies, No.13, pp.97-114, 2008.
- H. H. Oh, I. C. Kim, D. S. Kim, and S. I. Chien, "A study on spatio-temporal features for Korean vowel lipreading," The Journal of the Acoustical Society of Korea, Vol.21, No.1, pp.19-26, 2002.
- H. J. Hyung, B. K. Ahn, D. Choi, D. Lee, and D. W. Lee, "Evaluation of a Korean lip-sync system for an android robot," in Proceedings of the IEEE International Conference on Ubiquitous Robots and Ambient Intelligence, Xi'an, China, pp.78-82, 2016.
- I. H. Jung and E. Kim, "Natural 3D lip-synch animation based on Korean phonemic data," Journal of Digital Contents Society, Vol.9, No.2, pp.331-339, 2008.
- Y.-C. Wang and R. T.-H. Tsai, "Rule-based Korean grapheme to phoneme conversion using sound patterns," in Proceedings of the Pacific Asia Conference on Language, Information and Computation, Vol.2, pp.843-850, 2009.
- D. Povey et al., "The Kaldi speech recognition toolkit," in Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hawaii, US, pp.1-4, 2011.
- S. Lim, J. Goo, and H. Kim, "Visual analysis of attention-based end-to-end speech recognition," Phonetics and Speech Sciences, Vol.11, No.1, pp.41-49, 2019. https://doi.org/10.13064/KSSS.2019.11.1.041
- M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proceedings of Interspeech, Stockholm, Sweden, pp.498-502, 2017.
- Apple Inc., ARFaceAnchor.BlendShapeLocation [Internet], https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
- R. D. Kent and F. D. Minifie, "Coarticulation in recent speech production models," Journal of Phonetics, Vol.5, No.2, pp.115-133, 1977. https://doi.org/10.1016/S0095-4470(19)31123-4
- P. Edwards, C. Landreth, E. Fiume, and K. Singh, "JALI: An animator-centric viseme model for expressive lip synchronization," ACM Transactions on Graphics, Vol.35, No.4, pp.1-11, 2016. https://doi.org/10.1145/2897824.2925984
- Blender Online Community, Blender - a 3D modeling and rendering package [Internet], http://www.blender.org
- B. Fan, L. Wang, F. K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Australia, pp.4884-4888, 2015.
- Hugging Face, Facebook Models [Internet], https://huggingface.co/facebook
- Apple App Store, Face Cap - Motion Capture [Internet], https://apps.apple.com/us/app/face-cap-motion-capture/id1373155478
- OpenSLR, Zeroth-Korean [Internet], http://www.openslr.org/40/