Acknowledgement
This paper presents results of the project "Development of a Facial Motion Synthesis Solution Synchronized with English/Korean Speech for Conversational Avatar Development (2021-0-01096)," carried out under the supervision of MediaZen Inc. with funding from the ICT R&D Innovation Voucher Program of the Institute of Information & Communications Technology Planning & Evaluation (IITP), financed by the Korean government (Ministry of Science and ICT) in 2021-2022. This work was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education in 2023 (No. 2022R1A6A1A03052954), and by an IITP grant funded by the Korean government (Ministry of Science and ICT) in 2023 (No. RS-2023-00231158, Integrated Remote-Control Monitoring Platform for Knitted-Fabric Inspection and Predictive Maintenance of Circular Knitting Machines Using Vision Technology).
References
- M. Jang, S. Jung, and J. Noh, "Speech animation synthesis based on a Korean co-articulation model," Journal of the Korea Computer Graphics Society, Vol.26, No.3, pp.49-59, 2020. https://doi.org/10.15701/kcgs.2020.26.3.49
- S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews, "Dynamic units of visual speech," in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Lausanne, Switzerland, pp.275-284, 2012.
- S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews, "A deep learning approach for generalized speech animation," ACM Transactions on Graphics, Vol.36, No.4, pp.1-11, 2017. https://doi.org/10.1145/3072959.3073699
- Y. Zhou, Z. Xu, C. Landreth, E. Kalogerakis, S. Maji, and K. Singh, "VisemeNet: Audio-driven animator-centric speech animation," ACM Transactions on Graphics, Vol.37, No.4, Article No.161, pp.1-10, 2018. https://doi.org/10.1145/3197517.3201292
- Y. Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, "MakeItTalk: Speaker-aware talking-head animation," ACM Transactions on Graphics, Vol.39, No.6, pp.1-15, 2020. https://doi.org/10.1145/3414685.3417774
- H. X. Pham, Y. Wang, and V. Pavlovic, "End-to-end learning for 3D facial animation from speech," in Proceedings of the ACM International Conference on Multimodal Interaction, New York, pp.361-365, 2018.
- D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. J. Black, "Capture, learning, and synthesis of 3D speaking styles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp.10093-10103, 2019.
- A. Nagendran, S. Compton, W. C. Follette, A. Golenchenko, A. Compton, and J. Grizou, "Avatar led interventions in the metaverse reveal that interpersonal effectiveness can be measured, predicted, and improved," Scientific Reports, Vol.12, Iss.1, Article No.21892, 2022.
- Speech Graphics, Clients [Internet], https://www.speech-graphics.com/
- NVIDIA, Omniverse Audio2Face [Internet], https://www.nvidia.com/en-us/omniverse/apps/audio2face/
- NEURAL SYNC, Wav2Lip [Internet], https://www.neuralsyncai.com
- T. Ezzat, G. Geiger, and T. Poggio, "Trainable videorealistic speech animation," ACM Transactions on Graphics, Vol.21, Iss.3, pp.388-398, 2002. https://doi.org/10.1145/566654.566594
- F. Shaw and B. Theobald, "Expressive modulation of neutral visual speech," IEEE MultiMedia, Vol.23, Iss.4, pp.68-78, 2016. https://doi.org/10.1109/MMUL.2016.63
- A. Richard, M. Zollhöfer, Y. Wen, F. de la Torre, and Y. Sheikh, "MeshTalk: 3D face animation from speech using cross-modality disentanglement," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp.1153-1162, 2021.
- L. Xie, L. Wang, and S. Yang, "Visual speech animation," in Handbook of Human Motion, Springer, Cham, pp.2115-2144, 2018.
- F. I. Parke, "A parametric model of human faces," PhD thesis, University of Utah, 1974.
- Face the FACS, Facial Expressions in Art, Science, and Technology [Internet], https://melindaozel.com
- S. W. Kim, H. Lee, K. H. Choi, and S. Y. Park, "A talking head system for Korean text," International Journal of Electrical and Computer Engineering, Vol.3, No.2, pp.167-170, 2009.
- T. E. Kim and Y. S. Park, "Facial animation generation by Korean text input," The Journal of The Korea Institute of Electronic Communication Sciences, Vol.4, No.2, pp.116-122, 2009.
- T. Kim, "A study on Korean lip-sync for animation characters - based on lip-sync technique in English-speaking animations," Cartoon and Animation Studies, No.13, pp.97-114, 2008.
- H. H. Oh, I. C. Kim, D. S. Kim, and S. I. Chien, "A study on spatio-temporal features for Korean vowel lipreading," The Journal of the Acoustical Society of Korea, Vol.21, No.1, pp.19-26, 2002.
- H. J. Hyung, B. K. Ahn, D. Choi, D. Lee, and D. W. Lee, "Evaluation of a Korean lip-sync system for an android robot," in Proceedings of the IEEE International Conference on Ubiquitous Robots and Ambient Intelligence, Xi'an, China, pp.78-82, 2016.
- I. H. Jung and E. Kim, "Natural 3D lip-synch animation based on Korean phonemic data," Journal of Digital Contents Society, Vol.9, No.2, pp.331-339, 2008.
- Y.-C. Wang and R. T.-H. Tsai, "Rule-based Korean grapheme to phoneme conversion using sound patterns," in Proceedings of the Pacific Asia Conference on Language, Information and Computation, Vol.2, pp.843-850, 2009.
- D. Povey et al., "The Kaldi speech recognition toolkit," in Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hawaii, US, pp.1-4, 2011.
- S. Lim, J. Goo, and H. Kim, "Visual analysis of attention-based end-to-end speech recognition," Phonetics and Speech Sciences, Vol.11, No.1, pp.41-49, 2019. https://doi.org/10.13064/KSSS.2019.11.1.041
- M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proceedings of Interspeech, Stockholm, Sweden, pp.498-502, 2017.
- Apple Inc., ARFaceAnchor.BlendShapeLocation [Internet], https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
- R. D. Kent and F. D. Minifie, "Coarticulation in recent speech production models," Journal of Phonetics, Vol.5, No.2, pp.115-133, 1977. https://doi.org/10.1016/S0095-4470(19)31123-4
- P. Edwards, C. Landreth, E. Fiume, and K. Singh, "JALI: An animator-centric viseme model for expressive lip synchronization," ACM Transactions on Graphics, Vol.35, No.4, pp.1-11, 2016. https://doi.org/10.1145/2897824.2925984
- Blender Online Community, Blender - a 3D modeling and rendering package [Internet], http://www.blender.org
- B. Fan, L. Wang, F. K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Australia, pp.4884-4888, 2015.
- Hugging Face, Facebook Models [Internet], https://huggingface.co/facebook
- Apple App Store, Face Cap - Motion Capture [Internet], https://apps.apple.com/us/app/face-cap-motion-capture/id1373155478
- OpenSLR, Zeroth-Korean [Internet], http://www.openslr.org/40/