Development of a Lipsync Algorithm Based on an Audio-visual Corpus

  • 김진영 (Department of Electronic Engineering, College of Engineering, Chonnam National University);
  • 하영민 (Interdisciplinary Program in Information and Communications, College of Engineering, Chonnam National University);
  • 이화숙 (Division of Art and Design, Kwangju Women's University)
  • Published: 2001.04.01

Abstract

A corpus-based lip sync algorithm for synthesizing natural face animation is proposed in this paper. To obtain the lip parameters, marks were attached to the speaker's face, and their positions were extracted with image processing methods. The spoken utterances were labeled with HTK, and prosodic information (duration, pitch, and intensity) was analyzed. An audio-visual corpus was then constructed by combining the speech and image information. The basic unit of our approach is the syllable. Based on this audio-visual corpus, lip information, represented by the marks' positions, is synthesized: the best syllable units are selected from the audio-visual corpus, and the visual information of the selected units is concatenated. Obtaining the best units involves two processes. The first selects the N-best candidates for each syllable; the second selects the smoothest unit sequence by means of the Viterbi decoding algorithm. For these processes, two distance measures between syllable units are proposed: a phonetic environment distance measure and a prosody distance measure. Computer simulation results showed that the proposed algorithm performs well. In particular, they showed that pitch and intensity information is as important as duration information in lip sync.
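To make the two-stage selection concrete, here is a minimal sketch of the two proposed distance measures. The `SyllableUnit` layout, function names, and equal default weights are illustrative assumptions, not details taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class SyllableUnit:
    """One CVC syllable entry in the audio-visual corpus (hypothetical layout)."""
    phones: tuple        # e.g. ('k', 'a', 'm'): onset, nucleus, coda
    left_ctx: str        # phone preceding the syllable in the recording
    right_ctx: str       # phone following the syllable in the recording
    duration: float      # seconds
    pitch: float         # mean F0 in Hz
    intensity: float     # mean energy in dB
    lip_marks: list      # one tuple of tracked mark coordinates per video frame

def phonetic_env_distance(target, unit):
    """Phonetic environment distance: penalize a corpus unit whose
    left/right phone context differs from the context required by the text."""
    cost = 0.0
    if unit.left_ctx != target.left_ctx:
        cost += 1.0
    if unit.right_ctx != target.right_ctx:
        cost += 1.0
    return cost

def prosody_distance(target, unit, w_dur=1.0, w_pitch=1.0, w_int=1.0):
    """Prosody distance over duration, pitch, and intensity.
    Equal default weights are an assumption; the abstract only reports
    that pitch and intensity matter as much as duration."""
    return (w_dur * abs(target.duration - unit.duration)
            + w_pitch * abs(target.pitch - unit.pitch)
            + w_int * abs(target.intensity - unit.intensity))
```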

This paper proposes a corpus-based lip sync algorithm for natural face synthesis. To develop the algorithm, an audio-visual corpus of a female announcer was constructed. In building the corpus, stickers were attached to the speaker's face to extract the lip parameters, and their positions were obtained by image processing techniques. The speech was then labeled with HTK (Hidden Markov Model Toolkit) to obtain the prosodic information of duration, intensity, and pitch. Consonant-vowel-consonant syllables serve as the basic unit of lip sync, so the constructed audio-visual corpus consists of syllables carrying lip information together with phonetic and prosodic information. At synthesis time, a sequence of syllables is generated from the input text, N suitable candidates for each syllable are selected from the corpus, and the optimal sequence is found by Viterbi search. For this purpose, phonetic-environment and prosody distance functions were defined. Computer simulations confirmed that the proposed algorithm performs well; in particular, not only duration information but also intensity and pitch information proved useful for lip sync.
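And a matching sketch of the selection itself, using the hypothetical distance functions above: stage 1 keeps the N best candidates per syllable, and stage 2 runs Viterbi decoding over the candidate lattice. The lip-mark join cost is an assumption, since the abstracts do not specify the smoothness measure used in the paper:

```python
import heapq

def select_candidates(corpus, targets, n_best=5):
    """Stage 1: for each target syllable, keep the N corpus units with the
    lowest target cost (phonetic environment + prosody distance)."""
    lattice = []
    for tgt in targets:
        pool = corpus[tgt.phones]      # corpus units with the same CVC phones
        scored = [(phonetic_env_distance(tgt, u) + prosody_distance(tgt, u), u)
                  for u in pool]
        lattice.append([u for _, u in
                        heapq.nsmallest(n_best, scored, key=lambda s: s[0])])
    return lattice

def join_cost(prev, unit):
    """Smoothness penalty at the joint: Euclidean gap between the last
    lip-mark frame of the previous unit and the first frame of the next."""
    a, b = prev.lip_marks[-1], unit.lip_marks[0]
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def viterbi_select(lattice):
    """Stage 2: dynamic programming over the candidate lattice to find
    the unit sequence with the lowest total join cost."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j of syllable i
    best = [[(0.0, -1) for _ in lattice[0]]]
    for i in range(1, len(lattice)):
        col = []
        for unit in lattice[i]:
            col.append(min((best[i - 1][k][0] + join_cost(prev, unit), k)
                           for k, prev in enumerate(lattice[i - 1])))
        best.append(col)
    # trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda jj: best[-1][jj][0])
    path = []
    for i in range(len(lattice) - 1, 0, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    path.append(lattice[0][j])
    return list(reversed(path))
```

With S syllables and N candidates each, the search evaluates O(S·N²) joins, which is what makes exhaustive smoothness optimization practical here.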

References

  1. W. H. Sumby; I. Pollack, "Visual contribution to speech intelligibility in noise," Journal of the Acoustical Society of America, vol. 26.
  2. Q. Summerfield; A. MacLeod; M. McGrath; M. Brooke, in Handbook of Research on Face Processing.
  3. T. Ezzat; T. Poggio, "Visual Speech Synthesis by Morphing Visemes," MIT AI Memo No. 1658 / CBCL Memo No. 173.
  4. F. Parke, "Parameterized models for facial animation," IEEE Computer Graphics and Applications.
  5. K. Waters, "A Muscle Model for Animating 3D Facial Expression," Proceedings of SIGGRAPH '87.
  6. Y. Lee; D. Terzopoulos; K. Waters, "Realistic Modeling for Facial Animation," Proc. SIGGRAPH '95.
  7. R. Qian; I. Sezan; K. Matthews, "A robust real-time face tracking algorithm," Proceedings of the International Conference on Image Processing.
  8. A. A. Montgomery; B. E. Walden; R. A. Prosek, "Effects of consonantal context on vowel lipreading," Journal of Speech & Hearing Research, vol. 30.
  9. V. A. Kozhevnikov; L. A. Chistovich, Rech': Artikulyatsiya i Vospriyatiye (trans.: Speech: Articulation and Perception), Joint Publications Research Service, vol. 30.
  10. M. M. Cohen; D. W. Massaro, in Models and Techniques in Computer Animation.
  11. S. Young; J. Odell; D. Ollason; V. Valtchev; P. Woodland, The HTK Book.
  12. A. Hallgren; B. Lyberg, "Visual Speech Synthesis with Concatenative Speech," Proceedings of Auditory-Visual Speech Processing.