Korean Lip-Reading: Data Construction and Sentence-Level Lip-Reading

  • Sunyoung Cho (AI & Autonomy Technology Center, Advanced Defense Science & Technology Research Institute, Agency for Defense Development) ;
  • Soosung Yoon (Defense Satellite Systems PMO, Advanced Defense Science & Technology Research Institute, Agency for Defense Development)
  • Received : 2023.07.05
  • Accepted : 2024.01.17
  • Published : 2024.04.05

Abstract

Lip-reading is the task of inferring the speaker's utterance from silent video by learning lip movements. It is very challenging due to the inherent ambiguities of lip movement, such as different characters producing the same lip appearance. Recent advances in deep learning models such as the Transformer and the Temporal Convolutional Network have led to improvements in lip-reading performance. However, most previous work deals with English lip-reading, which is of limited use when applied directly to Korean lip-reading, and moreover, no large-scale Korean lip-reading dataset exists. In this paper, we introduce the first large-scale Korean lip-reading dataset, with more than 120,000 utterances collected from TV broadcasts containing news, documentaries, and dramas. We also present a preprocessing method that uniformly extracts a facial region of interest, and propose a grapheme-unit Transformer-based model for sentence-level Korean lip-reading. We demonstrate that our dataset and model are appropriate for Korean lip-reading through dataset statistics and experimental results.
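To illustrate what grapheme-unit modeling means for Korean, the following is a minimal sketch (not the authors' code; the paper's exact tokenization scheme is not specified here) of decomposing precomposed Hangul syllables into their constituent graphemes (initial consonant, vowel, optional final consonant), which is the kind of token inventory a grapheme-based decoder would predict.

```python
# Minimal sketch: decompose Hangul syllables into grapheme (jamo) tokens.
# Assumption: standard Unicode arithmetic for the precomposed syllable block
# U+AC00..U+D7A3; this is illustrative, not the authors' actual tokenizer.

CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")              # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")          # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "no final"

def to_graphemes(text: str) -> list[str]:
    """Decompose Hangul syllables into grapheme tokens; keep other characters as-is."""
    tokens = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code <= 11171:                  # precomposed Hangul syllable
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            tokens.append(CHOSEONG[lead])
            tokens.append(JUNGSEONG[vowel])
            if JONGSEONG[tail]:
                tokens.append(JONGSEONG[tail])
        else:
            tokens.append(ch)                   # spaces, punctuation, digits, etc.
    return tokens

print(to_graphemes("립리딩"))  # ['ㄹ', 'ㅣ', 'ㅂ', 'ㄹ', 'ㅣ', 'ㄷ', 'ㅣ', 'ㅇ']
```

Predicting this smaller grapheme vocabulary, rather than the thousands of possible whole syllables, is what makes sentence-level Korean decoding tractable for a Transformer trained with a sequence-level objective.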

Keywords

Acknowledgement

This paper is the result of research conducted in 2024 with government funding.

References

  1. T. Afouras, J. S. Chung, A. Senior, O. Vinyals, A. Zisserman, "Deep audio-visual speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 12, pp. 8717-8727, 2022.
  2. S. Petridis, M. Pantic, "Deep complementary bottleneck features for visual speech recognition," ICASSP, pp. 2304-2308, 2016.
  3. M. Wand, J. Koutník, J. Schmidhuber, "Lipreading with long short-term memory," ICASSP, pp. 6115-6119, 2016.
  4. S. O. Kim, K. H. Lee, "Design & implementation of speechreading system using the face feature on the Korean 8 vowels," Korea Society of Computer and Information Winter Conference, pp. 135-140, 2008.
  5. M. A. Lee, "A lip-reading algorithm using optical flow and properties of articulatory phonation," Journal of Korea Multimedia Society, Vol. 21, No. 7, pp. 745-754, 2018. https://doi.org/10.9717/KMMS.2018.21.7.745
  6. J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, S. Zafeiriou, "RetinaFace: Single-shot multi-level face localisation in the wild," CVPR, pp. 5203-5212, 2020.
  7. A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber, "Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks," ICML, pp. 369-376, 2006.
  8. T. Stafylakis, G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," Interspeech, pp. 3652-3656, 2017.
  9. K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," CVPR, pp. 770-778, 2016.
  10. J. S. Chung, A. Zisserman, "Lip reading in the wild," ACCV, pp. 87-103, 2016.
  11. J. S. Chung, A. Senior, O. Vinyals, A. Zisserman, "Lip reading sentences in the wild," CVPR, 2017.
  12. J. S. Chung, A. Zisserman, "Lip reading in profile," BMVC, pp. 155.1-155.11, 2017.
  13. T. Afouras, J. S. Chung, A. Zisserman, "LRS3-TED: A large-scale dataset for visual speech recognition," arXiv preprint arXiv:1809.00496, 2018.
  14. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, "SSD: Single shot multibox detector," ECCV, pp. 21-37, 2016.
  15. J. Yuan, M. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, Vol. 123, No. 5, pp. 3878-3882, 2008.
  16. J. S. Chung, A. Zisserman, "Out of time: automated lip sync in the wild," In Workshop on Multi-view Lip-reading, ACCV, pp. 251-263, 2016.
  17. T. Stafylakis, G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," Interspeech, pp. 3652-3656, 2017.
  18. Y. M. Assael, B. Shillingford, S. Whiteson, N. de Freitas, "LipNet: End-to-end sentence-level lipreading," arXiv preprint arXiv:1611.01599, 2016.
  19. B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett, M. Mulville, B. Coppin, B. Laurie, A. Senior, N. de Freitas, "Large-scale visual speech recognition," Interspeech, pp. 4134-4139, 2019.
  20. X. Zhang, F. Cheng, S. Wang, "Spatio-temporal fusion based convolutional sequence learning for lip reading," ICCV, pp. 713-722, 2019.
  21. A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," Interspeech, pp. 5036-5040, 2020.
  22. P. Ma, S. Petridis, M. Pantic, "End-to-end audio-visual speech recognition with conformers," ICASSP, pp. 7613-7617, 2021.
  23. H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, "Voice activity detection in the wild: a data-driven approach using teacher-student training," IEEE/ACM Trans. on Audio, Speech and Language Processing, Vol. 29, pp. 1542-1555, 2021. https://doi.org/10.1109/TASLP.2021.3073596
  24. M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," Interspeech, pp. 498-502, 2017.
  25. S. S. Yoon, T. Y. Chun, D.-J. Jung, H. S. Song, "A study on data preprocessing for lip-reading of national defense data," KIMST Annual Conference, pp. 367-368, 2022.