Three-Dimensional Convolutional Vision Transformer for Sign Language Translation

  • Horyeol Seong (Dept. of Computer and Information Science, Korea University)
  • Hyeonjoong Cho (Dept. of Computer Convergence Software, Korea University)
  • Received : 2023.11.22
  • Accepted : 2024.02.09
  • Published : 2024.03.31

Abstract

In the Republic of Korea, people with hearing impairments form the second-largest group within the registered disability community, after those with physical disabilities. Despite this demographic significance, research on sign language translation technology remains limited for several reasons, including the small market size and the lack of adequately annotated datasets. Nevertheless, a few researchers continue to improve the performance of sign language translation by adopting recent advances in deep learning, notably the transformer architecture, as transformer-based models have demonstrated strong performance in tasks such as action recognition and video classification. This study focuses on enhancing the recognition component of sign language translation with 3D-CvT, a model that combines a transformer with a 3D-CNN. Through experimental evaluations on the PHOENIX-Weather-2014T dataset [1], we show that the proposed model achieves translation performance comparable to existing models while requiring fewer floating point operations (FLOPs).
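
The abstract names the recognition model 3D-CvT but does not describe its architecture in detail. As a rough sketch of the general pattern it implies (a 3D convolution that tokenizes a video clip into spatio-temporal tokens, followed by a standard transformer encoder), the PyTorch code below may help; every class name, kernel size, and hyperparameter is an illustrative assumption, not the authors' implementation.

    # Illustrative sketch only: the paper's actual 3D-CvT design is not given
    # in this abstract, so all names, shapes, and hyperparameters below are
    # assumptions chosen for demonstration.
    import torch
    import torch.nn as nn

    class Conv3dTokenizer(nn.Module):
        """Turns a video clip into tokens with a 3D convolution, so each token
        carries local spatio-temporal context before self-attention is applied."""
        def __init__(self, in_ch=3, dim=256, kernel=(3, 7, 7), stride=(2, 4, 4)):
            super().__init__()
            self.proj = nn.Conv3d(in_ch, dim, kernel_size=kernel, stride=stride,
                                  padding=tuple(k // 2 for k in kernel))

        def forward(self, x):                    # x: (B, C, T, H, W)
            x = self.proj(x)                     # (B, D, T', H', W')
            return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D)

    class Toy3DCvT(nn.Module):
        """3D-CNN tokenizer followed by a standard transformer encoder."""
        def __init__(self, dim=256, depth=4, heads=8):
            super().__init__()
            self.tokenizer = Conv3dTokenizer(dim=dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, clip):
            return self.encoder(self.tokenizer(clip))

    clip = torch.randn(1, 3, 16, 112, 112)       # one 16-frame RGB clip
    print(Toy3DCvT()(clip).shape)                # torch.Size([1, 6272, 256])

Tokenizing with a strided 3D convolution instead of flat 2D patches is one common way to shrink the token count (and hence the attention FLOPs) while injecting local motion cues, which is consistent with the efficiency claim in the abstract.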


Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education in 2023 (No. NRF-2021R1F1A1049202).

References

  1. N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 
  2. N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, "Sign language transformers: Joint end-to-end sign language recognition and translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 
  3. K. Yin and J. Read, "Better sign language translation with STMC-transformer," arXiv preprint arXiv:2004.00588, 2020.
  4. H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, "Improving sign language translation with monolingual data by sign back-translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 
  5. Y. Chen, F. Wei, X. Sun, Z. Wu, and S. Lin, "A simple multi-modality transfer learning baseline for sign language translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 
  6. Y. Chen, R. Zuo, F. Wei, Y. Wu, S. Liu, and B. Mak, "Two-stream network for sign language recognition and translation," Advances in Neural Information Processing Systems, Vol.35, pp.17043-17056, 2022. 
  7. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020. 
  8. H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan et al., "CvT: Introducing convolutions to vision transformers," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  9. A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," Proceedings of the 23rd International Conference on Machine Learning, 2006. 
  10. S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," Proceedings of the European Conference on Computer Vision (ECCV), 2018. 
  11. W. Kay et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
  12. D. Li, C. R. Opazo, X. Yu, and H. Li, "Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020. 
  13. Y. Liu et al., "Multilingual denoising pre-training for neural machine translation," Transactions of the Association for Computational Linguistics, Vol.8, pp.726-742, 2020.  https://doi.org/10.1162/tacl_a_00343
  14. K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: A method for automatic evaluation of machine translation," Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
  15. Y. Wang et al., "InternVideo: General video foundation models via generative and discriminative learning," arXiv preprint arXiv:2212.03191, 2022.
  16. A. J. Piergiovanni, W. Kuo, and A. Angelova, "Rethinking video ViTs: Sparse video tubes for joint image and video learning," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  17. G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," Proceedings of the International Conference on Machine Learning (ICML), 2021.
  18. A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, "ViViT: A video vision transformer," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  19. M. Lewis et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
  20. J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  21. J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009.