Analysis of Research Trends in Deep Learning-Based Video Captioning

  • 려치 (Department of Culture Technology, Jeonju University) ;
  • 이은주 (Department of Culture Technology, Jeonju University) ;
  • 김영수 (Department of Artificial Intelligence, Jeonju University)
  • Received : 2023.10.05
  • Accepted : 2023.11.28
  • Published : 2024.01.31

Abstract

Video captioning, a significant outcome of the integration of computer vision and natural language processing, has emerged as a key research direction in artificial intelligence. The technology aims at automatic understanding and linguistic expression of video content, enabling computers to transform the visual information in videos into text. This paper analyzes research trends in deep learning-based video captioning and categorizes the approaches into four main groups: CNN-RNN-based Model, RNN-RNN-based Model, Multimodal-based Model, and Transformer-based Model, explaining the concept of each type of model and discussing its features, strengths, and weaknesses. The paper also lists the datasets and performance evaluation methods commonly used in the video captioning field. The datasets cover diverse domains and scenarios, offering extensive resources for training and validating video captioning models. The discussion of evaluation methods covers the major metrics and gives researchers practical guidance for assessing model performance from multiple angles. Finally, future research tasks for video captioning are presented: persistent challenges that must be continuously addressed, such as maintaining temporal consistency and accurately describing dynamic scenes, which add complexity in real-world applications, and new tasks that remain to be studied, such as temporal relationship modeling and multimodal data integration.
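Of the four model families named above, the CNN-RNN encoder-decoder is the earliest and simplest: a pretrained CNN extracts per-frame visual features, and an RNN (typically an LSTM) decodes them into a sentence. The following is a minimal illustrative sketch of that idea, assuming PyTorch; the class name, layer sizes, and mean-pooling over frames are hypothetical choices for illustration, not the architecture of any specific model surveyed in the paper.

```python
# A minimal sketch of a CNN-RNN video captioning model, assuming PyTorch.
# Frame features are assumed to come from a pretrained 2D CNN (e.g. ResNet),
# mean-pooled over time and fed to an LSTM decoder; all dimensions are illustrative.
import torch
import torch.nn as nn

class CnnRnnCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)    # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-word logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len)
        video = self.encoder(frame_feats.mean(dim=1))     # temporal mean pooling
        h0 = video.unsqueeze(0)                           # video vector as initial hidden state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)
        dec_out, _ = self.decoder(emb, (h0, c0))
        return self.out(dec_out)                          # (batch, seq_len, vocab)

# Example: random features for a 2-video batch, 16 frames each
model = CnnRnnCaptioner()
feats = torch.randn(2, 16, 2048)
tokens = torch.randint(0, 10000, (2, 12))
print(model(feats, tokens).shape)  # torch.Size([2, 12, 10000])
```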

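Generated captions are typically scored against multiple human reference sentences with n-gram overlap metrics such as BLEU, METEOR, ROUGE-L, and CIDEr. As a rough illustration of the idea, the snippet below computes a sentence-level BLEU-4 score with NLTK; the example sentences are invented, and published benchmarks normally rely on the official evaluation toolkits rather than this simplified call.

```python
# Rough illustration of caption evaluation with sentence-level BLEU, assuming NLTK.
# Real benchmarks usually use the coco-caption toolkit, which also reports
# METEOR, ROUGE-L, and CIDEr over the full test set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing an onion in the kitchen".split(),
    "someone is cutting vegetables on a board".split(),
]
candidate = "a man is cutting an onion".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")
```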

Acknowledgement

This paper was supported by research funds from Jeonju University in 2022.
