A Study on the Alternative Method of Video Characteristics Using Captioning in Text-Video Retrieval Model

  • Received : 2022.10.20
  • Accepted : 2022.11.26
  • Published : 2022.12.31

Abstract

In this paper, we propose a text-video retrieval method that replaces visual video features with captions. In general, existing embedding-based models require both the construction of a joint embedding space and CNN-based video encoding, which demand a large amount of computation during training as well as inference. To overcome this problem, we introduce a video-captioning module and replace the visual features of each video with the captions it generates. Specifically, we adopt a caption generator that converts candidate videos into captions at inference time, enabling direct comparison between the query text and the candidate videos without a joint embedding space. Experiments on two benchmark datasets, MSR-VTT and VATEX, show that the proposed model successfully reduces the amount of computation and the inference time by skipping visual processing and joint-embedding-space construction.
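To make the pipeline concrete, the following is a minimal sketch of the caption-based inference described above, not the authors' exact implementation. `generate_caption` is a hypothetical stand-in for the paper's video-captioning module, and the Sentence-Transformers encoder is used only as an illustrative text encoder; the paper's text model may differ.

```python
# Minimal, illustrative sketch of caption-based text-video retrieval
# (assumptions noted below; not the authors' exact pipeline).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative text encoder


def generate_caption(video_path: str) -> str:
    """Hypothetical video-captioning module: one video -> one caption."""
    raise NotImplementedError("replace with a trained captioning model")


def index_candidates(video_paths):
    # Offline step: caption each candidate video once and embed the captions.
    # After this step, no CNN video encoding is needed at query time.
    captions = [generate_caption(p) for p in video_paths]
    embeddings = encoder.encode(captions, convert_to_tensor=True)
    return captions, embeddings


def retrieve(query: str, caption_embeddings, top_k: int = 5):
    # Online step: the query and the captions live in the same text space,
    # so ranking reduces to cosine similarity, with no joint embedding space.
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, caption_embeddings)[0]
    return scores.topk(min(top_k, scores.numel()))
```

Under this setup, the expensive captioning pass is amortized over the candidate set once, offline, which is where the reported savings in computation and inference time would come from.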

Acknowledgments

This work was carried out as a basic research project supported by the National Research Foundation of Korea (NRF) with funding from the government (Ministry of Education) in 2020 (No. 2020R1I1A3072227).
