http://dx.doi.org/10.14372/IEMEK.2022.17.6.347

A Study on the Alternative Method of Video Characteristics Using Captioning in Text-Video Retrieval Model  

Dong-hun Lee (Kyungpook National University)
Chan Hur (Kyungpook National University)
Hyeyoung Park (Kyungpook National University)
Sang-hyo Park (Kyungpook National University)
Abstract
In this paper, we propose a text-video retrieval model that replaces visual video features with captions. In general, existing embedding-based models comprise both joint-embedding-space construction and a CNN-based video encoding step, which require heavy computation during training as well as inference. To overcome this problem, we introduce a video-captioning module that replaces the visual features of a video with the captions it generates. Specifically, we adopt a caption generator that converts candidate videos into captions at inference time, enabling direct comparison between the query text and candidate videos without a joint embedding space. In experiments on two benchmark datasets, MSR-VTT and VATEX, the proposed model successfully reduces computation and inference time by skipping both visual processing and joint-embedding-space construction.
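The retrieval flow described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the captions here are hard-coded strings standing in for the output of the paper's video-captioning module, and a simple bag-of-words cosine similarity stands in for whatever text-matching function the actual system uses. All names (`video_captions`, `bow_cosine`, `retrieve`) are hypothetical.

```python
from collections import Counter
import math

# Stand-in for the paper's video-captioning module: in the actual system,
# these captions would be generated from candidate videos at inference time.
video_captions = {
    "video_a": "a man is playing a guitar on stage",
    "video_b": "a chef is cooking pasta in a kitchen",
    "video_c": "a dog is running through a park",
}

def bow_cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, captions: dict) -> list:
    """Rank candidate videos by comparing the text query directly
    against their generated captions -- no joint embedding space."""
    scored = [(vid, bow_cosine(query, cap)) for vid, cap in captions.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

ranking = retrieve("a man playing guitar", video_captions)
print(ranking[0][0])  # video_a ranks first
```

The point of the design is visible in `retrieve`: because both sides of the comparison are text, the CNN-based video encoder and the learned joint embedding space are no longer needed at query time; only caption generation (done once per candidate video) and a text-to-text similarity remain.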
Keywords
Multimodal Deep Learning; Video-Captioning; Text-Video Retrieval