http://dx.doi.org/10.3837/tiis.2022.07.016

Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval  

Liu, Zhi (North China University of Technology)
Cai, Jincen (North China University of Technology)
Zhang, Mengmeng (North China University of Technology)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.16, no.7, 2022, pp. 2407-2424
Abstract
Recently, the Transformer has made great progress in video retrieval tasks owing to its strong representation capability. In the Transformer structure, the cascaded self-attention modules capture long-distance feature dependencies well, but local feature details tend to deteriorate. In addition, increasing the depth of the structure is likely to introduce a learning bias into the learned features. In this paper, an improved Transformer structure named TransDCS (Transformer with Dynamic Convolution and Shortcut) is proposed. A Multi-head Conv-Self-Attention module is introduced to model local dependencies and improve the efficiency of local feature extraction. Meanwhile, an augmented-shortcuts module based on a dual identity matrix is applied to strengthen the propagation of input features and mitigate the learning bias. The proposed model is tested on the MSRVTT, LSMDC and ActivityNet benchmarks, and it surpasses all previous solutions for the video-text retrieval task. For example, on the LSMDC benchmark, gains of about 2.3% in MdR and 6.1% in MnR are obtained over recently proposed multimodal-based methods.
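The abstract names two architectural ideas without giving implementation details. The PyTorch sketch below is purely illustrative and is not the authors' TransDCS code: a plain depth-wise convolution stands in for the paper's dynamic convolution, the extra identity term is one simple reading of "augmented shortcuts based on a dual identity matrix," and all layer sizes, the kernel size, and the branch-fusion scheme are assumptions.

```python
# Illustrative sketch only (not the authors' released implementation).
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise 1D convolution over the sequence axis models the local
        # detail that cascaded global attention tends to wash out. A fixed
        # depth-wise kernel is used here in place of dynamic convolution.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size,
                                    padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)       # global dependencies
        conv_out = self.local_conv(x.transpose(1, 2)).transpose(1, 2)  # local
        # Standard residual plus one additional identity shortcut: the input
        # is propagated twice, which is one simple way to strengthen feature
        # conduction through a deep stack, as the abstract describes.
        return self.norm(attn_out + conv_out + x + x)

tokens = torch.randn(2, 16, 512)               # e.g., 16 video-frame features
print(ConvSelfAttentionBlock()(tokens).shape)  # torch.Size([2, 16, 512])
```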
Keywords
video representation; cross-modal retrieval; multi-modal; local descriptors; Transformer