http://dx.doi.org/10.5351/KJAS.2021.34.2.191

Using similarity based image caption to aid visual question answering  

Kang, Joonseo (Department of Applied Statistics, Chung-Ang University)
Lim, Changwon (Department of Applied Statistics, Chung-Ang University)
Publication Information
The Korean Journal of Applied Statistics / v.34, no.2, 2021, pp. 191-204
Abstract
Visual Question Answering (VQA) and image captioning are tasks that require understanding both the visual features of images and the linguistic features of text. Co-attention, which connects images and text, may therefore be key to both tasks. In this paper, we propose a model that improves VQA performance by using image captions generated with a standard transformer model pretrained on the MSCOCO dataset. Because captions unrelated to the question can interfere with answering, only captions with high similarity to the question are selected and used. In addition, since stopwords in a caption contribute little to answering, the experiments were conducted after removing stopwords. Experiments on the VQA-v2 dataset compare the proposed model with the deep modular co-attention network (MCAN) model, which achieved strong performance through co-attention between images and text. The proposed model outperformed the MCAN model.
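The abstract describes two preprocessing ideas: removing stopwords from generated captions and keeping only the captions most similar to the question. The sketch below illustrates that idea only and is not the authors' implementation; the use of NLTK stopwords follows reference 18, while the word-overlap (Jaccard) similarity and the top_k parameter are illustrative assumptions, since the paper's exact similarity measure is not given here.

```python
# Minimal sketch of similarity-based caption selection with stopword removal.
# Assumes the NLTK stopword corpus has been downloaded: nltk.download('stopwords')
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def similarity(question, caption):
    """Word-overlap (Jaccard) similarity between question and caption content words."""
    q, c = content_words(question), content_words(caption)
    return len(q & c) / len(q | c) if (q | c) else 0.0

def select_captions(question, captions, top_k=2):
    """Keep only the top_k generated captions most similar to the question."""
    ranked = sorted(captions, key=lambda cap: similarity(question, cap), reverse=True)
    return ranked[:top_k]

# Example usage with hypothetical captions for one image
question = "What color is the dog on the sofa?"
captions = [
    "a brown dog lying on a sofa",
    "a kitchen with a table and chairs",
    "a dog sleeping next to a person",
]
print(select_captions(question, captions, top_k=2))
```

In practice the selected captions would then be fed, together with the question and image features, into the co-attention model; an embedding-based similarity (e.g., with GloVe vectors, reference 12) could replace the simple word-overlap score used here.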
Keywords
visual question answering; multimodal data; co-attention; image captioning; text similarity;
References
1 Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, and Zhang L (2018). Bottom-up and top-down attention for image captioning and visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6077-6086.
2 Li Q, Tao Q, Joty S, Cai J, and Luo J (2018). VQA-E: Explaining, elaborating, and enhancing your answers for visual questions, arXiv preprint arXiv:1803.07464.
3 Lu J, Yang J, Batra D, and Parikh D (2017). Hierarchical question-image co-attention for visual question answering, arXiv preprint arXiv:1606.00061.
4 Mnih V, Heess N, Graves A, and Kavukcuoglu K (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS), 2204-2212.
5 Wu J, Hu Z, and Mooney R (2019). Generating question relevant captions to aid visual question answering, arXiv preprint arXiv:1906.00513.
6 Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, and Bengio Y (2015). Show, attend and tell: Neural image caption generation with visual attention, arXiv preprint arXiv:1502.03044.
7 Yu Z, Yu J, Cui Y, Tao D, and Tian Q (2019). Deep modular co-attention networks for visual question answering, arXiv preprint arXiv:1906.10770.
8 Yu Z, Yu J, Xiang C, Fan J, and Tao D (2017). Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. In Proceedings of the IEEE, 26, 2275-2290.
9 Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick L, and Parikh D (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2425-2433.
10 Kim JH, Jun J, and Zhang BT (2018). Bilinear attention networks. In Advances in Neural Information Processing Systems, 31, 1564-1574.
11 Ba JL, Kiros JR, and Hinton GE (2016). Layer Normalization, arXiv preprint arXiv:1607.06450.
12 Pennington J, Socher R, and Manning CD (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
13 Chorowski JK, Bahdanau D, Serdyuk D, Cho K, and Bengio Y (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (NIPS), 577-585.
14 Herdade S, Kappeler A, Boakye K, and Soares J (2019). Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 11137-11147.
15 Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA, Bernstein MS, and Li FF (2016). Visual Genome: connecting language and vision using crowdsourced dense image annotations, arXiv preprint arXiv:1602.07332.
16 Teney D, Anderson P, He X, and Hengel A (2017). Tips and tricks for visual question answering: Learnings from the 2017 challenge, arXiv preprint arXiv:1708.02711.
17 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, and Polosukhin I (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 6000-6010.
18 Loper E and Bird S (2002). NLTK: The natural language toolkit, ETMTNLP '02: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 1, 63-70.