KG_VCR: A Visual Commonsense Reasoning Model Using Knowledge Graph

  • Received: 2019.12.20
  • Accepted: 2020.02.17
  • Published: 2020.03.31

Abstract

Unlike existing Visual Question Answering (VQA) problems, the new Visual Commonsense Reasoning (VCR) problems require deeper commonsense reasoning to answer questions, such as recognizing the specific relationship between two objects in an image and presenting the rationale for an answer. In this paper, we propose a novel deep neural network model, KG_VCR, for VCR problems. In addition to making use of the visual relations and contextual information between objects extracted from the input data (images, natural language questions, and response lists), KG_VCR also utilizes commonsense knowledge embeddings obtained from ConceptNet, an external knowledge base. Specifically, the proposed model employs a Graph Convolutional Neural Network (GCN) module to obtain a commonsense knowledge embedding from the knowledge graph retrieved from ConceptNet. Through a series of experiments on the VCR benchmark dataset, we show that the proposed KG_VCR model outperforms both the state-of-the-art (SOTA) VQA model and the R2C VCR model.

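The abstract describes the knowledge branch of KG_VCR only at a high level: a knowledge graph is retrieved from ConceptNet for the question and response words, embedded with a GCN module, and combined with the visual and textual features. The following is a minimal PyTorch sketch of that idea, not the authors' released implementation; the class names (GCNLayer, KnowledgeGCN), the feature dimensions, and the mean-pooling step are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch, assuming PyTorch, of the knowledge branch described in the
# abstract: a retrieved ConceptNet subgraph is embedded with a two-layer GCN
# and pooled into a single commonsense knowledge vector. Names and dimensions
# here are illustrative assumptions, not details from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) (Kipf & Welling)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops, then symmetrically normalize the adjacency matrix.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return F.relu(self.linear(norm_adj @ node_feats))


class KnowledgeGCN(nn.Module):
    """Embeds a retrieved ConceptNet subgraph into one knowledge vector."""

    def __init__(self, node_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = self.gcn1(node_feats, adj)
        h = self.gcn2(h, adj)
        # Mean-pool the node embeddings; the pooled vector would then be fused
        # with the visual-relation and question/response features downstream.
        return h.mean(dim=0)


# Toy usage: 5 concept nodes with 300-d initial embeddings (e.g., word vectors)
# and a small undirected adjacency matrix standing in for the retrieved subgraph.
nodes = torch.randn(5, 300)
adj = torch.tensor([[0., 1., 0., 0., 1.],
                    [1., 0., 1., 0., 0.],
                    [0., 1., 0., 1., 0.],
                    [0., 0., 1., 0., 1.],
                    [1., 0., 0., 1., 0.]])
knowledge_vec = KnowledgeGCN()(nodes, adj)   # shape: (512,)
```

How KG_VCR actually fuses the pooled knowledge vector with the image and language streams is not specified in the abstract, so the pooling and downstream fusion hinted at in the comments should be read as placeholders.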
