LVLN : A Landmark-Based Deep Neural Network Model for Vision-and-Language Navigation

Hwang, Jisu; Kim, Incheol

doi:10.3745/KTSDE.2019.8.9.379

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 8, Issue 9 / pp. 379-390 / 2019 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)

Korean title: LVLN: 시각-언어 이동을 위한 랜드마크 기반의 심층 신경망 모델

Jisu Hwang (Department of Computer Science, Kyonggi University);
Incheol Kim (Department of Computer Science, Kyonggi University)

Received : 2019.07.05
Accepted : 2019.07.30
Published : 2019.09.30

https://doi.org/10.3745/KTSDE.2019.8.9.379

Abstract

In this paper, we propose LVLN (Landmark-based VLN), a novel deep neural network model for Vision-and-Language Navigation (VLN). In addition to visual features extracted from the input images and linguistic features extracted from the natural language instructions, the model uses information about places and landmark objects detected in the images. The model also applies a context-based attention mechanism to associate each entity mentioned in the instruction with the corresponding region of interest (ROI) in the image and with the corresponding place and landmark object detected in the image. Moreover, to improve the success rate of reaching the goal, the model adopts a progress monitor module that checks whether the agent is making actual progress toward the goal. Through experiments with the Matterport3D simulator and the Room-to-Room (R2R) benchmark dataset, we demonstrate the high performance of the proposed model.

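This paper proposes LVLN, a new deep neural network model for the vision-and-language navigation problem. In addition to the linguistic features of the natural language instruction and the visual features of the whole input image, the LVLN model detects in the input image the key places and landmark objects mentioned in the instruction and uses this information as well. The model also uses a context-based attention mechanism to strengthen the association among each entity in the natural language instruction, each region of interest in the image, and the individual objects and places detected in the image. In addition, to improve the agent's success rate in reaching the goal, the LVLN model includes a progress monitor module that can check actual progress toward the goal. Through various experiments using the Matterport3D simulator and the Room-to-Room (R2R) benchmark dataset, we confirmed the high performance of the proposed LVLN model.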
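The abstract names two mechanisms without giving their form: a context-based attention step and a progress monitor. The following is only a minimal sketch of how such components are commonly built, not the authors' implementation. The dot-product scoring, the logistic progress unit, and every name here (`attend`, `progress_estimate`, the toy vectors and weights) are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(context: Sequence[float],
           features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Soft attention (assumed form): score each feature vector by its
    dot product with the context vector, normalize the scores with softmax,
    and return the weights plus the weighted sum of the features."""
    weights = softmax([dot(context, f) for f in features])
    dim = len(features[0])
    attended = [sum(w * f[i] for w, f in zip(weights, features))
                for i in range(dim)]
    return weights, attended


def progress_estimate(attended_parts: Sequence[Sequence[float]],
                      w: Sequence[float], b: float) -> float:
    """Hypothetical progress monitor: a logistic unit over the concatenated
    attended features, giving a value in (0, 1)."""
    x = [v for part in attended_parts for v in part]
    return 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))


# Toy example: the context vector stands in for the agent's current state,
# and the two feature vectors stand in for two detected landmark objects.
context = [1.0, 0.0]
landmarks = [[2.0, 0.0], [0.0, 2.0]]
weights, attended = attend(context, landmarks)
progress = progress_estimate([attended], w=[0.5, 0.5], b=0.0)
```

In this toy run the first landmark receives the larger attention weight because it aligns with the context vector. In a trained model the scoring function and the progress-unit parameters would be learned, not fixed by hand as they are here.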
