[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KTSDE.2019.8.9.379

LVLN: A Landmark-Based Deep Neural Network Model for Vision-and-Language Navigation

Hwang, Jisu (Department of Computer Science, Kyonggi University)
Kim, Incheol (Department of Computer Science, Kyonggi University)

Publication Information

KIPS Transactions on Software and Data Engineering, v.8, no.9, 2019, pp. 379-390

Abstract

In this paper, we propose a novel deep neural network model for Vision-and-Language Navigation (VLN) named LVLN (Landmark-based VLN). In addition to visual features extracted from input images and linguistic features extracted from natural language instructions, the model uses information about places and landmark objects detected in the images. It also applies a context-based attention mechanism to associate each entity mentioned in the instruction with the corresponding region of interest (ROI) in the image and with the corresponding place and landmark object detected in the image. Moreover, to improve the success rate of arriving at the target goal, the model adopts a progress monitor module that checks whether the agent is actually approaching the goal. Experiments with the Matterport3D simulator and the Room-to-Room (R2R) benchmark dataset demonstrate the high performance of the proposed model.
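The abstract does not specify the form of the context-based attention mechanism. As a rough illustration only, the sketch below shows generic dot-product soft attention, in which a context vector (standing in for an entity mentioned in the instruction) weights a set of candidate feature vectors (standing in for ROI features). The function names, the two-dimensional toy vectors, and the choice of dot-product scoring are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context, features):
    """Generic dot-product soft attention (illustrative, not the LVLN formulation).

    context:  list of floats, e.g. an embedding of an instruction entity
    features: list of equally sized float lists, e.g. one vector per ROI
    Returns the attention weights and the weighted sum of the features.
    """
    scores = [sum(c * f for c, f in zip(context, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, attended


# Hypothetical toy data: one context vector and three candidate ROI vectors.
context = [1.0, 0.0]
roi_features = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = attend(context, roi_features)
```

In this toy case the first ROI vector is the one most aligned with the context, so it receives the largest weight. A trained model would typically learn projection matrices for the context and features rather than use raw dot products.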

Keywords

Vision-and-Language Navigation; Deep Neural Network; Landmark; Attention; Progress Monitor

