Hybrid Learning for Vision-and-Language Navigation Agents

  • Received : 2020.06.29
  • Accepted : 2020.07.17
  • Published : 2020.09.30

Abstract

The Vision-and-Language Navigation (VLN) task is a complex intelligence problem that requires both visual and language comprehension. In this paper, we propose a new learning model for VLN agents. The model adopts hybrid learning, which combines imitation learning based on demonstration data with reinforcement learning based on action rewards. The two components are complementary: imitation learning can become biased toward the demonstration data, while reinforcement learning suffers from relatively low data efficiency, and combining them mitigates both problems. In addition, the proposed model uses a novel path-based reward function designed to overcome the shortcomings of existing goal-based reward functions. We demonstrate the strong performance of the proposed model through a range of experiments using the Matterport3D simulation environment and the R2R benchmark dataset.

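The two ideas in the abstract can be sketched concretely: a hybrid objective that mixes an imitation-learning loss with a reinforcement-learning loss, and a path-based reward that scores the agent's whole trajectory against the reference path rather than only its final distance to the goal. The sketch below is a minimal, hypothetical illustration only; it assumes a dynamic-time-warping (nDTW-style) path similarity and a simple weighted sum of losses, and the function names, the mixing weight `lam`, and the `success_threshold` value are illustrative assumptions, not the paper's actual formulation.

```python
import math


def dtw_distance(path, ref):
    """Classic dynamic-time-warping cost between two 2D point sequences.

    D[i][j] holds the minimal cumulative cost of aligning the first i
    points of `path` with the first j points of `ref`.
    """
    n, m = len(path), len(ref)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(path[i - 1], ref[j - 1])
            # Extend the cheapest of the three neighboring alignments.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def path_reward(path, ref, success_threshold=3.0):
    """Path-based reward in (0, 1]: 1.0 when the agent retraces the
    reference path exactly, decaying smoothly as it deviates anywhere
    along the way (unlike a goal-based reward, which only looks at the
    endpoint)."""
    return math.exp(-dtw_distance(path, ref) / (len(ref) * success_threshold))


def hybrid_loss(il_loss, rl_loss, lam=0.5):
    """Hybrid objective: weighted sum of the imitation-learning loss
    (e.g. cross-entropy against demonstrated actions) and the
    reinforcement-learning loss (e.g. a policy-gradient term weighted
    by the path-based reward)."""
    return lam * il_loss + (1.0 - lam) * rl_loss


# Toy example: an exact retrace earns the maximum reward, a detour less.
reference = [(0, 0), (1, 0), (2, 0), (3, 0)]
retrace = [(0, 0), (1, 0), (2, 0), (3, 0)]
detour = [(0, 0), (0, 1), (0, 2), (0, 3)]
assert path_reward(retrace, reference) == 1.0
assert path_reward(detour, reference) < path_reward(retrace, reference)
```

Note that a path-based reward like this gives dense feedback at every step of training, whereas a goal-based reward can rate a trajectory highly even when the agent ignores the instruction and reaches the goal by an unrelated route.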
