• Title/Summary/Keyword: Vision-and-Language Navigation

Search Result 7, Processing Time 0.039 seconds

Hybrid Learning for Vision-and-Language Navigation Agents (시각-언어 이동 에이전트를 위한 복합 학습)

  • Oh, Suntaek;Kim, Incheol
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.9
    • /
    • pp.281-290
    • /
    • 2020
  • The Vision-and-Language Navigation(VLN) task is a complex intelligence problem that requires both visual and language comprehension skills. In this paper, we propose a new learning model for visual-language navigation agents. The model adopts a hybrid learning that combines imitation learning based on demo data and reinforcement learning based on action reward. Therefore, this model can meet both problems of imitation learning that can be biased to the demo data and reinforcement learning with relatively low data efficiency. In addition, the proposed model uses a novel path-based reward function designed to solve the problem of existing goal-based reward functions. In this paper, we demonstrate the high performance of the proposed model through various experiments using both Matterport3D simulation environment and R2R benchmark dataset.

LVLN : A Landmark-Based Deep Neural Network Model for Vision-and-Language Navigation (LVLN: 시각-언어 이동을 위한 랜드마크 기반의 심층 신경망 모델)

  • Hwang, Jisu;Kim, Incheol
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.8 no.9
    • /
    • pp.379-390
    • /
    • 2019
  • In this paper, we propose a novel deep neural network model for Vision-and-Language Navigation (VLN) named LVLN (Landmark-based VLN). In addition to both visual features extracted from input images and linguistic features extracted from the natural language instructions, this model makes use of information about places and landmark objects detected from images. The model also applies a context-based attention mechanism in order to associate each entity mentioned in the instruction, the corresponding region of interest (ROI) in the image, and the corresponding place and landmark object detected from the image with each other. Moreover, in order to improve the success rate of arriving the target goal, the model adopts a progress monitor module for checking substantial approach to the target goal. Conducting experiments with the Matterport3D simulator and the Room-to-Room (R2R) benchmark dataset, we demonstrate high performance of the proposed model.

Path finding via VRML and VISION overlay for Autonomous Robotic (로봇의 위치보정을 통한 경로계획)

  • Sohn, Eun-Ho;Park, Jong-Ho;Kim, Young-Chul;Chong, Kil-To
    • Proceedings of the KIEE Conference
    • /
    • 2006.10c
    • /
    • pp.527-529
    • /
    • 2006
  • In this paper, we find a robot's path using a Virtual Reality Modeling Language and overlay vision. For correct robot's path we describe a method for localizing a mobile robot in its working environment using a vision system and VRML. The robt identifies landmarks in the environment, using image processing and neural network pattern matching techniques, and then its performs self-positioning with a vision system based on a well-known localization algorithm. After the self-positioning procedure, the 2-D scene of the vision is overlaid with the VRML scene. This paper describes how to realize the self-positioning, and shows the overlap between the 2-D and VRML scenes. The method successfully defines a robot's path.

  • PDF

Utilizing AI Foundation Models for Language-Driven Zero-Shot Object Navigation Tasks (언어-기반 제로-샷 물체 목표 탐색 이동 작업들을 위한 인공지능 기저 모델들의 활용)

  • Jeong-Hyun Choi;Ho-Jun Baek;Chan-Sol Park;Incheol Kim
    • The Journal of Korea Robotics Society
    • /
    • v.19 no.3
    • /
    • pp.293-310
    • /
    • 2024
  • In this paper, we propose an agent model for Language-Driven Zero-Shot Object Navigation (L-ZSON) tasks, which takes in a freeform language description of an unseen target object and navigates to find out the target object in an inexperienced environment. In general, an L-ZSON agent should able to visually ground the target object by understanding the freeform language description of it and recognizing the corresponding visual object in camera images. Moreover, the L-ZSON agent should be also able to build a rich spatial context map over the unknown environment and decide efficient exploration actions based on the map until the target object is present in the field of view. To address these challenging issues, we proposes AML (Agent Model for L-ZSON), a novel L-ZSON agent model to make effective use of AI foundation models such as Large Language Model (LLM) and Vision-Language model (VLM). In order to tackle the visual grounding issue of the target object description, our agent model employs GLEE, a VLM pretrained for locating and identifying arbitrary objects in images and videos in the open world scenario. To meet the exploration policy issue, the proposed agent model leverages the commonsense knowledge of LLM to make sequential navigational decisions. By conducting various quantitative and qualitative experiments with RoboTHOR, the 3D simulation platform and PASTURE, the L-ZSON benchmark dataset, we show the superior performance of the proposed agent model.

A Path tracking algorithm and a VRML image overlay method (VRML과 영상오버레이를 이용한 로봇의 경로추적)

  • Sohn, Eun-Ho;Zhang, Yuanliang;Kim, Young-Chul;Chong, Kil-To
    • Proceedings of the IEEK Conference
    • /
    • 2006.06a
    • /
    • pp.907-908
    • /
    • 2006
  • We describe a method for localizing a mobile robot in its working environment using a vision system and Virtual Reality Modeling Language (VRML). The robot identifies landmarks in the environment, using image processing and neural network pattern matching techniques, and then its performs self-positioning with a vision system based on a well-known localization algorithm. After the self-positioning procedure, the 2-D scene of the vision is overlaid with the VRML scene. This paper describes how to realize the self-positioning, and shows the overlap between the 2-D and VRML scenes. The method successfully defines a robot's path.

  • PDF

Implementation of Path Finding Method using 3D Mapping for Autonomous Robotic (3차원 공간 맵핑을 통한 로봇의 경로 구현)

  • Son, Eun-Ho;Kim, Young-Chul;Chong, Kil-To
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.14 no.2
    • /
    • pp.168-177
    • /
    • 2008
  • Path finding is a key element in the navigation of a mobile robot. To find a path, robot should know their position exactly, since the position error exposes a robot to many dangerous conditions. It could make a robot move to a wrong direction so that it may have damage by collision by the surrounding obstacles. We propose a method obtaining an accurate robot position. The localization of a mobile robot in its working environment performs by using a vision system and Virtual Reality Modeling Language(VRML). The robot identifies landmarks located in the environment. An image processing and neural network pattern matching techniques have been applied to find location of the robot. After the self-positioning procedure, the 2-D scene of the vision is overlaid onto a VRML scene. This paper describes how to realize the self-positioning, and shows the overlay between the 2-D and VRML scenes. The suggested method defines a robot's path successfully. An experiment using the suggested algorithm apply to a mobile robot has been performed and the result shows a good path tracking.

Image Caption Generation using Recurrent Neural Network (Recurrent Neural Network를 이용한 이미지 캡션 생성)

  • Lee, Changki
    • Journal of KIISE
    • /
    • v.43 no.8
    • /
    • pp.878-882
    • /
    • 2016
  • Automatic generation of captions for an image is a very difficult task, due to the necessity of computer vision and natural language processing technologies. However, this task has many important applications, such as early childhood education, image retrieval, and navigation for blind. In this paper, we describe a Recurrent Neural Network (RNN) model for generating image captions, which takes image features extracted from a Convolutional Neural Network (CNN). We demonstrate that our models produce state of the art results in image caption generation experiments on the Flickr 8K, Flickr 30K, and MS COCO datasets.