• Title/Abstract/Keyword: Path-Based Reward Function

시각-언어 이동 에이전트를 위한 복합 학습 (Hybrid Learning for Vision-and-Language Navigation Agents)

  • 오선택;김인철
    • 정보처리학회논문지:소프트웨어 및 데이터공학 / Vol. 9, No. 9 / pp.281-290 / 2020
  • Vision-and-language navigation is a composite-intelligence problem that requires both visual understanding and language understanding. This paper proposes a new learning model for vision-and-language navigation agents. The model adopts hybrid learning, combining imitation learning based on demonstration data with reinforcement learning based on action rewards, so the two approaches complement each other: imitation learning's tendency to be biased toward the demonstration data and reinforcement learning's relatively low data efficiency are mutually alleviated. The proposed model also uses a new path-based reward function designed to resolve the problems of conventional goal-based reward functions. Extensive experiments in the Matterport3D simulation environment on the R2R benchmark dataset demonstrate the high performance of the proposed model.
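The abstract does not spell out the form of its path-based reward; as a point of reference, a minimal sketch of how it might contrast with a goal-based reward follows (the progress/deviation decomposition, helper names, and the 0.1 weight are assumptions, not the paper's definition):

```python
import numpy as np

def goal_based_reward(dist_prev: float, dist_curr: float) -> float:
    """Conventional goal-based shaping: reward equals the reduction in
    straight-line distance to the goal, regardless of the route taken."""
    return dist_prev - dist_curr

def path_based_reward(pos, ref_path, progress_prev):
    """Hypothetical path-based alternative: reward progress measured along
    the reference (demonstration) path and penalize deviation from it.
    ref_path is an (N, 3) array of waypoints; the 0.1 weight is assumed."""
    dists = np.linalg.norm(ref_path - pos, axis=1)
    nearest = int(np.argmin(dists))
    progress = nearest / (len(ref_path) - 1)   # fraction of path covered
    deviation = float(dists[nearest])          # distance off the path
    reward = (progress - progress_prev) - 0.1 * deviation
    return reward, progress
```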

Weight Adjustment Scheme Based on Hop Count in Q-routing for Software Defined Networks-enabled Wireless Sensor Networks

  • Godfrey, Daniel;Jang, Jinsoo;Kim, Ki-Il
    • Journal of Information and Communication Convergence Engineering / Vol. 20, No. 1 / pp.22-30 / 2022
  • Reinforcement learning has proven its potential for solving sequential decision-making problems under uncertainty, such as finding paths for routing data packets in wireless sensor networks. With reinforcement learning, computing the optimal path requires careful definition of the so-called reward function, a linear function that aggregates multiple objectives into a single numerical value (the reward) to be maximized. In a typical linear reward function, the objectives to be optimized are combined as a weighted sum with weighting factors that are fixed for all learning agents. This study proposes a reinforcement learning-based routing protocol for wireless sensor networks in which different learning agents prioritize different objectives through the weighting factors of the aggregated reward function. We assign weighting factors to the objectives in a sensor node's reward function according to the node's hop-count distance to the sink node. We expect this approach to make multi-objective reinforcement learning for wireless sensor networks more effective, with a balanced trade-off among competing parameters. Furthermore, we propose an SDN (Software Defined Networks) architecture with multiple controllers for constant network monitoring, allowing learning agents to adapt to changing network conditions. Simulation results show that the proposed scheme enhances wireless sensor network performance under varied conditions, such as node density and traffic intensity, with a good trade-off among competing performance metrics.
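The abstract describes the reward as a weighted sum of objectives whose weights depend on a node's hop-count distance to the sink. A minimal sketch of that idea (the linear weight schedule and the fixed link-quality weight are assumptions; the objective values are taken as normalized to [0, 1]):

```python
def hop_weighted_reward(hop_count: int, max_hops: int,
                        residual_energy: float, delay: float,
                        link_quality: float) -> float:
    """Weighted-sum reward whose weights vary with hop count: nodes far
    from the sink emphasize energy, nodes near the sink emphasize delay.
    All objective values are assumed normalized to [0, 1]."""
    w_energy = hop_count / max_hops    # grows with distance from the sink
    w_delay = 1.0 - w_energy           # shrinks with distance from the sink
    w_link = 0.5                       # fixed weight, an assumption
    # Maximize residual energy and link quality, minimize delay.
    return w_energy * residual_energy - w_delay * delay + w_link * link_quality
```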

Leveraging Reinforcement Learning for Generating Construction Workers' Moving Path: Opportunities and Challenges

  • Kim, Minguk;Kim, Tae Wan
    • 국제학술발표논문집 / The 9th International Conference on Construction Engineering and Project Management / pp.1085-1092 / 2022
  • Travel distance is a parameter commonly used in the objective function of Construction Site Layout Planning (CSLP) automation models. To obtain travel distance, common approaches such as linear distance, shortest-distance algorithms, visibility graphs, and access-road paths concentrate only on identifying the shortest path. However, humans do not necessarily follow the single shortest path; within a reasonable range, they may choose a safer and more comfortable path depending on their situation. Paths generated by these approaches may therefore differ from workers' actual paths, which can reduce the reliability of the optimized construction site layout. To address this problem, this paper adopts reinforcement learning (RL), inspired by concepts from cognitive science and behavioral psychology, to generate realistic paths that mimic workers' wayfinding decisions and behavior on construction sites. To this end, human wayfinding tendencies and the characteristics of the walking environment of construction sites are investigated, and the importance of taking these into account when simulating workers' actual paths is emphasized. Furthermore, a simulation developed by mapping the identified tendencies into the reward design shows that the RL agent behaves like a real construction worker. Based on the research findings, opportunities and challenges are proposed. This study contributes to simulating workers' probable paths with deep RL, which can be used to calculate travel distances in CSLP automation models and thus provide more reliable solutions.
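The abstract maps observed wayfinding tendencies into the reward design. A hedged sketch of what such a shaped step reward could look like (every term and magnitude below is an illustrative assumption, not the paper's design):

```python
def worker_step_reward(reached_goal: bool, collided: bool,
                       near_hazard: bool, on_walkway: bool) -> float:
    """Shaped per-step reward encoding wayfinding tendencies: workers
    accept modest detours for safer, more comfortable routes, so safety
    and comfort terms compete with a small per-step movement cost."""
    r = -0.01               # per-step cost: discourages aimless wandering
    if reached_goal:
        r += 1.0            # terminal bonus for reaching the destination
    if collided:
        r -= 1.0            # hard penalty for entering an obstacle cell
    if near_hazard:
        r -= 0.2            # soft penalty near equipment or drop zones
    if on_walkway:
        r += 0.05           # mild preference for designated walkways
    return r
```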

Obstacle Avoidance for Unmanned Air Vehicles Using Monocular-SLAM with Chain-Based Path Planning in GPS Denied Environments

  • Bharadwaja, Yathirajam;Vaitheeswaran, S.M;Ananda, C.M
    • 항공우주시스템공학회지 / Vol. 14, No. 2 / pp.1-11 / 2020
  • Detecting obstacles and generating a suitable avoidance path in real time is a prime mission requirement for UAVs. In areas close to buildings and people, detecting obstacles in the path and estimating the vehicle's own position (egomotion) in GPS-degraded/denied environments are usually addressed with vision-based Simultaneous Localization and Mapping (SLAM) techniques. This presents possibilities and challenges for feasible path generation under vehicle-dynamics constraints in the configuration space. In this paper, a near real-time feasible path is generated within the ORB-SLAM framework using a chain-based path-planning approach in a force field, with dynamic constraints on path length and minimum turn radius. The chain-based approach generates a set of nodes that move in a force field, permitting rapid path modification in real time as the reward function changes. Unlike the usual approach of generating potentials over the entire search space around the UAV, only a set of connected waypoints, treated as a simulated chain, is updated. The popular ORB-SLAM, well suited to real-time operation, is used to build the map of the environment and estimate the UAV position, and the UAV path is then generated continuously in the shortest time to navigate to the goal position. The principal contributions are (a) a chain-based path-planning approach with built-in obstacle avoidance combined with ORB-SLAM for the first time, (b) path generation with minimal overhead, and (c) a near real-time implementation.
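A rough sketch of one chain-update step under internal (smoothing) and obstacle-repulsion forces, in the spirit of the approach described above; the force model, the gains, and the omitted path-length/turn-radius constraints are all assumptions:

```python
import numpy as np

def update_chain(nodes: np.ndarray, obstacles: np.ndarray,
                 step: float = 0.05, k_rep: float = 0.1) -> np.ndarray:
    """One iteration of a chain-based planner: interior waypoints move
    under a spring-like internal force (keeps the chain short and smooth)
    plus repulsive forces from mapped obstacle points. Endpoints (start
    and goal) stay fixed. nodes: (N, 3); obstacles: (M, 3)."""
    new = nodes.copy()
    for i in range(1, len(nodes) - 1):
        f_int = nodes[i - 1] + nodes[i + 1] - 2.0 * nodes[i]
        diff = nodes[i] - obstacles                      # (M, 3)
        d2 = np.sum(diff ** 2, axis=1, keepdims=True) + 1e-6
        f_rep = np.sum(diff / d2, axis=0)                # decays with distance
        new[i] = nodes[i] + step * (f_int + k_rep * f_rep)
    return new
```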

항만 구조물의 최적 정밀점검 시기 추정을 위한 추계학적 결정모형의 개발 (Development of Stochastic Decision Model for Estimation of Optimal In-depth Inspection Period of Harbor Structures)

  • 이철응
    • 한국해안·해양공학회논문집 / Vol. 28, No. 2 / pp.63-72 / 2016
  • A stochastic decision model, an expected discounted cost model based on a Renewal Reward Process (RRP), was developed to readily determine the optimal in-depth inspection time, a key element of maintenance planning for harbor structures such as the armor units of rubble-mound breakwaters. A mathematical model that overcomes the limitations of previous models was formulated by applying the Periodic Inspection and Maintenance (PIM) and Condition-Based Inspection and Maintenance (CBIM) policies simultaneously. A continuous compounding factor was also introduced to account for the time value of the costs associated with inspection and repair/reinforcement. An analytical solution was first derived under a constant failure-rate condition, and multifaceted sensitivity analyses, including the influence of the distribution function, confirmed that the derived solution subsumes previously published solutions and has broader applicability. Even when a stochastic process is used, the model can correctly represent the nonlinearity of the stochastic cumulative damage of structures such as breakwater armor units. In particular, the coefficients of the model's damage intensity function could be estimated relatively easily using a Monte-Carlo Simulation (MCS)-based sample-path method. Finally, the stochastic decision model was satisfactorily applied to breakwater armor units: the optimal in-depth inspection time that minimizes the expected total cost per unit time could be determined relatively easily according to the behavior of cumulative damage, the serviceability limit level, and the importance of the structure.
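The abstract does not state the model explicitly; its renewal-reward backbone can be sketched as follows (the notation and the cost decomposition are assumptions, not the paper's formulation):

```latex
% Long-run expected cost per unit time for a candidate inspection
% period t_I, by the renewal reward theorem:
\[
  C(t_I) = \frac{\mathbb{E}[\,\text{cost per renewal cycle}\,]}
                {\mathbb{E}[\,\text{cycle length}\,]}
         = \frac{c_I\, e^{-r t_I}\, \Pr(T > t_I)
                 + c_R\, \mathbb{E}\!\left[e^{-rT}\,\mathbf{1}\{T \le t_I\}\right]}
                {\mathbb{E}[\min(T,\, t_I)]},
\]
% where T is the random time at which cumulative damage reaches the
% serviceability limit (triggering condition-based repair, CBIM), t_I is
% the scheduled inspection time (PIM), c_I and c_R are inspection and
% repair costs, and e^{-rt} is the continuous compounding factor. The
% optimal in-depth inspection period is t_I^* = arg min_{t_I} C(t_I).
```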

심층 결정론적 정책 경사법을 이용한 선박 충돌 회피 경로 결정 (Determination of Ship Collision Avoidance Path using Deep Deterministic Policy Gradient Algorithm)

  • 김동함;이성욱;남종호;요시타카 후루카와
    • 대한조선학회논문집 / Vol. 56, No. 1 / pp.58-65 / 2019
  • The stability, reliability, and efficiency of a smart ship are important issues, as interest in autonomous ships has recently been high. An automatic collision avoidance system is an essential function of an autonomous ship: it detects the possibility of collision and automatically takes avoidance actions that consider both economy and safety. To construct an automatic collision avoidance system using reinforcement learning, this work mathematically formulates the sequential decision problem of ship collision avoidance as a Markov Decision Process (MDP). A reinforcement learning environment is constructed from the ship maneuvering equations, and the three key components of the MDP (state, action, and reward) are defined: the state uses parameters of the relationship between own ship and target ship, the action is the perpendicular distance away from the target course, and the reward is defined as a function considering safety and economy. To solve the sequential decision problem, the Deep Deterministic Policy Gradient (DDPG) algorithm, which can handle continuous action spaces and search for an optimal action policy, is utilized. The collision avoidance system is then tested in an assumed 90° crossing encounter situation and yields satisfactory results.
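The abstract defines the reward only as "a function considering safety and economy"; a hedged sketch of such a trade-off (the CPA-based safety term, the off-track economy term, and all weights are assumptions, not the paper's function):

```python
def collision_avoidance_reward(cpa_dist: float, safe_dist: float,
                               offtrack: float, back_on_course: bool) -> float:
    """Sketch of a safety/economy reward for ship collision avoidance:
    penalize passing the target ship closer than a safe distance
    (safety) and deviating from the planned track (economy)."""
    r = 0.0
    if cpa_dist < safe_dist:
        # Safety: penalty grows as the closest point of approach shrinks.
        r -= (safe_dist - cpa_dist) / safe_dist
    r -= 0.1 * abs(offtrack)      # economy: assumed off-track weight
    if back_on_course:
        r += 1.0                  # bonus for regaining the original course
    return r
```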