• Title/Abstract/Keywords: Reward Policy

Search results: 129 items; processing time: 0.024 s

Generating Cooperative Behavior by Multi-Agent Profit Sharing on the Soccer Game

  • Miyazaki, Kazuteru;Terada, Takashi;Kobayashi, Hiroaki
    • 한국지능시스템학회:학술대회논문집
    • /
    • 한국퍼지및지능시스템학회 2003년도 ISIS 2003
    • /
    • pp.166-169
    • /
    • 2003
  • Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment using rewards and penalties as clues. Q-learning [8], a representative reinforcement learning system, treats a reward and a penalty at the same time, which raises the problem of how to decide appropriate reward and penalty values. The Penalty Avoiding Rational Policy Making algorithm (PARP) [4] and Penalty Avoiding Profit Sharing (PAPS) [2] are known as reinforcement learning systems that treat a reward and a penalty independently. Although PAPS is a descendant of PARP, both PARP and PAPS tend to learn a locally optimal policy. To overcome this, in this paper we propose the Multi Best method (MB), which is PAPS combined with the multi-start method [5]. MB selects the best policy among several policies learned by PAPS agents. By applying PS, PAPS, and MB to a soccer game environment based on SoccerBots [9], we show that MB is the best solution for the soccer game environment.
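The selection step described in this abstract (train several learners independently, then keep the best one) can be sketched generically. This is only an illustration of the multi-start idea, not the authors' MB implementation: `multi_best`, `toy_train`, and `toy_evaluate` are hypothetical names, and a seeded hill climber on a two-peak function stands in for a PAPS agent that may get stuck in a local optimum.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")


def multi_best(train: Callable[[int], Policy],
               evaluate: Callable[[Policy], float],
               seeds: List[int]) -> Tuple[Policy, float]:
    """Train one policy per seed and return the highest-scoring one."""
    candidates = [train(seed) for seed in seeds]
    scores = [evaluate(p) for p in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins (assumptions for illustration only): a "policy" is one number.
def toy_evaluate(x: float) -> float:
    # Two peaks: a local one at x = -1 (height 1), a global one at x = 2 (height 3).
    return max(1.0 - (x + 1.0) ** 2, 3.0 - (x - 2.0) ** 2)


def toy_train(seed: int) -> float:
    # Greedy hill climbing from a random start; it can stall on the local peak.
    rng = random.Random(seed)
    x = rng.uniform(-3.0, 4.0)
    for _ in range(200):
        candidate = x + rng.uniform(-0.1, 0.1)
        if toy_evaluate(candidate) > toy_evaluate(x):
            x = candidate
    return x


if __name__ == "__main__":
    policy, score = multi_best(toy_train, toy_evaluate, seeds=list(range(8)))
    print(f"best policy {policy:.3f} with score {score:.3f}")
```

Because each run starts from a different seed, a single run may settle on the lower peak, while the best of several runs is by construction at least as good as any one of them.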


Optimal Control of Two-Hop Routing in DTNs with Time-Varying Selfish Behavior

  • Wu, Yahui;Deng, Su;Huang, Hongbin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 6, No. 9
    • /
    • pp.2202-2217
    • /
    • 2012
  • The transmission opportunities between nodes in Delay Tolerant Networks (DTNs) are uncertain, and routing algorithms in DTNs often need nodes to serve as relays that carry and forward messages for others. Due to selfishness, nodes may ask the source to pay a certain reward, and the reward may vary with time. Moreover, the reward that the source obtains from the destination may also vary with time. For example, the sooner the destination gets the message, the more reward the source may obtain. The goal of this paper is to explore efficient ways for the source to maximize its total reward in such complex applications when it uses the probabilistic two-hop routing policy. We first propose a theoretical framework that can be used to evaluate the total reward the source can obtain. Then, based on the model, we prove by Pontryagin's Maximum Principle that the optimal forwarding policy conforms to the threshold form. Simulations based on both synthetic and real motion traces show the accuracy of our theoretical framework. Furthermore, extensive numerical results demonstrate that the optimal forwarding policy with threshold form performs better, which conforms to the result obtained by the Maximum Principle.

Singularity Avoidance Path Planning on Cooperative Task of Dual Manipulator Using DDPG Algorithm

  • 이종학;김경수;김윤재;이장명
    • 로봇학회논문지
    • /
    • Vol. 16, No. 2
    • /
    • pp.137-146
    • /
    • 2021
  • When controlling a manipulator, a degree of freedom is lost at a singularity, so a specific joint velocity does not propagate to the end effector. In addition, a control problem occurs because the inverse of the Jacobian matrix cannot be calculated. To avoid singularities, we apply Deep Deterministic Policy Gradient (DDPG), a reinforcement learning algorithm that rewards behavior according to actions and then determines high-reward actions in simulation. DDPG is off-policy: it uses an 𝝐-greedy policy to select the action at the current time step and a greedy policy for the next step. In the simulation, learning proceeds by giving a negative reward for moving near a singularity and a positive reward for moving away from the singularity and toward the target point. The reward equation consists of the distance to the target point and to the singularity, manipulability, and an arrival flag. Dual-arm manipulators hold a long rod at the same time, and experiments are conducted to avoid singularities along the simulated path. If the object to be avoided is set as a space rather than a point in the learning process, obstacle avoidance is expected to become possible in future research.
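The reward ingredients named in this abstract (distance to the target, manipulability, an arrival flag) can be illustrated on a planar two-link arm. This is a hedged sketch, not the paper's reward equation: the arm model, the weights `w_dist` and `w_manip`, the `arrival_bonus`, and the tolerance `tol` are all assumptions chosen for illustration. Manipulability here is Yoshikawa's measure sqrt(det(J Jᵀ)), which for a square Jacobian equals |det J| and drops to zero at a singularity.

```python
import math
from typing import List, Tuple


def jacobian_2link(q1: float, q2: float,
                   l1: float = 1.0, l2: float = 1.0) -> List[List[float]]:
    """Jacobian of the end-effector position of a planar two-link arm."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def manipulability(J: List[List[float]]) -> float:
    """Yoshikawa measure; for a 2x2 Jacobian it reduces to |det J|."""
    return abs(J[0][0] * J[1][1] - J[0][1] * J[1][0])


def end_effector(q1: float, q2: float,
                 l1: float = 1.0, l2: float = 1.0) -> Tuple[float, float]:
    """Forward kinematics of the same two-link arm."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


def reward(q1: float, q2: float, target: Tuple[float, float],
           w_dist: float = 1.0, w_manip: float = 0.5,
           arrival_bonus: float = 10.0, tol: float = 0.05) -> float:
    """Hypothetical shaped reward: penalize distance, favor high manipulability,
    and add a bonus once the end effector is within `tol` of the target."""
    x, y = end_effector(q1, q2)
    dist = math.hypot(target[0] - x, target[1] - y)
    r = -w_dist * dist + w_manip * manipulability(jacobian_2link(q1, q2))
    if dist < tol:
        r += arrival_bonus
    return r
```

With unit link lengths the manipulability is |sin(q2)|, so a fully stretched arm (q2 = 0) scores zero on that term while a right-angle elbow scores the maximum.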

Analysis of the Effects of Information Security Policy Awareness, Information Security Involvement, and Compliance Behavioral Intention on Information Security Behavior: Focusing on Reward and Fairness

  • 허성호;황인호
    • 융합정보논문지
    • /
    • Vol. 10, No. 12
    • /
    • pp.91-99
    • /
    • 2020
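  • The purpose of this study is to analyze the influence of information security policy awareness, information security involvement, and compliance intention on information security behavior. The research method used a crossed design of a reward dimension and a fairness dimension, and focused on the process by which organization-level policy emerges as information security compliance intention through the information-processing stages that occur at the level of individual decision making. The results showed that the reward dimension had a significant effect on compliance intention, and that the influence of the psychological condition was greater than that of the material condition. The fairness dimension had a significant effect on information security policy awareness, information security involvement, and information security behavior, and the influence of the equity condition was greater than that of the equality condition. The resulting model was confirmed to be a complex mediation model reconstructed from the measured variables, and the research directions needed from the viewpoint of synergy between the cultural environments of individuals and organizations were discussed.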

Visual Object Manipulation Based on Exploration Guided by Demonstration

  • 김두준;조현준;송재복
    • 로봇학회논문지
    • /
    • Vol. 17, No. 1
    • /
    • pp.40-47
    • /
    • 2022
  • A reward function suitable for a task is required to manipulate objects through reinforcement learning. However, it is difficult to design the reward function if ample information about the objects cannot be obtained. In this study, a demonstration-based object manipulation algorithm called stochastic exploration guided by demonstration (SEGD) is proposed to solve the design problem of the reward function. SEGD is a reinforcement learning algorithm in which a sparse reward explorer (SRE) and an interpolated policy using demonstration (IPD) are added to soft actor-critic (SAC). SRE ensures the training of the critic of SAC by collecting prior data, and IPD limits the exploration space by making SEGD's action similar to the expert's action. Through these two components, SEGD can learn with only the sparse reward of the task, without designing the reward function. To verify SEGD, experiments were conducted for three tasks. SEGD showed its effectiveness by achieving success rates of more than 96.5% in these experiments.

Exploring reward efficacy in traffic management using deep reinforcement learning in intelligent transportation system

  • Paul, Ananya;Mitra, Sulata
    • ETRI Journal
    • /
    • Vol. 44, No. 2
    • /
    • pp.194-207
    • /
    • 2022
  • In the last decade, substantial progress has been achieved in intelligent traffic control technologies to overcome the persistent difficulties of traffic congestion and its adverse effect on smart cities. Edge computing is one such advance, facilitating real-time data transmission among vehicles and roadside units to mitigate congestion. An edge computing-based deep reinforcement learning system is demonstrated in this study that appropriately designs a multiobjective reward function for optimizing different objectives. The system seeks to overcome the challenge of evaluating actions with a simple numerical reward. The selection of reward functions has a significant impact on agents' ability to acquire the ideal behavior for managing multiple traffic signals in a large-scale road network. To ascertain effective reward functions, the agent is trained using the proximal policy optimization method in several deep neural network models, including the state-of-the-art transformer network. The system is verified using both hypothetical scenarios and real-world traffic maps. The comprehensive simulation outcomes demonstrate the potency of the suggested reward functions.

A Study on the Effects of Newly Appointed Coast Guard Officers' Personality Factors and Compensation Factors on PSM

  • 김종길
    • 해양환경안전학회지
    • /
    • Vol. 26, No. 7
    • /
    • pp.838-844
    • /
    • 2020
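  • This study examined whether personality and rewards affect PSM among newly appointed coast guard officers. First, among the sub-variables of the personality factor, neuroticism affected attraction to public policy, commitment to the public interest, and compassion, which are sub-variables of PSM. In addition, among the sub-factors of personality, extraversion affected only self-sacrifice among the PSM sub-variables. Second, reward factors mostly did not affect PSM; among them, intrinsic reward was found to affect compassion among the PSM sub-variables. Since these findings confirm that personality factors influence public service motivation, institutional improvements that can be reflected in recruitment are needed, and further research is called for to verify the relationship between rewards and public service motivation.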

Work Rewards and Occupational Commitment of Hospital Nurses

  • 고종욱;서영준
    • 보건행정학회지
    • /
    • Vol. 12, No. 3
    • /
    • pp.77-98
    • /
    • 2002
  • The purpose of this study is to empirically investigate the determinants of occupational commitment of hospital nurses. For this study, a causal model of occupational commitment of hospital nurses was constructed based on exchange theory. The sample of this study consisted of 329 nurses from S general hospitals located in Seoul and the south-eastern area of Korea. Data were collected with self-administered questionnaires and analyzed using hierarchical multiple regression. It was found that four task reward variables (variety, significance, workload, and resource inadequacy), one social reward variable (supervisory support), and two organizational reward variables (promotional chances and pay) had significant net effects on hospital nurses' occupational commitment. The implications of these findings were discussed and suggestions for future research were advanced.

Deep Reinforcement Learning of Ball Throwing Robot's Policy Prediction

  • 강영균;이철수
    • 로봇학회논문지
    • /
    • Vol. 15, No. 4
    • /
    • pp.398-403
    • /
    • 2020
  • A robot's throwing control is difficult to calculate accurately because of air resistance, rotational inertia, and other factors. This complexity can be addressed by using machine learning. Reinforcement learning with a reward function limits a robot's ability to adapt to a new environment. Therefore, this paper applies deep reinforcement learning using a neural network without a reward function. Each throw is evaluated as a success or a failure. The AI network learns by taking the target position and control policy as input and yielding the evaluation as output. The task is then carried out by predicting the success probability for the target location and control policy and searching for the policy with the highest probability. Repeating this task can improve performance as data accumulates. The model can also predict outcomes for tasks that were not previously attempted, which means it is a universally applicable learning model for any new environment. According to the results from 520 experiments, this learning model guarantees a 75% success rate.
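The predict-then-search loop in this abstract (estimate the success probability for a target and a control policy, then pick the policy with the highest estimate) can be sketched as follows. This is an illustrative stand-in, not the paper's model: a k-nearest-neighbour estimate over logged throws replaces the neural network, and the `Sample` layout, the function names, and the single `launch_speed` parameter are assumptions.

```python
import math
from typing import List, Sequence, Tuple

# One logged throw: (target_distance, launch_speed, success). Hypothetical layout.
Sample = Tuple[float, float, bool]


def predict_success(data: Sequence[Sample], target: float,
                    speed: float, k: int = 3) -> float:
    """Estimate success probability as the success rate of the k nearest throws."""
    nearest = sorted(
        data, key=lambda s: math.hypot(s[0] - target, s[1] - speed)
    )[:k]
    return sum(1 for s in nearest if s[2]) / len(nearest)


def best_policy(data: Sequence[Sample], target: float,
                candidates: Sequence[float], k: int = 3) -> float:
    """Search the candidate launch speeds for the highest predicted success."""
    return max(candidates, key=lambda v: predict_success(data, target, v, k))


if __name__ == "__main__":
    log: List[Sample] = [
        (3.0, 5.3, True), (3.0, 5.4, True), (3.0, 5.5, True),
        (3.0, 1.9, False), (3.0, 2.0, False), (3.0, 2.1, False),
    ]
    print(best_policy(log, target=3.0, candidates=[2.0, 5.4]))
```

Each new throw can be appended to the log, so the estimate sharpens as data accumulates, which mirrors the abstract's point about repeated trials.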

The Effects of Leadership, Appraisal, Reward, and KMS Characteristics on KMS Use

  • 이홍재;박성종
    • 디지털융복합연구
    • /
    • Vol. 10, No. 6
    • /
    • pp.7-15
    • /
    • 2012
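  • To discuss ways of promoting KMS use in government organizations, this study analyzed the structural relationships among agency heads' leadership, institutional factors, KMS characteristics, and KMS use. The empirical analysis showed that agency head leadership has a positive (+) effect on appraisal and reward, and that appraisal also has a significant positive (+) effect on reward. Reward for KMS use and knowledge quality were found to have a significant effect on KMS use. However, agency head leadership, appraisal, and KMS quality were found not to have a significant effect on KMS use. Based on these results, the study presents theoretical and practical implications for KMS use in government organizations.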