• Title/Summary/Keyword: Reward Policy


Generating Cooperative Behavior by Multi-Agent Profit Sharing on the Soccer Game

  • Miyazaki, Kazuteru;Terada, Takashi;Kobayashi, Hiroaki
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 2003.09a
    • /
    • pp.166-169
    • /
    • 2003
  • Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment using a reward and a penalty as clues. Q-learning [8], a representative reinforcement learning system, treats a reward and a penalty at the same time, which raises the problem of how to decide appropriate reward and penalty values. The Penalty Avoiding Rational Policy Making algorithm (PARP) [4] and the Penalty Avoiding Profit Sharing (PAPS) [2] are reinforcement learning systems that treat a reward and a penalty independently. Though PAPS is a descendant algorithm of PARP, both PARP and PAPS tend to learn a locally optimal policy. To overcome this, in this paper, we propose the Multi Best method (MB), which is PAPS combined with the multi-start method [5]. MB selects the best policy from several policies learned by PAPS agents. By applying PS, PAPS, and MB to a soccer game environment based on the SoccerBots [9], we show that MB is the best solution for the soccer game environment.
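The multi-start selection the abstract describes can be sketched minimally: train several policies from independent random initializations and keep the one with the best evaluated return. The `train` and `env` callables here are hypothetical placeholders, not the paper's PAPS implementation.

```python
import random

def evaluate(policy, env, episodes=20):
    """Average return of a policy over several evaluation episodes."""
    return sum(env(policy) for _ in range(episodes)) / episodes

def multi_best(train, env, n_starts=5, seed=0):
    """Multi-start selection: train several policies from independent
    random seeds and keep the one with the highest evaluated return."""
    rng = random.Random(seed)
    policies = [train(rng.random()) for _ in range(n_starts)]
    return max(policies, key=lambda p: evaluate(p, env))
```

With a stochastic learner that can get stuck in local optima, taking the maximum over independent runs is what lets MB escape the local-optimum tendency of a single PAPS run.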


Optimal Control Of Two-Hop Routing In Dtns With Time-Varying Selfish Behavior

  • Wu, Yahui;Deng, Su;Huang, Hongbin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.6 no.9
    • /
    • pp.2202-2217
    • /
    • 2012
  • The transmission opportunities between nodes in Delay Tolerant Networks (DTNs) are uncertain, and routing algorithms in DTNs often need nodes to serve as relays that carry and forward messages for others. Due to selfishness, nodes may ask the source to pay a certain reward, and the reward may vary with time. Moreover, the reward that the source obtains from the destination may also vary with time; for example, the sooner the destination gets the message, the greater the reward the source may obtain. The goal of this paper is to explore efficient ways for the source to maximize its total reward in such complex applications when it uses the probabilistic two-hop routing policy. We first propose a theoretical framework that can be used to evaluate the total reward the source can obtain. Based on this model, we then prove via Pontryagin's Maximum Principle that the optimal forwarding policy conforms to the threshold form. Simulations based on both synthetic and real motion traces show the accuracy of our theoretical framework. Furthermore, extensive numerical results demonstrate that the optimal forwarding policy with threshold form performs better, which conforms to the result obtained by the Maximum Principle.
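The "threshold form" proved above has a simple shape that can be sketched as follows; the function and its parameter names are illustrative, not the paper's notation.

```python
def threshold_policy(t, threshold, p_max=1.0):
    """Forwarding probability in threshold form: forward at the
    maximum allowed probability before the threshold time, then stop."""
    return p_max if t < threshold else 0.0
```

Intuitively, early forwarding is worth the relay reward because the destination-side reward is still high; past the threshold, the expected payoff no longer covers the cost of paying relays, so the source stops forwarding.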

Singularity Avoidance Path Planning on Cooperative Task of Dual Manipulator Using DDPG Algorithm (DDPG 알고리즘을 이용한 양팔 매니퓰레이터의 협동작업 경로상의 특이점 회피 경로 계획)

  • Lee, Jonghak;Kim, Kyeongsoo;Kim, Yunjae;Lee, Jangmyung
    • The Journal of Korea Robotics Society
    • /
    • v.16 no.2
    • /
    • pp.137-146
    • /
    • 2021
  • When controlling a manipulator, a degree of freedom is lost at a singularity, so a specific joint velocity does not propagate to the end effector. In addition, a control problem occurs because the Jacobian inverse cannot be calculated. To avoid singularities, we apply Deep Deterministic Policy Gradient (DDPG), a reinforcement learning algorithm that rewards behavior according to actions and then determines high-reward actions in simulation. DDPG is off-policy: it uses an 𝝐-greedy policy for selecting the action of the current time step and a greedy policy for the next step. In the simulation, a negative reward is given when moving near a singularity, and a positive reward is given when moving away from the singularity and toward the target point. The reward equation consists of the distance to the target point, the distance to the singularity, manipulability, and an arrival flag. The dual-arm manipulators hold a long rod at the same time, and experiments are conducted to avoid singularities along the simulated path. If the object to be avoided is set as a region rather than a point in the learning process, avoidance of obstacles is expected to be possible in future research.
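The ε-greedy action selection and the four-term reward the abstract lists can be sketched as below. The weight vector `w` and the exact functional form are assumptions for illustration; the paper's actual reward equation is not reproduced here.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Behavior policy: explore a random action with probability epsilon,
    otherwise pick the greedy (highest-value) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def shaped_reward(d_target, d_singular, manipulability, arrived,
                  w=(1.0, 1.0, 0.5, 10.0)):
    """Illustrative reward combining the four terms named in the abstract:
    penalize distance to the target and proximity to the singularity,
    reward manipulability and arrival (weights are hypothetical)."""
    return (-w[0] * d_target - w[1] / (d_singular + 1e-6)
            + w[2] * manipulability + w[3] * float(arrived))
```

With this shape, the reward grows as the arm moves away from the singularity and toward the target, matching the sign convention described in the abstract.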

Analysis of the Effects of Information Security Policy Awareness, Information Security Involvement, and Compliance Behavioral Intention on Information Security Behavior: Focusing on Reward and Fairness (정보보안 정책 인식과 정보보안 관여성, 준수 의도성이 정보보안 행동에 미치는 영향 분석: 보상 차원과 공정성 차원을 중심으로)

  • Hu, Sung-ho;Hwang, In-ho
    • Journal of Convergence for Information Technology
    • /
    • v.10 no.12
    • /
    • pp.91-99
    • /
    • 2020
  • The aim of this study is to assess the effect of information security policy awareness, information security involvement, and compliance behavioral intention on information security behavior. The research method consists of a cross-sectional design of reward and fairness. This paper focuses on the process by which organizational policy shapes information security compliance intention in individual decision-making. As a result, reward had a significant effect on compliance behavioral intention, and the influence of the psychological reward-based condition was greater than that of the material reward-based condition. Fairness had a significant effect on information security policy awareness, information security involvement, and information security behavior, and the influence of the equity-based condition was greater than that of the equality-based condition. The exploratory model was verified as a multiple mediation model. In addition, the discussion presents the research directions needed from the perspective of synergy between the cultural environments of individuals and organizations.

Visual Object Manipulation Based on Exploration Guided by Demonstration (시연에 의해 유도된 탐험을 통한 시각 기반의 물체 조작)

  • Kim, Doo-Jun;Jo, HyunJun;Song, Jae-Bok
    • The Journal of Korea Robotics Society
    • /
    • v.17 no.1
    • /
    • pp.40-47
    • /
    • 2022
  • A reward function suitable for a task is required to manipulate objects through reinforcement learning. However, it is difficult to design the reward function if ample information about the objects cannot be obtained. In this study, a demonstration-based object manipulation algorithm called stochastic exploration guided by demonstration (SEGD) is proposed to solve the design problem of the reward function. SEGD is a reinforcement learning algorithm in which a sparse reward explorer (SRE) and an interpolated policy using demonstration (IPD) are added to soft actor-critic (SAC). SRE ensures the training of the critic of SAC by collecting prior data, and IPD limits the exploration space by making SEGD's action similar to the expert's action. Through these two components, SEGD can learn from only the sparse reward of the task, without designing a reward function. To verify SEGD, experiments were conducted on three tasks, in which SEGD showed its effectiveness by achieving success rates of more than 96.5%.
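The IPD idea of keeping the learner's action close to the expert's can be sketched as a simple interpolation; this blending rule and the `alpha` parameter are assumptions for illustration, not the paper's exact IPD formulation.

```python
def interpolated_action(policy_action, expert_action, alpha):
    """Blend the learner's action toward the demonstrated expert action;
    alpha=1 copies the expert, alpha=0 leaves the policy untouched.
    Shrinking toward the demonstration limits the exploration space."""
    return [(1 - alpha) * p + alpha * e
            for p, e in zip(policy_action, expert_action)]
```

Keeping actions near the demonstration means the sparse task reward is encountered often enough for the critic to learn, even without a dense hand-designed reward.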

Exploring reward efficacy in traffic management using deep reinforcement learning in intelligent transportation system

  • Paul, Ananya;Mitra, Sulata
    • ETRI Journal
    • /
    • v.44 no.2
    • /
    • pp.194-207
    • /
    • 2022
  • In the last decade, substantial progress has been achieved in intelligent traffic control technologies to overcome the persistent difficulties of traffic congestion and its adverse effect on smart cities. Edge computing is one such advance, facilitating real-time data transmission among vehicles and roadside units to mitigate congestion. An edge computing-based deep reinforcement learning system is demonstrated in this study that appropriately designs a multiobjective reward function for optimizing different objectives. The system seeks to overcome the challenge of evaluating actions with a simple numerical reward. The selection of reward functions has a significant impact on agents' ability to acquire the ideal behavior for managing multiple traffic signals in a large-scale road network. To ascertain effective reward functions, the agent is trained using the proximal policy optimization method in several deep neural network models, including the state-of-the-art transformer network. The system is verified using both hypothetical scenarios and real-world traffic maps. The comprehensive simulation outcomes demonstrate the potency of the suggested reward functions.
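A multiobjective reward of the kind described typically folds several traffic costs into one scalar; the objectives and weights below are hypothetical examples, not the paper's actual reward design.

```python
def multi_objective_reward(queue_length, waiting_time, stops,
                           weights=(1.0, 0.5, 0.2)):
    """Combine several traffic costs into one scalar reward; each
    objective is a cost, so the reward is their negated weighted sum."""
    return -(weights[0] * queue_length
             + weights[1] * waiting_time
             + weights[2] * stops)
```

The choice of weights is exactly the "selection of reward functions" the abstract says matters: the same agent trained under different weightings converges to different signal-control behavior.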

A Study on the Effects of Newly Appointed Coast Guard Officers Personality Factors and Compensation Factors on PSM (신임해양경찰관의 성격 요인과 보상 요인이 PSM에 미치는 영향에 관한 연구)

  • Kim, Jong-Gil
    • Journal of the Korean Society of Marine Environment & Safety
    • /
    • v.26 no.7
    • /
    • pp.838-844
    • /
    • 2020
  • The purpose of this study is to examine whether personality and reward affect the PSM of newly recruited maritime police officers. The results are as follows. First, among the personality factors, the neuroticism sub-factor had a significant positive effect on the PSM sub-factors of favorable perception of public policy, commitment to the public interest, and empathy. The extroversion sub-factor had a significant effect only on the self-sacrifice sub-factor of PSM. Second, the reward factor mostly did not have a significant effect on PSM; only the internal reward sub-factor had a significant effect on the empathy sub-factor of PSM. These results confirm that personality factors have a significant impact on public service motivation. This implies the need for policy improvements reflecting these results, and for further empirical studies on the relationship between reward and public service motivation.

Work Rewards and Occupational Commitment of Hospital Nurses (종합병원 간호사들의 노동보상과 직업몰입에 관한 연구)

  • 고종욱;서영준
    • Health Policy and Management
    • /
    • v.12 no.3
    • /
    • pp.77-98
    • /
    • 2002
  • The purpose of this study is to empirically investigate the determinants of the occupational commitment of hospital nurses. For this study, a causal model of the occupational commitment of hospital nurses was constructed based on exchange theory. The sample consisted of 329 nurses from S general hospitals located in Seoul and the south-eastern area of Korea. Data were collected with self-administered questionnaires and analyzed using hierarchical multiple regression. It was found that four task reward variables (variety, significance, workload, and resource inadequacy), one social reward variable (supervisory support), and two organizational reward variables (promotional chances and pay) had significant net effects on hospital nurses' occupational commitment. The implications of these findings were discussed, and suggestions for future research were advanced.

Deep Reinforcement Learning of Ball Throwing Robot's Policy Prediction (공 던지기 로봇의 정책 예측 심층 강화학습)

  • Kang, Yeong-Gyun;Lee, Cheol-Soo
    • The Journal of Korea Robotics Society
    • /
    • v.15 no.4
    • /
    • pp.398-403
    • /
    • 2020
  • A robot's throwing control is difficult to calculate accurately because of air resistance, rotational inertia, and other effects. This complexity can be addressed with machine learning. Reinforcement learning with a hand-designed reward function, however, limits a robot's ability to adapt to new environments. Therefore, this paper applies deep reinforcement learning using a neural network without a reward function. Each throw is evaluated as a success or failure. The network learns by taking the target position and control policy as input and yielding the evaluation as output. The task is then carried out by predicting the success probability according to the target location and control policy and searching for the policy with the highest probability. Repeating this task improves performance as data accumulate, and the model can even predict tasks that were not previously attempted, which makes it a universally applicable learning model for new environments. According to the data from 520 experiments, this learning model achieved a 75% success rate.
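The predict-then-search loop described above reduces, at execution time, to scoring candidate policies with the learned success predictor and picking the best one. The `predict_success` callable stands in for the trained network and is a placeholder, not the paper's model.

```python
def best_policy(predict_success, target, candidates):
    """Search over candidate control policies and return the one with
    the highest predicted success probability for the target position."""
    return max(candidates, key=lambda c: predict_success(target, c))
```

Because the network scores (target, policy) pairs rather than encoding a fixed reward, the same search works unchanged for target positions never attempted during training.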

The Effects of Leadership, Appraisal, Reward, and KMS Characteristics on KMS Use (리더십, 평가 및 보상, KMS특성이 KMS 이용에 미치는 영향)

  • Lee, Hong-Jae;Park, Sung-Jong
    • Journal of Digital Convergence
    • /
    • v.10 no.6
    • /
    • pp.7-15
    • /
    • 2012
  • The purpose of this study is to examine the causal relationships among leadership, appraisal, reward, knowledge quality, knowledge management system (KMS) quality, and KMS use. The results of data analysis by structural equation modeling (SEM) indicate that leadership significantly influences appraisal and reward, and appraisal affects reward as well. Reward and knowledge quality affect KMS use, but leadership, appraisal, and KMS quality do not. Based on these results, the theoretical and practical implications of the study are discussed.