• Title/Summary/Keyword: OpenAI Gym


DQN Reinforcement Learning for Acrobot in OpenAI Gym Environment (OpenAI Gym 환경의 Acrobot에 대한 DQN 강화학습)

  • Myung-Ju Kang
    • Proceedings of the Korean Society of Computer Information Conference / 2023.07a / pp.35-36 / 2023
  • In this paper, the Acrobot-v1 environment provided by OpenAI Gym is trained with DQN (Deep Q-Networks) reinforcement learning, and the performance of the activation functions applied during training is compared and analyzed. The activation functions applied to DQN reinforcement learning are ReLU, LeakyReLU, ELU, SELU, and softplus. The experiments show that, on average, the reward was highest when the LeakyReLU activation function was applied, while the maximum reward was obtained with the SELU activation function. (An illustrative sketch of such a comparison follows this entry.)

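The comparison described above can be set up by making the activation function a parameter of the Q-network. The sketch below is only an illustration: the hidden-layer width (64 units) and the `make_q_network` helper are assumptions, not code published with the paper.

```python
import torch.nn as nn

# Activation functions compared in the paper, mapped to their PyTorch modules.
ACTIVATIONS = {
    "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU,
    "elu": nn.ELU,
    "selu": nn.SELU,
    "softplus": nn.Softplus,
}

def make_q_network(obs_dim=6, n_actions=3, hidden=64, activation="relu"):
    """Small Q-network for Acrobot-v1 (6-dim observation, 3 discrete actions)
    with a swappable hidden-layer activation."""
    act = ACTIVATIONS[activation]
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), act(),
        nn.Linear(hidden, hidden), act(),
        nn.Linear(hidden, n_actions),   # linear output: one Q-value per action
    )

# One network per activation; each would be trained with the same DQN procedure
# and the episode rewards compared afterwards.
q_nets = {name: make_q_network(activation=name) for name in ACTIVATIONS}
```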

Comparison of Activation Functions of Reinforcement Learning in OpenAI Gym Environments (OpenAI Gym 환경에서 강화학습의 활성화함수 비교 분석)

  • Myung-Ju Kang
    • Proceedings of the Korean Society of Computer Information Conference / 2023.01a / pp.25-26 / 2023
  • In this paper, an agent is trained with reinforcement learning on the CartPole-v1 environment provided by OpenAI Gym, and the performance of the activation functions used during training is compared and analyzed. The activation functions considered are Sigmoid, ReLU, LeakyReLU, and softplus, and the rewards obtained when each is applied to DQN (Deep Q-Networks) reinforcement learning are compared. The experiments show that the reward was highest when the ReLU activation function was applied. (A sketch of the surrounding training loop follows this entry.)

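Around such a network, a DQN agent interacts with CartPole-v1 roughly as sketched below. This is a generic epsilon-greedy loop using the classic `gym` API (newer Gymnasium releases return `(obs, info)` from `reset()` and add a `truncated` flag to `step()`); the replay buffer and the gradient update are omitted for brevity and the layer sizes are assumptions.

```python
import random
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))           # 4-dim observation, 2 actions

epsilon = 0.1
state = env.reset()                # classic gym API; Gymnasium returns (obs, info)
for step in range(500):
    if random.random() < epsilon:                 # explore
        action = env.action_space.sample()
    else:                                         # exploit current Q-estimates
        with torch.no_grad():
            q = q_net(torch.as_tensor(state, dtype=torch.float32))
        action = int(q.argmax())
    state, reward, done, info = env.step(action)  # Gymnasium adds a 'truncated' flag
    # ... store the transition in a replay buffer and take a gradient step here ...
    if done:
        state = env.reset()
```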

Experimental Analysis of A3C and PPO in the OpenAI Gym Environment (OpenAI Gym 환경에서 A3C와 PPO의 실험적 분석)

  • Hwang, Gyu-Young; Lim, Hyun-Kyo; Heo, Joo-Seong; Han, Youn-Hee
    • Proceedings of the Korea Information Processing Society Conference / 2019.05a / pp.545-547 / 2019
  • Policy-gradient learning is an actively researched topic in reinforcement learning. This paper presents a comparative analysis of the learning performance of the policy-gradient Asynchronous Advantage Actor-Critic (A3C) and Proximal Policy Optimization (PPO) algorithms in the 'CartPole-v0' and 'Pendulum-v0' environments of OpenAI Gym. Conditions that the two algorithms can share, such as the deep learning model, were kept as identical as possible, and the change in score over the course of the episodes was measured. The experiments confirm that PPO outperforms A3C in both environments. (A sketch of PPO's clipped objective follows this entry.)
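For orientation, the ingredient that most distinguishes PPO from vanilla actor-critic methods such as A3C is its clipped surrogate objective. A minimal PyTorch sketch is given below; the clipping coefficient 0.2 is the common default, not a value taken from this paper.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # negate for gradient ascent
```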

DQN Reinforcement Learning for Mountain-Car in OpenAI Gym Environment (OpenAI Gym 환경의 Mountain-Car에 대한 DQN 강화학습)

  • Myung-Ju Kang
    • Proceedings of the Korean Society of Computer Information Conference / 2024.01a / pp.375-377 / 2024
  • In this paper, DQN (Deep Q-Networks) reinforcement learning is applied to the Mountain-Car-v0 game, which can be controlled with a simple program in the OpenAI Gym environment. The DQN network used consists of one input layer, three hidden layers, and one output layer; ReLU is applied as the activation function in the input and hidden layers, and a linear activation is used in the output layer. The experiments examine the reward obtained in each episode of DQN training on Mountain-Car-v0 and analyze how often the reward falls into each reward interval. The results show that 85 out of 100 episodes achieved a reward of 50 or more. (A sketch of the described network follows this entry.)

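A PyTorch sketch of the network described in the abstract is given below. Only the number of layers and the activations are stated in the abstract, so the hidden-layer width (64 units) is an assumption; the environment is registered in Gym as "MountainCar-v0".

```python
import torch.nn as nn

# One input layer, three hidden layers with ReLU, and a linear output layer.
# MountainCar-v0 has a 2-dimensional observation (position, velocity) and
# 3 discrete actions, so the output layer produces 3 Q-values.
q_network = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),    # input layer -> first hidden layer
    nn.Linear(64, 64), nn.ReLU(),   # second hidden layer
    nn.Linear(64, 64), nn.ReLU(),   # third hidden layer
    nn.Linear(64, 3),               # linear output layer
)
```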

Max-Mean N-step Temporal-Difference Learning Using Multi-Step Return (멀티-스텝 누적 보상을 활용한 Max-Mean N-Step 시간차 학습)

  • Hwang, Gyu-Young; Kim, Ju-Bong; Heo, Joo-Seong; Han, Youn-Hee
    • KIPS Transactions on Computer and Communication Systems / v.10 no.5 / pp.155-162 / 2021
  • n-step TD learning combines the Monte Carlo method with one-step TD learning. When an appropriate n is selected, n-step TD learning is known to perform better than both the Monte Carlo method and one-step TD learning, but selecting the best value of n is difficult. To address this difficulty, this paper uses two observations: overestimation of Q can improve early learning, and all n-step returns take similar values when Q ≈ Q*. Based on these, we propose a new learning target composed of the maximum and the mean of all k-step returns for 1 ≤ k ≤ n. Finally, in OpenAI Gym's Atari game environments, we compare the proposed algorithm with n-step TD learning and show that it is superior. (A sketch of the target computation follows this entry.)
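A rough sketch of the proposed learning target is given below. The abstract does not state exactly how the maximum and the mean of the k-step returns are weighted, so an equal-weight average is shown purely as an assumption.

```python
import numpy as np

def k_step_returns(rewards, bootstrap_q, gamma=0.99):
    """All k-step returns for k = 1..n from one n-step segment.

    rewards[i] is the reward r_{t+i+1}; bootstrap_q[i] is max_a Q(s_{t+i+1}, a),
    the bootstrap value of the state reached after i+1 steps.
    """
    returns, g = [], 0.0
    for k, r in enumerate(rewards, start=1):
        g += gamma ** (k - 1) * r                       # discounted reward sum up to step k
        returns.append(g + gamma ** k * bootstrap_q[k - 1])
    return np.array(returns)

def max_mean_target(rewards, bootstrap_q, gamma=0.99):
    g = k_step_returns(rewards, bootstrap_q, gamma)
    return 0.5 * (g.max() + g.mean())                   # assumed equal-weight combination
```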

Uncertainty Sequence Modeling Approach for Safe and Effective Autonomous Driving (안전하고 효과적인 자율주행을 위한 불확실성 순차 모델링)

  • Yoon, Jae Ung; Lee, Ju Hong
    • Smart Media Journal / v.11 no.9 / pp.9-20 / 2022
  • Deep reinforcement learning (RL) is an end-to-end, data-driven control method that is widely used in the autonomous driving domain. However, conventional RL approaches are difficult to apply to autonomous driving tasks because of problems such as inefficiency, instability, and uncertainty, all of which matter greatly in this domain. Although recent studies have attempted to solve these problems, they are computationally expensive and rely on special assumptions. In this paper, we propose a new algorithm, MCDT, that addresses inefficiency, instability, and uncertainty by introducing uncertainty sequence modeling to the autonomous driving domain. The sequence modeling approach, which views reinforcement learning as a problem of generating decisions that obtain high rewards, avoids the disadvantages of existing studies and provides efficiency and stability, and it also accounts for safety by integrating uncertainty estimation techniques. The proposed method was tested in the OpenAI Gym CarRacing environment, and the experimental results show that the MCDT algorithm is more efficient, stable, and safe than existing reinforcement learning methods.

How the Learning Speed and Tendency of Reinforcement Learning Agents Change with Prior Knowledge (사전 지식에 의한 강화학습 에이전트의 학습 속도와 경향성 변화)

  • Kim, Jisoo; Lee, Eun Hun; Kim, Hyeoncheol
    • Proceedings of the Korea Information Processing Society Conference / 2020.05a / pp.512-515 / 2020
  • Research is being actively conducted to make reinforcement learning, which learns slowly, usable as a general-purpose tool. Providing prior knowledge can speed up learning, but there is a risk that the prior knowledge provided is wrong. This study examines how uncertain or incorrect prior knowledge affects learning. Experiments were conducted in Gamble, Cliff, and Maze environments built with the OpenAI Gym library. The results confirm that prior knowledge can give the agent's behavior a particular tendency. We also investigated how much incorrect prior knowledge hinders learning in path finding. (A skeleton of such a custom environment follows this entry.)
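The Gamble, Cliff, and Maze environments were built on the OpenAI Gym library; a skeleton of what such a custom `gym.Env` can look like is sketched below. The 5x5 grid, state encoding, and reward values are invented for illustration and are not taken from the paper.

```python
import gym
from gym import spaces

class MazeEnv(gym.Env):
    """Toy maze-like grid environment following the classic Gym interface."""

    def __init__(self, size=5):
        super().__init__()
        self.size = size
        self.observation_space = spaces.Discrete(size * size)  # flattened cell index
        self.action_space = spaces.Discrete(4)                  # up, down, left, right
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self._obs()

    def step(self, action):
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        dr, dc = moves[action]
        r, c = self.pos
        self.pos = (min(max(r + dr, 0), self.size - 1),
                    min(max(c + dc, 0), self.size - 1))
        done = self.pos == (self.size - 1, self.size - 1)        # goal in the far corner
        reward = 1.0 if done else -0.01                          # small per-step penalty
        return self._obs(), reward, done, {}

    def _obs(self):
        return self.pos[0] * self.size + self.pos[1]
```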

Random Balance between Monte Carlo and Temporal Difference in off-policy Reinforcement Learning for Less Sample-Complexity (오프 폴리시 강화학습에서 몬테 칼로와 시간차 학습의 균형을 사용한 적은 샘플 복잡도)

  • Kim, Chayoung; Park, Seohee; Lee, Woosik
    • Journal of Internet Computing and Services / v.21 no.5 / pp.1-7 / 2020
  • Deep neural networks (DNN), which are used as function approximators in reinforcement learning (RL), can in theory yield realistic results. In empirical benchmark work, temporal-difference learning (TD) generally gives better results than Monte Carlo learning (MC). However, some previous works show that MC is better than TD when rewards are very sparse or delayed, and other recent research shows that MC prediction is superior to TD-based methods when the agent's observations of the environment are partial in complex control tasks. Most of these settings can be regarded as 5-step or 20-step Q-learning, where experiments proceed without long roll-outs in order to alleviate performance degradation. In other words, for networks exposed to noise, and regardless of how roll-outs are controlled, it is better to learn with MC, which is more robust to noisy rewards than TD, or with targets nearly identical to MC. These studies break with the conventional view that TD is better than MC and suggest that combining MC and TD can outperform either alone. Therefore, based on these previous results, this study exploits a random balance between TD and MC in RL, without the complicated reward formulations used in earlier work. By comparing a DQN that uses a random mixture of MC and TD targets with a standard DQN that uses only TD-based learning, we demonstrate through experiments in OpenAI Gym that even well-performing TD learning benefits from mixing in MC. (A sketch of one such mixed target follows this entry.)
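One simple way to realise such a random balance is to draw a random weight per update and mix a Monte Carlo return with a one-step TD target, as sketched below; the paper's exact mixing scheme may differ.

```python
import random

def mixed_target(rewards, next_q_max, gamma=0.99):
    """Randomly weighted mixture of a Monte Carlo and a one-step TD target.

    rewards: rewards from the current step to the end of the episode;
    next_q_max: max_a Q(s', a) for the next state, used by the TD target.
    """
    mc_return = sum(gamma ** i * r for i, r in enumerate(rewards))  # full roll-out
    td_target = rewards[0] + gamma * next_q_max                     # one-step bootstrap
    w = random.random()                                             # uniform in [0, 1)
    return w * mc_return + (1.0 - w) * td_target
```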

Q-Learning Policy and Reward Design for Efficient Path Selection (효율적인 경로 선택을 위한 Q-Learning 정책 및 보상 설계)

  • Yong, Sung-Jung; Park, Hyo-Gyeong; You, Yeon-Hwi; Moon, Il-Young
    • Journal of Advanced Navigation Technology / v.26 no.2 / pp.72-77 / 2022
  • Among reinforcement learning techniques, Q-Learning learns optimal policies by learning a Q function that, for an action taken in a given state, predicts the expected future return. Q-Learning is widely used as a basic algorithm for reinforcement learning. In this paper, we study how effectively efficient paths can be selected and learned by designing policies and rewards based on Q-Learning. In addition, the existing algorithm, a punishment-compensation policy, and the proposed punishment-reinforcement policy were compared under the same number of training runs in the 8x8 grid environment of the Frozen Lake game. The comparison shows that the Q-Learning punishment-reinforcement policy proposed in this paper can significantly increase learning speed compared to conventional algorithms. (A minimal tabular Q-Learning sketch follows this entry.)
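For orientation, a minimal tabular Q-Learning loop on the 8x8 Frozen Lake grid is sketched below using the classic `gym` API (the environment id is "FrozenLake8x8-v0" in older releases). The hyperparameters and the default reward scheme are generic defaults, not the policy or reward design proposed in the paper.

```python
import numpy as np
import gym

env = gym.make("FrozenLake8x8-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1          # learning rate, discount, exploration

for episode in range(5000):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection from the Q-table
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, info = env.step(action)
        # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```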

Q-Learning Policy Design to Speed Up Agent Training (에이전트 학습 속도 향상을 위한 Q-Learning 정책 설계)

  • Yong, Sung-jung; Park, Hyo-gyeong; You, Yeon-hwi; Moon, Il-young
    • Journal of Practical Engineering Education / v.14 no.1 / pp.219-224 / 2022
  • Q-Learning is widely used as a basic algorithm for reinforcement learning. It trains the agent toward maximizing reward through greedy actions that select the highest-valued of the actions available in the current state. In this paper, we studied a policy that can speed up agent training with Q-Learning in the Frozen Lake 8×8 grid environment. In addition, the training results of the standard Q-Learning algorithm and of an algorithm that gives agent movement a 'direction' attribute were compared. The analysis shows that the Q-Learning policy proposed in this paper can significantly increase both accuracy and training speed compared to the general algorithm.