1. Introduction
Deep neural networks (DNNs) used as function approximators in reinforcement learning (RL) [1,2] have greatly broadened the range of real-world environments that can be handled. Theoretical results, however, mostly cover linear function approximators in a narrow class of environments, and their assumptions do not reflect the genuine application domains of deep RL, such as high-dimensional input features and non-linear function approximators. The expressive parameterization of a DNN also raises many practical issues. Training tends to be sensitive to hyper-parameters, and poor hyper-parameter settings lead to unstable, non-convergent, or divergent training. Deep RL [3] also tends to exhibit high sample complexity, which is impractical for real-world problems. Batch policy gradient RL is regarded as more stable than value-based deep RL, but it suffers from high variance and therefore requires large batches, because the variance of its estimates keeps growing with the length of the trajectory. TD-style techniques [4], such as Q-learning and actor-critic, are considered sample-efficient but biased, and they require expensive hyper-parameter tuning for stabilization. Monte Carlo (MC) methods [1,2], which are common in on-policy techniques, provide nearly unbiased policy gradients but can suffer from high variance. To deal with high-variance gradients and such difficult parameterization, there have been valuable previous works, such as mixing value-based back-ups into MC [4]. However, most of these works require a huge number of samples and complicated reward formulations, which is very intensive in terms of Big Data. Off-policy Q-learning [3,5] and off-policy actor-critic [4] can reuse all samples by combining TD-learning with experience replay in a deep neural network, which makes them sample-efficient and useful for Big-Data samples. Still, the convergence of TD-learning is not guaranteed with non-linear function approximators, and the resulting non-convergence and instability still require extensive hyper-parameter tuning and human intervention [6].
In several benchmark works, TD, which combines MC theory and dynamic programming (DP) theory [2,4], has been used because of its empirical rather than theoretical advantages. Some recent works show that finite-horizon MC is slightly superior to TD when rewards are sparse or delayed. Recent research also shows that techniques based on MC prediction can outperform TD-based methods on complex control tasks in partially observable environments. Most of these settings can be regarded as 5-step or 20-step Q-learning, where the experiment proceeds without long roll-outs in order to avoid deterioration of the performance results. In other words, for networks with noise, that is, a representative network regardless of the controlled roll-outs, it is better to learn with MC, which is more robust to noisy rewards than TD, or with a method almost identical to MC. These studies break with the conventional view that TD is better than MC. The key point of the recent results is to suggest ways of combining MC and TD [4]. In value-based deep RL architectures that bootstrap from Big-Data samples, such as DQN [3], TD has been regarded as superior to MC. However, recent empirical research has shown that MC is more stable under noisy and sparse rewards, and that a balance of TD and MC [4] is more practical for training an AI agent. Our study focuses on discrete action sets and algorithms involving the prediction of a value function, which can be learned via a combination of TD and MC so that value-based methods perform better.
Therefore, in this study, based on the results of previous research, we attempt to exploit a random balance between TD and MC in RL without any of the complicated reward formulations used in those works. We also demonstrate that DQN, which already performs well with TD learning, benefits from a random mixture of TD and MC. Our proposed algorithm is compared experimentally with the well-known DQN that uses only TD-based learning, and the results show that the proposed algorithm requires a shorter training time.
2. Background
Reinforcement learning (RL) consists of an artificial intelligence (AI) agent acting in an environment over discrete time-steps. An environment is defined by states s, actions a, a reward function r : S × A → R, a transition probability p(s_{t+1} | s_t, a_t), and a discount factor γ ∈ [0,1]. Let π* denote an optimal policy such that Q^{π*}(s,a) ≥ Q^π(s,a) for every s ∈ S, a ∈ A and any policy π. In Q-learning, the policy π takes the maximizing action in every update over the Big-Data samples. The objective is to find a policy π(a_t | s_t) that maximizes the expected sum of all future rewards through the episode,

R_t = Σ_{i=t}^{T} r_i.  (1)

To avoid divergence for long episodes, distant rewards can be decayed by a discount factor γ or truncated after an explicit number of steps τ (the horizon),

R_t^γ = Σ_{i=t}^{T} γ^{i−t} r_i = r_t + γ r_{t+1} + γ² r_{t+2} + ... ;   R_t^τ = Σ_{i=t}^{t+τ} r_i.  (2)
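As a small illustration of equations (1) and (2), the following Python sketch computes the plain, discounted, and truncated returns from a list of rewards; the function names are ours and serve only as an example.

# Sketch: the returns of equations (1) and (2); helper names are illustrative only.
def total_return(rewards):
    # R_t = sum of all rewards from step t to the end of the episode (eq. 1).
    return sum(rewards)

def discounted_return(rewards, gamma=0.95):
    # R_t^gamma = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... (eq. 2, left).
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def truncated_return(rewards, tau):
    # R_t^tau = sum of the rewards up to the horizon tau (eq. 2, right).
    return sum(rewards[: tau + 1])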
For a given policy π, the value function and the action-value function are defined as the expected returns conditioned on the observation or on the observation-action pair, respectively,

V^π(s_t) = E_π[R_t | s_t],   Q^π(s_t, a_t) = E_π[R_t | s_t, a_t].  (3)

The optimal value and action-value functions are defined as

V*(s_t) = max_π V^π(s_t),   Q*(s_t, a_t) = max_π Q^π(s_t, a_t).  (4)

In value-based RL such as Q-learning, the value or action-value is estimated by a function approximator V with parameters θ. The function approximator is trained by minimizing a loss between the current estimate and a target value,

L(θ) = (V(s_t; θ) − V_target)².  (5)

Updating the target at each step keeps the value or action-value estimates stable.
Monte Carlo (MC) training uses the target V_target = R_t^γ or V_target = R_t^τ. This target requires a forward roll-out before a training step can take place, for example up to step τ for the finite-horizon return R_t^τ or up to the end of the episode for the discounted return R_t^γ. This may increase the variance of the target value, but the target is not biased because it is not approximated. An alternative to MC training is temporal difference (TD) learning, which estimates the return by bootstrapping samples of Big Data from the function approximator after acting for a fixed number of steps n,

V_target = Σ_{i=t}^{t+n−1} γ^{i−t} r_i + γ^n V(s_{t+n}; θ).  (6)

TD learning is used with finite-horizon returns. TD applied to the action-value function is the well-known Q-learning.
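As a sketch under our own naming, the MC target and the n-step TD target of equation (6) can be written as follows; value_fn stands for the approximator V(s; θ) and is an assumption of this example.

# Sketch: MC and n-step TD targets (eqs. 2 and 6); value_fn is a placeholder approximator.
def mc_target(rewards, gamma=0.95):
    # Monte Carlo target: the full discounted return, no bootstrapping, so no bias.
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def n_step_td_target(rewards, bootstrap_state, value_fn, n, gamma=0.95):
    # n-step TD target: n discounted rewards plus a bootstrapped value estimate (eq. 6).
    partial = sum((gamma ** i) * r for i, r in enumerate(rewards[:n]))
    return partial + (gamma ** n) * value_fn(bootstrap_state)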
The classic Q-learning algorithm combined with deep neural networks can oscillate or diverge because the Q-values it estimates are approximated. This limitation is caused by correlated Big-Data samples and continuously repeated updates. Therefore, several benchmark works [1,2,3] use an experience replay memory, sampling random mini-batches of transitions (s_t, a_t, r_t, s_{t+1}) from a memory buffer D, so that training is smoothed over many experiences. This explicitly takes advantage of the deep neural network (DNN) in RL, because all experiences of the trajectories are buffered in D, sampled at random for Q(s_{t+1}, a_{t+1}; θ), and can also be reused by MC training. Figure 1 shows Q-learning combined with a DNN; the resulting architecture is called DQN, which takes advantage of the experience replay memory.
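A minimal sketch of such a replay memory D, assuming a simple deque-backed buffer (the class name and capacity are illustrative, not the exact implementation of [3]):

import random
from collections import deque

# Sketch of an experience replay memory D; the class name and capacity are illustrative.
class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Buffer one transition (s_t, a_t, r_t, s_{t+1}).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Draw a random mini-batch to break temporal correlations.
        return random.sample(self.buffer, batch_size)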
(Figure 1) The DQN algorithm [3]
The deep neural network (DNN) architecture is based on multiple stacked layers of neurons. A neuron applies a non-linear transformation to a linear combination of its Big-Data sample inputs. Normally, the first layer models the data itself, while the stacked hidden layers of the neural network (NN) are arrays of neurons receiving their inputs from the previous layer. The neuron activation functions stacked on top of each other form a composite function, and it has been shown that supervised training of a DNN with such non-linearities is faster. Therefore, most stacked hidden layers use activations such as Rectified Linear Units (ReLU) [7].
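A minimal sketch of such a stacked network in Keras, which is also the library used later for the experiments; the layer sizes are illustrative assumptions rather than the exact architecture of our experiments.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Sketch: a small stacked network with ReLU hidden layers; the sizes are illustrative.
def build_q_network(state_size, action_size, learning_rate=0.001):
    model = Sequential([
        Dense(24, activation="relu", input_shape=(state_size,)),  # first hidden layer
        Dense(24, activation="relu"),                             # second hidden layer
        Dense(action_size, activation="linear"),                  # one Q-value per action
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model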
3. The Proposed Algorithm
For better exploration and bias-variance balance, some researchers suggest a mixture of TD and MC [4], which is applicable to high-dimensional discrete environments that are partially observable or have sparse rewards, by using DQN with replay memories. TD-based training has been used more frequently and effectively for empirical reasons since the breakthrough Atari Breakout results [3]. Based on the research in [4], we propose a technique to exploit a random balance between TD and MC in RL training, specifically in DQN. Figure 2 shows our proposed off-policy RL algorithm, DQN with a Monte Carlo and temporal difference balance at random. The mixture of MC and TD is accomplished using random probabilities β1 and β2 in [0,1] for TD and MC, respectively.
(Figure 2) The proposed algorithm
We attempt to combine Monte Carlo and temporal difference learning with truncated steps (a horizon) through the whole roll-out. Our proposed algorithm differs from the previous research [4]: it is simpler and random, because we target the goal without any complicated formula for the reward over the truncated exploration steps. We use a random probability β1 for performing a gradient descent step of TD and β2 for performing a gradient descent step of MC, where β2 = 1 − β1, so the mixed target is y_j = β1·y_TD + β2·y_MC. The method is fundamentally based on a random policy with probabilities β1 and β2 for TD and MC, respectively. We demonstrate that both TD and MC benefit from our method through an experimental comparison with the classic DQN using only TD on high-dimensional discrete action environments, such as the well-known OpenAI Gym [8].
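A minimal sketch of this balance, assuming y_TD and y_MC have already been computed for a sampled transition; the function is ours and only illustrates the expression y_j = β1·y_TD + β2·y_MC.

# Sketch: random balance of TD and MC targets; beta1 is drawn at random or set by the designer.
def mixed_target(y_td, y_mc, beta1):
    beta2 = 1.0 - beta1           # beta2 is (1 - beta1), as stated above
    return beta1 * y_td + beta2 * y_mc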
3.1 Algorithm Description
< Off-policy RL, DQN with Monte Carlo and Temporal Difference Balance >
1. Initialize the replay memory D and the action-value function Q with random weights.
2. Initialize the sequence with the start state s.
3. The agent follows the greedy policy max_a Q*(φ(s_t), a; θ) or follows another policy with probability ε.
4. For better exploration, an experience tuple (state φ(s), action a, reward r, new state φ(s′)) is stored in D at every training step.
5. Sample a random mini-batch of transitions (state φ(s), action a, reward r, new state φ(s′)) from D.
6. For a non-terminal state, the target is the reward plus max_{a′} Q*(φ(s_{t+1}), a′; θ) decayed by γ; for a terminal state, the target is the current reward r.
7. With random probability β1, perform a gradient descent step on (r_j + γ max_{a′} Q*(φ(s_{j+1}), a′; θ) − Q(φ(s_j), a_j; θ))² with respect to the weights, for the target DQN with the replay memory.
8. Steps 3-7 are repeated for training.
9. Before the next episode, with random probability β2, perform a gradient descent step on (r_j + γ max_{a′} Q*(φ(s_{j+1}), a′; θ) − Q(φ(s_j), a_j; θ))² with respect to the weights, for the target DQN using the whole episode memory at every step.
10. Steps 2-9 are repeated for training.
(Figure 3) The pseudo code of the proposed algorithm
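To make the listed steps concrete, the following Python sketch illustrates the per-step TD part of the procedure (steps 3 to 7). It assumes the hypothetical helpers ReplayMemory and build_q_network sketched in Section 2 and an OpenAI Gym environment, and it is a sketch of the idea rather than the exact implementation of Figure 3.

import random
import numpy as np

# Sketch of steps 3-7: epsilon-greedy acting, replay storage, and a TD gradient step
# taken with probability beta1. Helper names (q_net, memory) follow the earlier sketches.
def act_and_learn_td(env, q_net, memory, state, epsilon, beta1, gamma=0.95, batch_size=64):
    # Step 3: follow max_a Q(s, a) or another (random) policy with probability epsilon.
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        action = int(np.argmax(q_net.predict(state[None, :], verbose=0)[0]))

    # Step 4: act in the environment and store the experience tuple in D.
    next_state, reward, done, _ = env.step(action)
    memory.store(state, action, reward, next_state, done)

    # Steps 5-7: with probability beta1, sample a mini-batch and take a TD gradient step.
    if random.random() < beta1 and len(memory.buffer) >= batch_size:
        batch = memory.sample(batch_size)
        states = np.array([s for s, _, _, _, _ in batch])
        next_states = np.array([ns for _, _, _, ns, _ in batch])
        targets = q_net.predict(states, verbose=0)
        next_q = q_net.predict(next_states, verbose=0)
        for k, (_, a, r, _, d) in enumerate(batch):
            # Step 6: a terminal state uses r; otherwise bootstrap with gamma * max_a' Q.
            targets[k][a] = r if d else r + gamma * np.max(next_q[k])
        q_net.fit(states, targets, epochs=1, verbose=0)

    return next_state, done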
4. Performance Evaluation
We use OpenAI Gym [8] for our proposed algorithm, DQN with a TD and MC balance at random. We consider classic control environments in OpenAI Gym, such as CartPole-v0 [9]. Following the benchmark approaches [1,3], we exploit experience replay for better exploration: Q-learning stores past experiences at each time step in a buffer D, known as the replay memory, and our emulators from OpenAI Gym [8] apply mini-batch updates sampled from D. Given the experience replay memory D, the agent's actions in the emulator follow an ε-greedy policy. For DQN with the replay memory D, we also follow the target Q-network of the breakthrough research in [1,3]. The Q-learning agent calculates the TD error with the current estimated Q-value [2]. Updates of the target network are slower than those of the current network.
In CartPole-v0 [9], a pole is attached by an un-actuated joint to a cart moving along a frictionless track, and the cart is controlled by applying a force of +1 or −1 [9]. The pole starts upright, and the goal is to prevent it from falling over [9]. A reward of +1 is given for every time step in which the pole remains upright [9]. The episode ends when the pole is more than 15° from vertical or the cart moves more than 2.4 units from the center [9]. In our simulation, CartPole-v0 is regarded as solved when the average reward is 490 or more, out of the maximum of 500, over 10 consecutive runs [10]. The agent receives a reward of −100 if the pole falls over before the maximum episode length [10]. Our proposed algorithm with an MC and TD balance at random is implemented using TensorFlow [11] and Keras [12], and we follow similar previous studies [10]. Figure 3 shows the pseudo code of the proposed algorithm. For CartPole-v0 [9] from OpenAI Gym [8], the discount factor GAMMA = 0.95, the learning rate ALPHA = 0.001, the size of the replay buffer D = 10,000, the mini-batch size = 64, the maximum exploration rate = 1.0, the minimum exploration rate = 0.01, and the epsilon decay = 0.995 are the same as in the previous DQN studies [10].
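For reference, the hyper-parameters listed above can be collected in code as follows; the constant names are ours, while the values are those reported in the text.

# Hyper-parameters for CartPole-v0 as reported above; the constant names are illustrative.
GAMMA = 0.95            # discount factor
ALPHA = 0.001           # learning rate
REPLAY_BUFFER_SIZE = 10000
MINI_BATCH_SIZE = 64
EPSILON_MAX = 1.0       # maximum exploration rate
EPSILON_MIN = 0.01      # minimum exploration rate
EPSILON_DECAY = 0.995   # epsilon decay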
In Figure 3, Q_target for TD learning is trained every episode, but G_target for MC learning is trained at the end of the whole episode. The mixed update is accomplished according to the probabilities β1 and β2, which system designers can tune marginally depending on the observables of the environment. For G_target, we have to check the length of the whole episode. Moreover, we suggest that when the roll-out approaches the maximum length, training of the MC-based AI agent can be finished earlier than the maximum length, because the value function is already approximately close to its best prediction, so the rest of the episode is no longer needed.
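A sketch of this end-of-episode MC update, assuming the whole episode has been kept as a simple list of (state, action, reward) tuples; G_target is accumulated backwards over the episode, and the update is taken with probability β2 = 1 − β1.

import random
import numpy as np

# Sketch of the MC-side update: train on G_target over the whole episode with probability beta2.
def learn_mc(q_net, episode, beta2, gamma=0.95):
    if random.random() >= beta2:
        return
    states = np.array([s for s, _, _ in episode])
    targets = q_net.predict(states, verbose=0)
    g_target = 0.0
    # Walk the episode backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(episode))):
        _, action, reward = episode[t]
        g_target = reward + gamma * g_target
        targets[t][action] = g_target
    q_net.fit(states, targets, epochs=1, verbose=0)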
(Figure 4) The best case
(Figure 5) The worst case
Figure 4 and Figure 5 display the best and worst cases of the results, respectively. The proposed algorithm, off-policy RL with DQN and an MC and TD balance at random, yields better results than the classic DQN using only TD learning in most cases. In the best case, our proposed algorithm reaches the maximum reward earlier than the classic DQN, while in the worst case it performs almost the same as the classic DQN. Therefore, we can attempt different deep neural network (DNN) architectures or hyper-parameter settings to demonstrate that our proposed model can enhance exploration with only an MC and TD balance at random. We are convinced that the behavior policy can be improved, and for that purpose we will perform more runs. Table 1 shows a quantitative comparison between our proposed algorithm and the classic DQN using only TD learning. Most cases are similar; however, in terms of the "in score 150" measure, our proposed algorithm is better than the classic DQN using only TD learning. We expect that a sample-efficient policy gradient method such as DDPG could yield even better results. Our next step is to combine MC and TD not only in off-policy RL with DQN but also in policy gradient methods such as DDPG [6,13,14]. Moreover, we will attempt to exploit different types of deep neural networks such as CNNs [14].
(Table 1) Quantitative Comparison
5. Conclusion
In this paper, we have suggested a technique that exploits a random balance between TD and MC in off-policy RL, specifically DQN. We demonstrate DQN with a TD and MC balance at random, trained with a random probability β1 of performing a gradient descent step of TD and β2 of performing a gradient descent step of MC. We exploit this random balance without any complicated formulas, for better exploration and easier deployment. We also demonstrate, through experiments in OpenAI Gym, that a well-performing TD learner also benefits from the mixture of TD and MC. Our proposed method is compared experimentally with the classic DQN using only TD learning, and the results show that the proposed algorithm has a shorter training time. In future work, we will attempt to exploit a random balance of TD and MC in policy gradient RL and with different types of deep neural networks.
☆ A preliminary version of this paper was presented at ICONI 2019 and was selected as an outstanding paper.
References
[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, Vol. 529, No. 7587, pp. 484-489, 2016. https://doi.org/10.1038/nature16961
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Vol. 1, MIT Press, Cambridge, 1998. https://doi.org/10.1016/S1364-6613(99)01331-5
[3] V. Mnih, et al., "Playing Atari with deep reinforcement learning," NIPS Deep Learning Workshop, 2013. http://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
[4] A. Amiranashvili, A. Dosovitskiy, V. Koltun, and T. Brox, "TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning," ICLR 2018. http://arxiv.org/abs/1806.01175
[5] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, "Q-Prop: Sample-efficient policy gradient with an off-policy critic," ICLR 2017. http://arxiv.org/abs/1611.02247
[6] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," ICLR 2016. https://arxiv.org/abs/1509.02971
[7] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," ICML 2010. https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf
[8] OpenAI Gym: https://gym.openai.com
[9] Cart-Pole-V0: https://github.com/openai/gym/wiki/Cart-Pole-v0
[10] Cart-Pole-DQN: https://github.com/rlcode/reinforcement-learning-kr/blob/master/2-cartpole/1-dqn/cartpole_dqn.py, 8 Jul. 2017.
[11] TensorFlow: https://github.com/tensorflow/tensorflow, 31 Oct. 2019.
[12] Keras: https://keras.io/api/, Oct. 2019.
[13] G. Sun, G. O. Boateng, H. Huang, and W. Jiang, "A reinforcement learning framework for autonomous cell activation and customized energy-efficient resource allocation in C-RANs," KSII Transactions on Internet and Information Systems, Vol. 13, No. 8, pp. 3821-3841, 2019. https://doi.org/10.3837/tiis.2019.08.001
[14] R. Mu and X. Zeng, "A review of deep learning research," KSII Transactions on Internet and Information Systems, Vol. 13, No. 4, pp. 1738-1764, 2019. https://doi.org/10.3837/tiis.2019.04.001