http://dx.doi.org/10.3745/KTCCS.2021.10.5.155

Max-Mean N-step Temporal-Difference Learning Using Multi-Step Return  

Hwang, Gyu-Young (Major in Future Convergence Engineering, Department of Computer Engineering, Korea University of Technology and Education)
Kim, Ju-Bong (Major in Future Convergence Engineering, Department of Computer Engineering, Korea University of Technology and Education)
Heo, Joo-Seong (Major in Future Convergence Engineering, Department of Computer Engineering, Korea University of Technology and Education)
Han, Youn-Hee (Department of Computer Engineering, Korea University of Technology and Education)
Publication Information
KIPS Transactions on Computer and Communication Systems, Vol.10, No.5, 2021, pp.155-162
Abstract
n-step TD learning combines the Monte Carlo method with one-step TD learning. With an appropriately chosen n, it is known to outperform both the Monte Carlo method and 1-step TD learning, but selecting the best value of n is difficult. To address this difficulty, this paper exploits two observations: overestimation of Q can improve performance in the early stages of learning, and all k-step returns take similar values when Q ≈ Q*. Based on these observations, we propose a new learning target composed of the maximum and the mean of all k-step returns for 1 ≤ k ≤ n. Finally, in OpenAI Gym's Atari game environment, we compare the proposed algorithm with n-step TD learning and show that the proposed algorithm outperforms the n-step TD learning algorithm.
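The abstract describes the proposed target only at a high level. Below is a minimal sketch of how such a max-mean multi-step target could be computed from one trajectory segment, assuming standard Q-learning-style k-step returns (k discounted rewards plus a bootstrapped max-Q value) and an illustrative equal weighting of the maximum and the mean; the function name, its arguments, and the 50/50 weighting are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def max_mean_nstep_target(rewards, bootstrap_values, gamma=0.99):
    """Hypothetical sketch of a max-mean multi-step TD target.

    rewards[i]          : reward r_{t+i} observed after the i-th step (i = 0..n-1)
    bootstrap_values[i] : max_a Q(s_{t+i+1}, a) for the state reached after step i
                          (e.g., from a target network)
    """
    n = len(rewards)
    k_step_returns = []
    for k in range(1, n + 1):
        # Discounted sum of the first k rewards ...
        g = sum(gamma ** i * rewards[i] for i in range(k))
        # ... plus the bootstrapped value of the state reached after k steps.
        g += gamma ** k * bootstrap_values[k - 1]
        k_step_returns.append(g)
    k_step_returns = np.array(k_step_returns)
    # Combine the maximum and the mean of all k-step returns, 1 <= k <= n.
    # The equal weighting here is an assumption; the paper defines the exact mix.
    return 0.5 * k_step_returns.max() + 0.5 * k_step_returns.mean()

# Example: a 3-step segment with rewards [1, 0, 1] and bootstrap values from Q.
print(max_mean_nstep_target([1.0, 0.0, 1.0], [0.5, 0.4, 0.6]))
```

The intent of such a target, as motivated in the abstract, is that the max term encourages helpful overestimation early in training, while near convergence (Q ≈ Q*) all k-step returns agree, so the max and the mean coincide and the target behaves like an ordinary multi-step return.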
Keywords
Reinforcement Learning; Q-learning; DQN; n-step Temporal-Difference Learning;
Citations & Related Records
연도 인용수 순위
  • Reference