Max-Mean N-step Temporal-Difference Learning Using Multi-Step Return

  • 황규영 (Major in Future Convergence Engineering, Department of Computer Science and Engineering, Korea University of Technology and Education) ;
  • 김주봉 (Major in Future Convergence Engineering, Department of Computer Science and Engineering, Korea University of Technology and Education) ;
  • 허주성 (Major in Future Convergence Engineering, Department of Computer Science and Engineering, Korea University of Technology and Education) ;
  • 한연희 (Department of Computer Science and Engineering, Korea University of Technology and Education)
  • Received : 2020.12.10
  • Accepted : 2021.02.20
  • Published : 2021.05.31

Abstract

n-step TD learning combines the Monte Carlo method with one-step TD learning. When an appropriate n is chosen, n-step TD learning is known to outperform both the Monte Carlo method and one-step TD learning, but selecting the best value of n is difficult. To resolve this difficulty, this paper exploits two properties: overestimation of Q can improve performance in the early stages of learning, and all k-step returns take similar values when Q ≈ Q*. Using these properties, we propose a new learning target, the Ω-return, composed of the maximum and the mean of all k-step returns for 1 ≤ k ≤ n. Finally, in OpenAI Gym's Atari game environments, we compare the proposed algorithm with n-step TD learning and show that it outperforms the n-step TD learning algorithm.

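As an illustration of how such a target can be formed, the Python sketch below computes all k-step returns of an n-step transition segment and combines their maximum and mean. It is only a sketch of the idea stated in the abstract, not the authors' implementation: the function names, the per-step bootstrap values bootstrap_q, and the mixing weight beta are assumptions, since the abstract says only that the target is composed of the maximum and the mean of the k-step returns for 1 ≤ k ≤ n.

import numpy as np

def k_step_returns(rewards, bootstrap_q, gamma=0.99):
    """Compute G_k for k = 1..n, where G_k is the k-step return bootstrapped
    with the estimated value of the state reached after k steps:
    G_k = sum_{i=1..k} gamma^(i-1) * r_i + gamma^k * Q_k."""
    n = len(rewards)
    returns = np.empty(n)
    discounted_reward_sum = 0.0
    for k in range(1, n + 1):
        discounted_reward_sum += (gamma ** (k - 1)) * rewards[k - 1]
        returns[k - 1] = discounted_reward_sum + (gamma ** k) * bootstrap_q[k - 1]
    return returns

def max_mean_target(rewards, bootstrap_q, gamma=0.99, beta=0.5):
    """Hypothetical max-mean target: a convex combination of the maximum and
    the mean of all k-step returns. beta is an assumed mixing weight; the
    abstract does not specify how the maximum and the mean are combined."""
    g = k_step_returns(rewards, bootstrap_q, gamma)
    return beta * g.max() + (1.0 - beta) * g.mean()

# Example: a 5-step segment with its rewards and bootstrap Q-value estimates
# (e.g., max_a Q(s_{t+k}, a) from a target network).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
bootstrap_q = [0.8, 0.9, 1.1, 0.7, 1.0]
print(max_mean_target(rewards, bootstrap_q))

In an agent, each k-step return would typically bootstrap from a target network's Q-value at the state reached after k steps, as in n-step DQN; taking the maximum biases the target upward early in training, while the mean keeps it close to the ordinary k-step returns once Q ≈ Q*.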

Keywords

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2018R1A6A1A03025526 and No. NRF-2020R1I1A3065610).
