Max-Mean N-step Temporal-Difference Learning Using Multi-Step Return
Hwang, Gyu-Young (Department of Computer Science and Engineering, Future Convergence Engineering Major, Korea University of Technology and Education)
Kim, Ju-Bong (Department of Computer Science and Engineering, Future Convergence Engineering Major, Korea University of Technology and Education)
Heo, Joo-Seong (Department of Computer Science and Engineering, Future Convergence Engineering Major, Korea University of Technology and Education)
Han, Youn-Hee (Department of Computer Science and Engineering, Korea University of Technology and Education)
[1] S. L. Chen, H. Z. Wu, X. L. Han, and L. Xiao, "Multi-Step Truncated Q Learning Algorithm," In 2005 International Conference on Machine Learning and Cybernetics, Vol.1, pp.194-198, 2005.
[2] J. Hernandez-Garcia and R. S. Sutton, "Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target," arXiv:1901.07510, 2019. Presented at the NIPS Deep Learning Workshop 2018.
[3] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction," 2nd Edn., The MIT Press, 2018.
[4] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling Network Architectures for Deep Reinforcement Learning," In Proceedings of the 33rd International Conference on Machine Learning, PMLR 48:1995-2003, 2016.
[5] OpenAI, "OpenAI Gym Docs" [Internet], https://gym.openai.com/docs/, Accessed: 2020-11-20.
[6] K. De Asis and R. S. Sutton, "Per-decision Multi-step Temporal Difference Learning with Control Variates," arXiv:1807.01830, 2018.
[7] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, Vol.3, No.1, pp.9-44, 1988.
[8] H. van Seijen and R. S. Sutton, "True Online TD(λ)," In Proceedings of the 31st International Conference on Machine Learning, PMLR 32:692-700, Beijing, China, 2014.
[9] K. De Asis, J. Hernandez-Garcia, G. Holland, et al., "Multi-Step Reinforcement Learning: A Unifying Algorithm," In Association for the Advancement of Artificial Intelligence, 2018.
[10] A. R. Mahmood, H. Yu, and R. S. Sutton, "Multi-step Off-policy Learning without Importance Sampling Ratios," arXiv:1702.03006, 2017.
[11] L. Yang, M. Shi, Q. Zheng, W. Meng, and G. Pan, "A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning," In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp.2984-2990, 2018.
[12] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, Vol.8, No.3, pp.279-292, 1992.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv:1312.5602, 2013. Presented at the NIPS Deep Learning Workshop 2013.
[14] V. Mnih, et al., "Human-level Control through Deep Reinforcement Learning," Nature, Vol.518, pp.529-533, 2015.
[15] S. Thrun and A. Schwartz, "Issues in Using Function Approximation for Reinforcement Learning," In Proceedings of the Fourth Connectionist Models Summer School, Erlbaum, 1993.
[16] H. van Hasselt, "Double Q-learning," In Advances in Neural Information Processing Systems, Vol.23, pp.2613-2621, 2010.
[17] H. van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning with Double Q-learning," In Association for the Advancement of Artificial Intelligence, 2016.
[18] M. Hessel, et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning," In Association for the Advancement of Artificial Intelligence, pp.3215-3222, AAAI Press, 2018.
[19] C. J. C. H. Watkins, "Learning from Delayed Rewards," Doctoral Dissertation, University of Cambridge, 1989.
[20] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," arXiv:1511.05952, 2015. Published at ICLR 2016.
[21] J. Peng and R. J. Williams, "Incremental Multi-Step Q-Learning," Machine Learning, Vol.22, No.1, pp.283-290, 1996.
[22] S. J. Bradtke and M. O. Duff, "Reinforcement Learning Methods for Continuous-Time Markov Decision Problems," In Proceedings of the 7th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, pp.393-400, 1994.
[23] M. J. Kearns and S. P. Singh, "Bias-Variance Error Bounds for Temporal Difference Updates," In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pp.142-147, San Francisco, CA, USA, 2000.
[24] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver, "Distributed Prioritized Experience Replay," arXiv:1803.00933, 2018. Published at ICLR 2018.
[25] Q. Lan, Y. Pan, A. Fyshe, and M. White, "Maxmin Q-learning: Controlling the Estimation Bias of Q-learning," arXiv:2002.06487, 2020. Published at ICLR 2020.