
Implementation of the Agent using Universal On-line Q-learning by Balancing Exploration and Exploitation in Reinforcement Learning  

박찬건 (Department of Computer Science, Yonsei University)
양성봉 (Division of Computer and Industrial Engineering, Yonsei University)
Abstract
A shopbot is a software agent whose goal is to maximize the buyer's satisfaction by automatically gathering price and quality information about goods, as well as service information, from on-line sellers. In response to shopbots' activities, sellers on the Internet need agents, called pricebots, that can help them maximize their own profits. In this paper we adopt Q-learning, a model-free reinforcement learning method, as the price-setting algorithm of pricebots. A Q-learning agent increases profitability and eliminates cyclic price wars compared with agents that use the myoptimal (myopically optimal) pricing strategy. Q-learning must select a sequence of state-action pairs in order to converge. When state-action pairs are selected uniformly at random, the number of accesses to the Q-table required to obtain the optimal Q-values is quite large, so this method is not appropriate for universal on-line learning in a real-world environment. This is because uniform random selection performs pure exploration and does not exploit the information accumulated about the optimal policy. In this paper we propose a Mixed Nonstationary Policy (MNP), which combines an auxiliary Markov process with the original Markov process. MNP tries to keep a balance between exploration and exploitation in reinforcement learning. Our experimental results show that the Q-learning agent using MNP converges to the optimal Q-values about 2.6 times faster, on average, than uniform random selection.
Keywords
Reinforcement learning; Q-learning; Adaptive multi-agent systems; Agent economies; Shopbot; Pricebot
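
As a rough, hypothetical sketch of the idea in the abstract, the following tabular Q-learning agent sets prices with a mixed action-selection rule: with a fixed probability the action comes from an auxiliary uniform random process (exploration), and otherwise from the greedy policy on the current Q-table (exploitation). This only approximates the proposed MNP; all names, parameters, and the toy profit signal below are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

# Hypothetical sketch of tabular Q-learning for a price-setting agent.
# The mixing rule only approximates the paper's MNP idea: with
# probability MIX the action comes from an auxiliary random process
# (exploration), otherwise from the greedy policy on the current
# Q-table (exploitation). All names and parameters are illustrative.

ALPHA, GAMMA, MIX = 0.1, 0.9, 0.2  # learning rate, discount, mixing prob.

def select_action(q_table, state, actions, mix=MIX):
    """Mixed policy: auxiliary uniform process vs. greedy exploitation."""
    if random.random() < mix:
        return random.choice(actions)  # auxiliary (exploratory) process
    return max(actions, key=lambda a: q_table[(state, a)])  # greedy

def q_update(q_table, state, action, reward, next_state, actions):
    """Standard one-step Q-learning backup."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (
        reward + GAMMA * best_next - q_table[(state, action)]
    )

if __name__ == "__main__":
    prices = [0.5, 0.6, 0.7, 0.8, 0.9]  # discretized price actions
    q = defaultdict(float)
    state = 0.7                          # e.g. competitor's last price
    for _ in range(10_000):
        a = select_action(q, state, prices)
        # Toy profit signal; a real pricebot would observe market demand.
        reward = a * max(0.0, 1.0 - abs(a - 0.7))
        next_state = a
        q_update(q, state, a, reward, next_state, prices)
        state = next_state
    print(max(prices, key=lambda p: q[(state, p)]))
```

Decaying MIX over time would shift the agent from exploration toward exploitation, which is the kind of nonstationary mixing of two processes the abstract alludes to.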