
Implementation of the Agent using Universal On-line Q-learning by Balancing Exploration and Exploitation in Reinforcement Learning  

박찬건 (Department of Computer Science, Yonsei University)
양성봉 (Division of Computer and Industrial Engineering, Yonsei University)
Abstract
A shopbot is a software agent whose goal is to maximize the buyer's satisfaction by automatically gathering price and quality information about goods, as well as service information, from on-line sellers. In response to shopbots' activities, sellers on the Internet need agents, called pricebots, that can help them maximize their own profits. In this paper we adopt Q-learning, a model-free reinforcement learning method, as the price-setting algorithm of pricebots. A Q-learning agent increases profitability and eliminates cyclic price wars compared with agents that use the myoptimal (myopically optimal) pricing strategy. Q-learning must select a sequence of state-action pairs in order to converge. When state-action pairs are selected uniformly at random, the number of accesses to the Q-table required to obtain the optimal Q-values is quite large, so this method is not appropriate for universal on-line learning in a real-world environment. This is because uniform random selection performs pure exploration and does not exploit the information accumulated about the optimal policy. In this paper we propose a Mixed Nonstationary Policy (MNP), which combines an auxiliary Markov process with the original Markov process. MNP tries to keep a balance between exploration and exploitation in reinforcement learning. Our experimental results show that the Q-learning agent using MNP converges to the optimal Q-values about 2.6 times faster, on average, than uniform random selection.
Keywords
Reinforcement learning; Q-learning; Adaptive multi-agent systems; Agent economies; Shopbot; Pricebot
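
As a rough, hypothetical sketch of the idea in the abstract, the following tabular Q-learning agent sets prices with a mixed action-selection rule: with a fixed probability the action comes from an auxiliary uniform random process (exploration), and otherwise from the greedy policy on the current Q-table (exploitation). This only approximates the proposed MNP; all names, parameters, and the toy profit signal below are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

# Hypothetical sketch of tabular Q-learning for a price-setting agent.
# The mixing rule only approximates the paper's MNP idea: with
# probability MIX the action comes from an auxiliary random process
# (exploration), otherwise from the greedy policy on the current
# Q-table (exploitation). All names and parameters are illustrative.

ALPHA, GAMMA, MIX = 0.1, 0.9, 0.2  # learning rate, discount, mixing prob.

def select_action(q_table, state, actions, mix=MIX):
    """Mixed policy: auxiliary uniform process vs. greedy exploitation."""
    if random.random() < mix:
        return random.choice(actions)  # auxiliary (exploratory) process
    return max(actions, key=lambda a: q_table[(state, a)])  # greedy

def q_update(q_table, state, action, reward, next_state, actions):
    """Standard one-step Q-learning backup."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (
        reward + GAMMA * best_next - q_table[(state, action)]
    )

if __name__ == "__main__":
    prices = [0.5, 0.6, 0.7, 0.8, 0.9]  # discretized price actions
    q = defaultdict(float)
    state = 0.7                          # e.g. competitor's last price
    for _ in range(10_000):
        a = select_action(q, state, prices)
        # Toy profit signal; a real pricebot would observe market demand.
        reward = a * max(0.0, 1.0 - abs(a - 0.7))
        next_state = a
        q_update(q, state, a, reward, next_state, prices)
        state = next_state
    print(max(prices, key=lambda p: q[(state, p)]))
```

Decaying MIX over time would shift the agent from exploration toward exploitation, which is the kind of nonstationary mixing of two processes the abstract alludes to.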