Journal on Communications ›› 2013, Vol. 34 ›› Issue (11): 129-139. doi: 10.3969/j.issn.1000-436x.2013.11.015

• Academic paper •

Bayesian Q learning method with Dyna architecture and prioritized sweeping

Jun YU1,Quan LIU1,2,Qi-ming FU1,Hong-kun SUN1,Gui-xing CHEN1   

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
    2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Online: 2013-11-25; Published: 2017-06-23
  • Supported by:
    The National Natural Science Foundation of China; The Natural Science Foundation of Jiangsu Province; The Natural Science Foundation of the Higher Education Institutions of Jiangsu Province; The Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

Abstract:

In Bayesian Q-learning, a probability distribution is maintained over the Q values to describe their uncertainty, and actions are selected according to this distribution, which balances the trade-off between exploration and exploitation. However, slow convergence remains a major problem for Bayesian Q-learning. To address this problem, a novel Bayesian Q-learning algorithm with the Dyna architecture and prioritized sweeping, called Dyna-PS-BayesQL, was proposed. The algorithm consists of two parts. In the learning part, it models the transition function and the reward function from the collected samples and updates the Q-value function by Bayesian Q-learning. In the planning part, it updates the Q-value function by prioritized sweeping and dynamic programming over the constructed model, which improves the use of historical information. Dyna-PS-BayesQL was applied to the chain problem and a maze navigation problem. The results show that the proposed algorithm balances exploration and exploitation well during learning and achieves better convergence performance.
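The two-part structure described in the abstract can be illustrated with a short sketch. The Python code below is a reconstruction under assumptions, not the authors' implementation: the Q-value posterior is approximated by a simple per-(s, a) Gaussian belief with Thompson-style sampling for action selection, standing in for the full Bayesian Q-learning update, and the class name DynaPSBayesQLSketch and all parameter values are hypothetical.

```python
# Illustrative sketch of the Dyna-PS-BayesQL structure described in the abstract.
# NOT the authors' exact algorithm: the Bayesian part is a simple Gaussian
# approximation per (s, a) with Q-value sampling for action selection.
import heapq
import random
from collections import defaultdict

class DynaPSBayesQLSketch:
    def __init__(self, n_actions, gamma=0.95, theta=1e-3, planning_steps=10):
        self.n_actions = n_actions
        self.gamma = gamma                 # discount factor
        self.theta = theta                 # priority threshold for sweeping
        self.planning_steps = planning_steps
        # Gaussian belief over each Q(s, a): mean and precision (assumed form).
        self.mu = defaultdict(float)
        self.tau = defaultdict(lambda: 1.0)
        # Learned model: transition counts and accumulated rewards.
        self.trans = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rew_sum = defaultdict(float)                   # (s, a) -> total reward
        self.visits = defaultdict(int)                      # (s, a) -> visit count
        self.pred = defaultdict(set)                        # s' -> {(s, a)} predecessors
        self.pqueue = []                                    # max-priority via negated keys

    def select_action(self, s):
        # Thompson-style exploration: sample one Q value per action from its posterior.
        samples = [random.gauss(self.mu[(s, a)], self.tau[(s, a)] ** -0.5)
                   for a in range(self.n_actions)]
        return max(range(self.n_actions), key=lambda a: samples[a])

    def _backup_target(self, s, a):
        # One-step expected backup under the learned model.
        n = self.visits[(s, a)]
        if n == 0:
            return self.mu[(s, a)]
        r_hat = self.rew_sum[(s, a)] / n
        exp_next = sum(c / n * max(self.mu[(s2, b)] for b in range(self.n_actions))
                       for s2, c in self.trans[(s, a)].items())
        return r_hat + self.gamma * exp_next

    def _bayes_update(self, s, a, target):
        # Precision-weighted update of the Gaussian belief (illustrative only).
        obs_tau = 1.0
        t = self.tau[(s, a)] + obs_tau
        self.mu[(s, a)] = (self.tau[(s, a)] * self.mu[(s, a)] + obs_tau * target) / t
        self.tau[(s, a)] = t

    def step(self, s, a, r, s2):
        # Learning part: update the model from the real sample, then the Q belief.
        self.visits[(s, a)] += 1
        self.trans[(s, a)][s2] += 1
        self.rew_sum[(s, a)] += r
        self.pred[s2].add((s, a))
        priority = abs(self._backup_target(s, a) - self.mu[(s, a)])
        if priority > self.theta:
            heapq.heappush(self.pqueue, (-priority, (s, a)))
        # Planning part: prioritized sweeping over the learned model.
        for _ in range(self.planning_steps):
            if not self.pqueue:
                break
            _, (ps, pa) = heapq.heappop(self.pqueue)
            self._bayes_update(ps, pa, self._backup_target(ps, pa))
            # Push predecessors whose backup error now exceeds the threshold.
            for (qs, qa) in self.pred[ps]:
                p = abs(self._backup_target(qs, qa) - self.mu[(qs, qa)])
                if p > self.theta:
                    heapq.heappush(self.pqueue, (-p, (qs, qa)))
```

In this sketch, real experience drives both the model estimate and the priority queue, while the planning loop propagates value changes backwards through predecessor state-action pairs, which is the efficiency gain from historical information that the abstract attributes to prioritized sweeping.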

Key words: reinforcement learning, Markov decision process, prioritized sweeping, Dyna architecture, Bayesian Q learning
