Bayesian Q learning method with Dyna architecture and prioritized sweeping

Journal on Communications

Yu Jun1, Liu Quan1,2, Fu Qiming1, Sun Hongkun1, Chen Guixing1

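1. School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China; 2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, Jilin, China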

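Publication date: 2013-11-25 Release date: 2013-11-15
Supported by:
The National Natural Science Foundation of China (61070223, 61103045, 61070122, 61272005); the Natural Science Foundation of Jiangsu Province (BK2012616); the Natural Science Research Foundation of Jiangsu Higher Education Institutions (09KJA520002, 09KJB520012); the Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172012K04)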


Abstract

Abstract: Bayesian Q-learning uses probability distributions to describe the uncertainty of Q values and selects actions according to those distributions, so as to balance exploration and exploitation. However, Bayesian Q-learning converges slowly and with low precision. To address these problems, a Bayesian Q-learning method with Dyna architecture and prioritized sweeping, called Dyna-PS-BayesQL, is proposed. The method consists of two parts. In the learning part, it models the state transition function and the reward function of the environment and updates the parameters of the action-value function by Bayesian Q-learning. In the planning part, it applies prioritized sweeping and dynamic programming to the learned model to perform planning updates of the action-value function, which makes better use of historical experience and thereby improves convergence speed and precision. Dyna-PS-BayesQL is applied to the chain problem and the maze navigation problem. The experimental results show that the method balances exploration and exploitation well and achieves better convergence speed and precision.
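
The two-part structure described in the abstract (a learning part that records a model from real samples, and a planning part that replays the model in priority order) can be illustrated with a minimal Python sketch. This is not the authors' Dyna-PS-BayesQL: the abstract does not give the Bayesian update, so an ordinary temporal-difference update stands in for the Bayesian parameter update, and epsilon-greedy selection stands in for action selection from the Q-value distribution. The model is assumed deterministic, and every name here (`dyna_ps`, `chain_step`, `theta`, `n_planning`, the 5-state chain and its rewards) is a hypothetical choice made for illustration.

```python
import heapq
import random
from collections import defaultdict


def chain_step(s, a):
    """Hypothetical deterministic 5-state chain: returns (reward, next_state)."""
    if a == 1:                 # action 1: go back to the start for a small reward
        return 2.0, 0
    if s == 4:                 # action 0 at the last state: large reward, stay
        return 10.0, 4
    return 0.0, s + 1          # action 0 elsewhere: advance with no reward


def dyna_ps(env_step, n_states, n_actions, start, episodes=20, horizon=50,
            alpha=0.5, gamma=0.9, theta=1e-4, n_planning=10, epsilon=0.2,
            seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}                        # (s, a) -> (reward, next_state)
    predecessors = defaultdict(set)   # next_state -> {(s, a)} seen leading to it
    pqueue = []                       # min-heap on negated priority

    def td_error(s, a, r, s2):
        return r + gamma * max(Q[s2]) - Q[s][a]

    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # Learning part: act, record the sample in the model, and queue
            # the pair if its error is large. (Assumption: epsilon-greedy
            # replaces the Bayesian action selection.)
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            r, s2 = env_step(s, a)
            model[(s, a)] = (r, s2)
            predecessors[s2].add((s, a))
            p = abs(td_error(s, a, r, s2))
            if p > theta:
                heapq.heappush(pqueue, (-p, s, a))

            # Planning part: update the highest-priority pairs from the model,
            # then re-check every pair known to lead into the updated state.
            for _ in range(n_planning):
                if not pqueue:
                    break
                _, ps, pa = heapq.heappop(pqueue)
                pr, ps2 = model[(ps, pa)]
                Q[ps][pa] += alpha * td_error(ps, pa, pr, ps2)
                for qs, qa in predecessors[ps]:
                    qr, _ = model[(qs, qa)]
                    qp = abs(td_error(qs, qa, qr, ps))
                    if qp > theta:
                        heapq.heappush(pqueue, (-qp, qs, qa))
            s = s2
    return Q


if __name__ == "__main__":
    for row in dyna_ps(chain_step, n_states=5, n_actions=2, start=0):
        print([round(q, 2) for q in row])
```

The priority queue is what lets a single informative sample (here, the reward at the end of the chain) propagate backward through earlier states in a few planning steps instead of waiting for many further real interactions, which is the kind of reuse of historical experience the abstract credits for faster convergence.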

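Yu Jun, Liu Quan, Fu Qiming, Sun Hongkun, Chen Guixing. Bayesian Q learning method with Dyna architecture and prioritized sweeping[J]. Journal on Communications.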