Bayesian Q learning method with Dyna architecture and prioritized sweeping

Journal on Communications ›› 2013, Vol. 34 ›› Issue (11): 129-139. doi: 10.3969/j.issn.1000-436x.2013.11.015

Jun YU¹, Quan LIU¹,², Qi-ming FU¹, Hong-kun SUN¹, Gui-xing CHEN¹

¹ School of Computer Science and Technology, Soochow University, Suzhou 215006, China
² Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

Online: 2013-11-25 Published: 2017-06-23
Supported by:
The National Natural Science Foundation of China (four grants); The Natural Science Foundation of Jiangsu Province; The Natural Science Research Foundation of Jiangsu Higher Education Institutions (two grants); The Foundation of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

Abstract:

Bayesian Q-learning uses probability distributions to describe the uncertainty of Q values and selects actions according to these distributions in order to balance exploration and exploitation. However, Bayesian Q-learning suffers from slow convergence and low convergence accuracy. To address these problems, a novel Bayesian Q-learning algorithm with Dyna architecture and prioritized sweeping, called Dyna-PS-BayesQL, is proposed. The algorithm has two parts. In the learning part, it models the state transition function and the reward function of the environment from collected samples and updates the parameters of the action-value function by Bayesian Q-learning. In the planning part, it uses prioritized sweeping and dynamic programming on the constructed model to perform planning updates of the action-value function, which makes better use of historical experience and thereby improves convergence speed and convergence accuracy. Dyna-PS-BayesQL is applied to the chain problem and the maze navigation problem. The experimental results show that the proposed algorithm balances exploration and exploitation well during learning and achieves better convergence speed and convergence accuracy.

Key words: reinforcement learning, Markov decision process, prioritized sweeping, Dyna architecture, Bayesian Q learning
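
The two-part structure described in the abstract (a learning part that builds a model, and a planning part that replays the model with prioritized sweeping) can be illustrated with a minimal tabular sketch. This is not the authors' Dyna-PS-BayesQL: it assumes a deterministic model and plain point-estimate Q values with a dynamic-programming backup, in place of the Bayesian Q-value distributions used in the paper. All names (`dyna_ps_step`, `greedy_value`), the constants `GAMMA` and `THETA`, and the 4-state chain in the demo are hypothetical choices made for illustration.

```python
import heapq
from collections import defaultdict

GAMMA = 0.9   # discount factor (assumed value, not from the paper)
THETA = 1e-4  # minimum priority for queueing a backup (assumed value)


def greedy_value(Q, state, actions):
    """Largest action value currently estimated for `state`."""
    return max(Q[(state, a)] for a in actions)


def dyna_ps_step(Q, model, predecessors, queue, actions,
                 s, a, r, s_next, n_planning=10):
    """Process one real transition, then run prioritized planning backups."""
    # Learning part: store the observed transition in a deterministic model.
    model[(s, a)] = (r, s_next)
    predecessors[s_next].add((s, a))

    # Priority is the magnitude of the one-step Bellman error.
    priority = abs(r + GAMMA * greedy_value(Q, s_next, actions) - Q[(s, a)])
    if priority > THETA:
        heapq.heappush(queue, (-priority, s, a))  # heapq is a min-heap

    # Planning part: back up the highest-priority pairs using the model.
    for _ in range(n_planning):
        if not queue:
            break
        _, ps, pa = heapq.heappop(queue)
        pr, pnext = model[(ps, pa)]
        Q[(ps, pa)] = pr + GAMMA * greedy_value(Q, pnext, actions)
        # Re-prioritize every known pair that leads into the updated state.
        for qs, qa in predecessors[ps]:
            qr, _ = model[(qs, qa)]
            p = abs(qr + GAMMA * greedy_value(Q, ps, actions) - Q[(qs, qa)])
            if p > THETA:
                heapq.heappush(queue, (-p, qs, qa))


# Demo on a hypothetical chain 0 -> 1 -> 2 -> 3 with reward 1 on reaching 3.
actions = [0, 1]                 # 0 = move forward, 1 = unused in this demo
Q = defaultdict(float)           # action values, initialised to 0
model = {}                       # (state, action) -> (reward, next_state)
predecessors = defaultdict(set)  # state -> set of (state, action) leading to it
queue = []                       # priority queue of pending backups

for s, a, r, s_next in [(0, 0, 0.0, 1), (1, 0, 0.0, 2), (2, 0, 1.0, 3)]:
    dyna_ps_step(Q, model, predecessors, queue, actions, s, a, r, s_next)

print({k: round(v, 3) for k, v in Q.items() if k[1] == 0 and k[0] < 3})
```

In this sketch the first rewarded transition triggers planning backups that carry the reward back along the whole chain in a single step, which is the kind of reuse of past experience that the planning part is meant to provide.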

Jun YU, Quan LIU, Qi-ming FU, Hong-kun SUN, Gui-xing CHEN. Bayesian Q learning method with Dyna architecture and prioritized sweeping[J]. Journal on Communications, 2013, 34(11): 129-139.

Figures/Tables: 11 (Figures 1 to 8, Tables 1 to 3)


Chain problem	Steps: 500	Steps: 1 000
QL semi-uniform	783	1 582
QL Boltzmann	829.8	1 692.8
Bayes-VPI-Mom	905.8	1 973.5
Dyna-PS-BayesQL	1 603.5	3 209.7

Maze navigation problem	Steps: 10 000	Steps: 20 000
QL semi-uniform	72	549
QL Boltzmann	56	148
Bayes-VPI-Mom	69	160
Dyna-PS-BayesQL	334	1 246

Parameters	Average number of times
(1,2,2,20)	50.5
(1,2,2,50)	55.9
(1,2,2,200)	69.1
