[1] SUTTON R S , BARTO A G . Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 1998.
[2] SCHMIDHUBER J . On learning how to learn learning strategies[R]. Munich, Germany: Technische Universität München, 1995.
[3] AMMAR H B , EATON E , LUNA J M ,et al. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning[C]// The 24th International Joint Conference on Artificial Intelligence. 2015: 3345-3351.
[4] GUPTA A , DEVIN C , LIU Y X ,et al. Learning invariant feature spaces to transfer skills with reinforcement learning[C]// The 5th International Conference on Learning Representations. 2017: 2147-2153.
[5] LAROCHE R , BARLIER M . Transfer reinforcement learning with shared dynamics[C]// The 31st AAAI Conference on Artificial Intelligence. 2017: 2147-2153.
[6] BARRETO A , DABNEY W , MUNOS R ,et al. Successor features for transfer in reinforcement learning[C]// The 31st Conference on Neural Information Processing Systems. 2017: 4055-4065.
[7] DEARDEN R , FRIEDMAN N , RUSSELL S . Bayesian Q-learning[C]// The 15th National Conference on Artificial Intelligence. 1998: 761-768.
[8] GUEZ A , SILVER D , DAYAN P . Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search[J]. Journal of Artificial Intelligence Research, 2013,48(1): 841-883.
[9] LITTLE D Y , SOMMER F T . Learning and exploration in action-perception loops[J]. Frontiers in Neural Circuits, 2013,7(7): 37-56.
[10] MANSOUR Y , SLIVKINS A , SYRGKANIS V . Bayesian incentive-compatible bandit exploration[C]// The 16th ACM Conference on Economics and Computation. 2015: 565-582.
[11] VIEN N A , LEE S G , CHUNG T C . Bayes-adaptive hierarchical MDPs[J]. Applied Intelligence, 2016,45(1): 112-126.
[12] WU B , FENG Y . Monte-Carlo Bayesian reinforcement learning using a compact factored representation[C]// The 4th International Conference on Information Science and Control Engineering. 2017: 466-469.
[13] FU Q M , LIU Q , FU Y C ,et al. A policy iteration algorithm with parametric approximation based on Gaussian processes[J]. Journal of Software, 2013,24(11): 2676-2687. (in Chinese)
[14] GIVAN R , DEAN T , GREIG M . Equivalence notions and model minimization in Markov decision processes[J]. Artificial Intelligence, 2003,147(1): 163-223.
[15] FERNS N , PANANGADEN P , PRECUP D . Metrics for finite Markov decision processes[C]// The 20th Conference on Uncertainty in Artificial Intelligence. 2004: 162-169.
[16] BEAL M J . Variational algorithms for approximate Bayesian inference[D]. London: University College London, 2003.
[17] FU Q M , LIU Q , YOU S H ,et al. A novel fast Sarsa algorithm based on value function transfer[J]. Acta Electronica Sinica, 2014,42(11): 2157-2161. (in Chinese)
[18] WIERING M A , VAN HASSELT H . The QV family compared to other reinforcement learning algorithms[C]// IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. 2008: 101-108.
[19] CHUNG J J , LAWRANCE N R J , SUKKARIEH S . Gaussian processes for informative exploration in reinforcement learning[C]// IEEE International Conference on Robotics and Automation. 2013: 2633-2639.