[1] SUTTON R S , BARTO A G . Reinforcement learning: an introduction[M]. Cambridge, Massachusetts: MIT Press, 1998.
[2] BUSONIU L , BABUSKA R , SCHUTTER B D ,et al. Reinforcement learning and dynamic programming using function approximators[M]. Florida: CRC Press, 2010.
[3] LEE D , SEO H , JUNG M W . Neural basis of reinforcement learning and decision making[J]. Annual Review of Neuroscience, 2012,35: 287-308.
[4] WIERING M , VAN OTTERLO M . Reinforcement learning: state-of-the-art[M]. Berlin Heidelberg: Springer, 2014.
[5] SUTTON R S , MCALLESTER D A , SINGH S P ,et al. Policy gradient methods for reinforcement learning with function approximation[C]// Advances in Neural Information Processing Systems. 1999: 1057-1063.
[6] PETERS J , SCHAAL S . Natural actor-critic[J]. Neurocomputing, 2008,71(7-9): 1180-1190.
[7] PETERS J , VIJAYAKUMAR S , SCHAAL S . Reinforcement learning for humanoid robotics[J]. Autonomous Robots, 2003,12(1): 1-20.
[8] VAN HASSELT H . Reinforcement learning in continuous state and action spaces[M]// Reinforcement Learning. Berlin Heidelberg: Springer, 2012: 207-251.
[9] WIERSTRA D , SCHAUL T , PETERS J ,et al. Natural evolution strategies[C]// 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence). 2008: 3381-3387.
[10] SUN Y , WIERSTRA D , SCHAUL T ,et al. Efficient natural evolution strategies[C]// The 11th Annual Conference on Genetic and Evolutionary Computation. 2009: 539-546.
[11] RUBINSTEIN R Y , KROESE D P . The cross-entropy method[M]. New York: Springer, 2004.
[12] BOTEV Z I , KROESE D P , RUBINSTEIN R Y ,et al. The cross-entropy method for optimization[M]// Machine Learning: Theory and Applications. Chennai: Elsevier, 2013,31: 35-59.
[13] MARTIN H J A , DE LOPE J . Ex<α>: an effective algorithm for continuous actions reinforcement learning problems[C]// The 35th Annual Conference of the IEEE Industrial Electronics Society (IECON). 2009: 2063-2068.
[14] LILLICRAP T P , HUNT J J , PRITZEL A ,et al. Continuous control with deep reinforcement learning[J]. arXiv preprint arXiv:1509.02971, 2015.
[15] GU S , LILLICRAP T , SUTSKEVER I ,et al. Continuous deep Q-learning with model-based acceleration[J]. arXiv preprint arXiv:1603.00748, 2016.
[16] KHAMASSI M , TZAFESTAS C . Active exploration in parameterized reinforcement learning[J]. arXiv preprint arXiv:1610, 2016.
[17] BHATNAGAR S , GHAVAMZADEH M , LEE M ,et al. Incremental natural actor-critic algorithms[C]// Advances in Neural Information Processing Systems. 2007: 105-112.
[18] KONDA V R , TSITSIKLIS J N . Actor-critic algorithms[J]. SIAM Journal on Control & Optimization, 2001,42(4): 1008-1014.
[19] BERENJI H R , KHEDKAR P . Learning and tuning fuzzy logic controllers through reinforcements[J]. IEEE Transactions on Neural Networks, 1992,3(5): 724-740.
[20] SINGH S P , SUTTON R S . Reinforcement learning with replacing eligibility traces[J]. Machine Learning, 1996,22(1-3): 123-158.
[21] SUTTON R S . Generalization in reinforcement learning: successful examples using sparse coarse coding[C]// Advances in Neural Information Processing Systems. 1996: 1038-1044.