Advantage estimator based on importance sampling

Journal on Communications, 2019, Vol. 40, Issue (5): 108-116. doi: 10.11959/j.issn.1000-436x.2019122

Quan LIU¹,²,³,⁴, Yubin JIANG¹, Zhihui HU¹

¹ School of Computer Science and Technology, Soochow University, Suzhou 215006, China
² Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China
³ Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
⁴ Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China

Revised: 2019-04-25 Online: 2019-05-25 Published: 2019-05-30
About the authors: Quan LIU (1969- ), male, born in Yakeshi, Inner Mongolia, Ph.D., is a professor and doctoral supervisor at Soochow University. His main research interests are intelligent information processing, automated reasoning, and machine learning. | Yubin JIANG (1994- ), male, born in Yancheng, Jiangsu, is a master's student at Soochow University. His main research interests are reinforcement learning and deep reinforcement learning. | Zhihui HU (1994- ), female, born in Xuzhou, Jiangsu, is a master's student at Soochow University. Her main research interests are reinforcement learning and deep reinforcement learning.
Supported by:
The National Natural Science Foundation of China (61772355, 61702055, 61472262, 61502323, 61502329); Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Industrial Application of Basic Research Program Part (SYG201422)

Abstract:

In continuous-action tasks, deep reinforcement learning usually uses a Gaussian distribution as the policy function. To address the slow convergence caused by clipping the actions sampled from a Gaussian policy, an importance sampling advantage estimator (ISAE) is proposed. Building on the generalized advantage estimator (GAE), ISAE introduces an importance sampling mechanism: it computes the ratio between the target policy and the behavior policy for boundary actions and uses this ratio to correct the value-function bias introduced by clipped actions, which speeds up convergence. In addition, ISAE introduces a parameter L that limits the range of the importance sampling ratio, which improves the reliability of the samples and keeps the network parameters stable. To verify its effectiveness, ISAE is combined with proximal policy optimization (PPO) and compared with other algorithms on the MuJoCo platform. The experimental results show that ISAE converges faster.

Key words: reinforcement learning, importance sampling, deep reinforcement learning, advantage function
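To make the idea of weighting GAE with a bounded importance ratio concrete, the following is a minimal sketch, not the estimator defined in the paper. The abstract does not give the ISAE formula, so where the ratio enters the recursion, the function name `weighted_gae`, and the parameters `rho` and `rho_range` are all assumptions made for illustration. With `rho=None` the function reduces to standard GAE.

```python
import numpy as np


def weighted_gae(rewards, values, gamma=0.99, lam=0.95, rho=None, rho_range=None):
    """Generalized advantage estimation with optional per-step weights.

    rewards   : length-T sequence of rewards.
    values    : length-(T+1) sequence of state values (last entry bootstraps).
    rho       : optional length-T importance ratios (target / behavior policy).
                ASSUMPTION: multiplying each step of the recursion by rho is
                an illustrative choice, not the paper's stated formula.
    rho_range : optional (low, high) bounds applied to rho before use.
                ASSUMPTION: how the paper derives bounds from L is not given.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = len(rewards)

    weights = np.ones(horizon) if rho is None else np.asarray(rho, dtype=float)
    if rho_range is not None:
        weights = np.clip(weights, rho_range[0], rho_range[1])

    # One-step temporal-difference residuals.
    deltas = rewards + gamma * values[1:] - values[:-1]

    advantages = np.zeros(horizon)
    running = 0.0
    # Backward recursion: A_t = w_t * (delta_t + gamma * lam * A_{t+1}).
    for t in reversed(range(horizon)):
        running = weights[t] * (deltas[t] + gamma * lam * running)
        advantages[t] = running
    return advantages


# Standard GAE (no weights) versus a bounded-ratio variant.
# The range (0.9, 1 / 0.9) is purely illustrative.
plain = weighted_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
bounded = weighted_gae(
    [1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0],
    gamma=1.0, lam=1.0, rho=[1.0, 0.5, 1.0], rho_range=(0.9, 1.0 / 0.9),
)
```

Bounding the ratio keeps a single extreme sample from dominating the advantage estimate, which is the stabilizing effect the abstract attributes to the L parameter.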

CLC number: TP391

Quan LIU, Yubin JIANG, Zhihui HU. Advantage estimator based on importance sampling[J]. Journal on Communications, 2019, 40(5): 108-116.

Figures and tables (7)

Figure 1

Figure 2

Table 1

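State and action spaces of the 8 tasks on the MuJoCo platform

Task	State space	Action space
Ant	$\mathbb{R}^{111}$	$[-1.0, 1.0]^{8}$
Hopper	$\mathbb{R}^{11}$	$[-1.0, 1.0]^{3}$
HalfCheetah	$\mathbb{R}^{17}$	$[-1.0, 1.0]^{6}$
InvertedDoublePendulum	$\mathbb{R}^{11}$	$[-1.0, 1.0]^{1}$
InvertedPendulum	$\mathbb{R}^{4}$	$[-3.0, 3.0]^{1}$
Reacher	$\mathbb{R}^{11}$	$[-1.0, 1.0]^{2}$
Swimmer	$\mathbb{R}^{8}$	$[-1.0, 1.0]^{2}$
Walker2d	$\mathbb{R}^{17}$	$[-1.0, 1.0]^{6}$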
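Every action space in Table 1 is a bounded box, while a Gaussian policy has unbounded support, so any sampled action outside the box has to be clipped to the boundary. The sketch below shows how much probability mass a one-dimensional Gaussian places beyond such bounds. The helper names `gaussian_cdf`, `boundary_mass`, and `clip_action`, and the mean and standard deviation used, are illustrative assumptions and do not come from the paper.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def boundary_mass(mean, std, low=-1.0, high=1.0):
    """Probability that a sample falls below `low` or above `high`.

    After clipping, these two masses collapse onto the boundary actions
    `low` and `high` respectively.
    """
    below = gaussian_cdf(low, mean, std)
    above = 1.0 - gaussian_cdf(high, mean, std)
    return below, above


def clip_action(action, low=-1.0, high=1.0):
    """Clip a scalar action into the interval [low, high]."""
    return max(low, min(high, action))


# Hypothetical policy output: mean 0 and standard deviation 1 on [-1, 1].
below, above = boundary_mass(mean=0.0, std=1.0)
clipped = clip_action(1.7)
```

With a standard deviation comparable to the width of the action range, a substantial share of samples ends up on the boundary, which is the situation the abstract identifies as slowing convergence.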

Figure 3

Figure 4

Figure 5

Table 2


Algorithm	Ant	Hopper	HalfCheetah	InvertedDoublePendulum	InvertedPendulum	Reacher	Swimmer	Walker2d
PPO-GAE	240.90	1725.07	1535.71	4807.24	646.94	-6.98	104.24	2345.71
PPO-CAPG	286.56	1962.92	1569.24	5185.03	687.78	-7.29	103.80	2693.94
PPO-ISAE (L=0.90)	387.74	2288.21	1777.03	6040.82	857.58	-8.15	96.48	3105.64
PPO-ISAE (L=0.95)	405.51	2104.51	1875.05	5696.81	864.05	-7.24	100.48	2924.72

