Journal on Communications ›› 2019, Vol. 40 ›› Issue (5): 108-116. doi: 10.11959/j.issn.1000-436x.2019122


Advantage estimator based on importance sampling

Quan LIU1,2,3,4, Yubin JIANG1, Zhihui HU1

  1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China
  • Revised: 2019-04-25  Online: 2019-05-25  Published: 2019-05-30
  • Supported by:
    The National Natural Science Foundation of China (61772355); The National Natural Science Foundation of China (61702055); The National Natural Science Foundation of China (61472262); The National Natural Science Foundation of China (61502323); The National Natural Science Foundation of China (61502329); Jiangsu Province Natural Science Research University Major Projects (18KJA520011); Jiangsu Province Natural Science Research University Major Projects (17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K18); Suzhou Industrial Application of Basic Research Program Part (SYG201422)

Abstract:

In continuous action tasks, deep reinforcement learning usually uses a Gaussian distribution as the policy function. Aiming at the problem that convergence slows down when actions sampled from the Gaussian policy are clipped to the legal action range, an importance sampling advantage estimator (ISAE) was proposed. Based on the generalized advantage estimator, an importance sampling mechanism was introduced by computing the ratio of the target policy to the behavior policy for boundary actions, which improves the convergence speed of the algorithm and corrects the bias of the value function caused by action clipping. In addition, the L parameter was introduced by ISAE, which improves the reliability of samples and keeps the network parameters stable by limiting the range of the importance sampling ratio. To verify the effectiveness of ISAE, it was applied to proximal policy optimization (PPO) and compared with other algorithms on the MuJoCo platform. Experimental results show that ISAE achieves a faster convergence rate.
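
The abstract describes ISAE as a generalized advantage estimator whose terms are weighted by a clipped importance sampling ratio between the target policy and the behavior policy, with the L parameter bounding that ratio. The Python sketch below illustrates one way such an estimator could be computed; the function name compute_isae, the placement of the ratio inside the GAE recursion, and the clipping interval [1/L, L] are assumptions made for illustration and are not taken from the paper.

    import numpy as np

    def compute_isae(rewards, values, log_prob_target, log_prob_behavior,
                     gamma=0.99, lam=0.95, L=2.0):
        """Illustrative importance-sampling advantage estimator (not the paper's exact formula).

        rewards           : array of length T
        values            : array of length T + 1 (includes bootstrap value of the final state)
        log_prob_target   : log pi_target(a_t | s_t), length T
        log_prob_behavior : log pi_behavior(a_t | s_t), length T
        L                 : assumed bound on the importance sampling ratio, clipped to [1/L, L]
        """
        T = len(rewards)
        # Importance sampling ratio between target and behavior policies, clipped by L.
        rho = np.clip(np.exp(np.asarray(log_prob_target) - np.asarray(log_prob_behavior)),
                      1.0 / L, L)
        advantages = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            # One-step TD error, as in generalized advantage estimation.
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            # Weight the TD error by the clipped ratio before accumulating the GAE trace.
            gae = rho[t] * delta + gamma * lam * gae
            advantages[t] = gae
        return advantages

The clipping step mirrors the role the abstract assigns to the L parameter: samples whose target/behavior probability ratio drifts too far are down-weighted rather than allowed to destabilize the update.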

Key words: reinforcement learning, importance sampling, deep reinforcement learning, advantage function

