Journal on Communications ›› 2017, Vol. 38 ›› Issue (4): 166-177. doi: 10.11959/j.issn.1000-436x.2017089

• Correspondences •

Actor-critic algorithm with incremental dual natural policy gradient

Peng ZHANG1, Quan LIU1,2,3, Shan ZHONG1, Jian-wei ZHAI1, Wei-sheng QIAN1

  1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
    2 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Revised: 2017-03-03  Online: 2017-04-01  Published: 2017-07-20
  • Supported by:
    The National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323, 61502329); The Natural Science Foundation of Jiangsu Province (BK2012616); The Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (13KJB520020); The Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04); The Suzhou Industrial Application of Basic Research Program (SYG201422, SYG201308)

Abstract:

Existing algorithms for continuous action spaces fail to consider how to select the optimal action and how to exploit knowledge of the action range, so an efficient actor-critic algorithm based on an improved natural gradient was proposed. Its objective was to maximize the expected return. The upper and lower bounds of the action range were weighted to obtain the optimal action, and both bounds were approximated by linear functions, which transformed the problem of obtaining the optimal action into the learning of two policy parameter vectors. To speed up learning, an incremental Fisher information matrix and eligibility traces for both bounds were designed. Simulation results on three reinforcement learning problems show that, compared with other representative continuous-action methods, the proposed algorithm converges faster and more stably.
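To make the abstract's construction concrete, the following is a minimal NumPy sketch of an actor-critic agent whose action is a weighted combination of learned lower and upper bounds of the action range, with a natural-gradient actor update driven by an incrementally maintained inverse Fisher matrix (realized here via Sherman-Morrison rank-1 updates). The class name, the Gaussian exploration model, the fixed weight kappa, and all hyper-parameters are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

class DualBoundActorCritic:
    """Minimal sketch of the abstract's idea, not the authors' exact
    algorithm: the greedy action is a weighted combination of learned
    lower/upper bounds of the action range, each a linear function of
    state features; the two parameter vectors are updated with a
    natural gradient built from an incrementally maintained Fisher
    information matrix. All names and hyper-parameters are assumed."""

    def __init__(self, n_features, alpha=0.01, beta=0.1,
                 gamma=0.99, lam=0.9, sigma=0.3):
        n = n_features
        self.theta_lo = np.zeros(n)       # lower-bound policy parameters
        self.theta_hi = np.zeros(n)       # upper-bound policy parameters
        self.w = np.zeros(n)              # critic (state-value) weights
        self.z_v = np.zeros(n)            # critic eligibility trace
        self.z_pi = np.zeros(2 * n)       # joint eligibility of both bounds
        # Inverse Fisher matrix over the stacked vector [theta_lo; theta_hi],
        # maintained incrementally via rank-1 (Sherman-Morrison) updates.
        self.F_inv = 100.0 * np.eye(2 * n)
        self.alpha, self.beta = alpha, beta
        self.gamma, self.lam, self.sigma = gamma, lam, sigma

    def action(self, phi, kappa=0.5):
        """Weight the two learned bounds to get the action mean, then
        explore with Gaussian noise (an assumed exploration model)."""
        mean = kappa * (self.theta_lo @ phi) + (1 - kappa) * (self.theta_hi @ phi)
        return np.random.normal(mean, self.sigma), mean

    def update(self, phi, a, mean, r, phi_next, done, kappa=0.5):
        # Critic: linear TD(lambda).
        v_next = 0.0 if done else self.w @ phi_next
        delta = r + self.gamma * v_next - self.w @ phi
        self.z_v = self.gamma * self.lam * self.z_v + phi
        self.w += self.beta * delta * self.z_v
        # Actor: score of the Gaussian policy w.r.t. both bound vectors.
        g = (a - mean) / self.sigma ** 2
        score = np.concatenate([kappa * g * phi, (1 - kappa) * g * phi])
        self.z_pi = self.gamma * self.lam * self.z_pi + score
        # Incremental inverse-Fisher update: Sherman-Morrison on F + s s^T.
        Fs = self.F_inv @ score
        self.F_inv -= np.outer(Fs, Fs) / (1.0 + score @ Fs)
        # Natural-gradient step on the stacked parameter vector.
        step = self.alpha * delta * (self.F_inv @ self.z_pi)
        n = len(phi)
        self.theta_lo += step[:n]
        self.theta_hi += step[n:]
```

The Sherman-Morrison step keeps the inverse Fisher matrix up to date in O(n^2) per transition instead of re-inverting it, which is one plausible reading of the "incremental Fisher information matrix" the abstract mentions.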

Key words: reinforcement learning, natural gradient, actor-critic, continuous space

CLC Number: 
