Chinese Journal of Network and Information Security ›› 2023, Vol. 9 ›› Issue (4): 16-28.doi: 10.11959/j.issn.2096-109x.2023050

• Papers •

Function approximation method based on weights gradient descent in reinforcement learning

Xiaoyan QIN1, Yuhan LIU2, Yunlong XU3, Bin LI4   

  1. 1 School of Information and Software, Global Institute of Software Technology, Suzhou 215163, China
    2 University of Waterloo, Waterloo, N2L3G4, Canada
    3 Applied Technology College, Soochow University, Suzhou 215325, China
    4 School of Computer Science and Technology, Soochow University, Suzhou 215325, China
  • Revised:2023-05-30 Online:2023-08-01 Published:2023-08-01
  • Supported by:
    The National Natural Science Foundation of China (61772355); The National Natural Science Foundation of China (61702055); The National Natural Science Foundation of China (61876217); The National Natural Science Foundation of China (62176175); Jiangsu Province Natural Science Research University Major Projects (18KJA520011); Jiangsu Province Natural Science Research University Major Projects (17KJA520004); Suzhou Industrial Application of Basic Research Program Part (SYG201422); Jiangsu Province High End Research and Training Project for Professional Leaders of Teachers in Vocational Colleges (2021GRFX052); Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions; Project Funded by Jiangsu Province Vocational Education Double-Qualified Teaching Studio in Software Technology

Abstract:

Function approximation has gained significant attention in reinforcement learning research because it effectively addresses problems with large-scale, continuous state and action spaces. Although function approximation based on gradient descent is one of the most widely used methods in reinforcement learning, it requires careful tuning of the step-size parameter: an inappropriate value can lead to slow convergence, unstable convergence, or even divergence. To address these issues, an improvement was made to the temporal-difference (TD) algorithm based on function approximation. The weight update was enhanced by combining the least-squares method with gradient descent, yielding the proposed weights gradient descent (WGD) method. Least squares were used to calculate target weights; combining the ideas of TD and gradient descent, the error between these target weights and the current weights was computed, and this error was used to update the weights directly. Updating the weights in this new manner effectively reduces the algorithm's consumption of computing resources and enhances other gradient-descent-based function approximation algorithms. The WGD method is widely applicable to various gradient-descent-based reinforcement learning algorithms. The results show that WGD can adjust parameters within a wider space, effectively reducing the possibility of divergence, while achieving better performance and faster convergence.
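The abstract describes solving for weights by least squares and then using the error between those weights and the current weights as the update direction. A minimal sketch of that idea, assuming linear function approximation and an LSTD-style least-squares system (the function name `wgd_update`, the regularization term, and the exact system are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def wgd_update(w, phi, phi_next, rewards, gamma=0.99, alpha=0.5, reg=1e-3):
    """One weights-gradient-descent (WGD) style update, as sketched here:
    solve a least-squares (LSTD-like) system for target weights, then
    step the current weights toward them.

    w        : (d,)   current weight vector
    phi      : (n, d) feature matrix of visited states
    phi_next : (n, d) feature matrix of successor states
    rewards  : (n,)   observed rewards
    """
    # LSTD-style system: A w_ls = b, with A = Phi^T (Phi - gamma Phi'), b = Phi^T r.
    # A small ridge term keeps A invertible (an assumption for this sketch).
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(phi.shape[1])
    b = phi.T @ rewards
    w_ls = np.linalg.solve(A, b)   # least-squares target weights
    error = w_ls - w               # "error between the weights"
    return w + alpha * error       # update the weights directly along that error
```

With `alpha = 1.0` the update jumps straight to the least-squares solution; smaller values interpolate between the current weights and that target, which is one way to trade stability against convergence speed without tuning a classic TD step size.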

Key words: function approximation, reinforcement learning, gradient descent, least-squares, weights gradient descent

