Chinese Journal of Network and Information Security ›› 2023, Vol. 9 ›› Issue (4): 16-28. doi: 10.11959/j.issn.2096-109x.2023050

• Academic Papers •

Function approximation method based on weight gradient descent in reinforcement learning

Xiaoyan QIN1, Yuhan LIU2, Yunlong XU3, Bin LI4

  1. School of Information and Software, Global Institute of Software Technology, Suzhou 215163, China
    2. University of Waterloo, Waterloo N2L 3G4, Canada
    3. Applied Technology College, Soochow University, Suzhou 215325, China
    4. School of Computer Science and Technology, Soochow University, Suzhou 215325, China
  • Revised: 2023-05-30 • Online: 2023-08-01 • Published: 2023-08-01
  • About the authors: Xiaoyan QIN (1984- ), female, born in Taizhou, Jiangsu, is an associate professor at the School of Information and Software, Global Institute of Software Technology. Her research interests include software engineering and artificial intelligence.
    Yuhan LIU (2000- ), female, born in Daqing, Heilongjiang, is a master's student at the University of Waterloo, Canada. Her research interest is digital media technology.
    Yunlong XU (1964- ), male, born in Suzhou, Jiangsu, is an associate professor at Soochow University. His research interests include machine learning and operating systems.
    Bin LI (1994- ), male, born in Zhenjiang, Jiangsu. His research interest is reinforcement learning.
  • Supported by:
    The National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); the Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); the Suzhou Applied Basic Research Program, Industrial Part (SYG201422); the Jiangsu Province High-End Research and Training Project for Professional Leaders of Teachers in Vocational Colleges (2021GRFX052); the Priority Academic Program Development of Jiangsu Higher Education Institutions; the Jiangsu Province Vocational Education "Double-Qualified" Master Teacher Studio in Software Technology


Abstract:

Function approximation has attracted extensive attention in reinforcement learning research because it can effectively handle problems with large-scale or continuous state and action spaces. Although function approximation based on gradient descent is among the most widely used methods in reinforcement learning, it is demanding on the step-size parameter: an inappropriate value can cause slow convergence, unstable convergence, or even divergence. To address these problems, the weight-update rule of the temporal-difference (TD) algorithm with function approximation was improved by building on the least-squares method and gradient descent, yielding a weight gradient descent (WGD) method. The least-squares method is used to solve the value function for the weights; following the ideas of TD and gradient descent, the error between these weights and the current weights is computed and used to update the weights directly. Updating the weights in this new manner effectively reduces the algorithm's consumption of computing resources, and the method can also be used to improve other gradient-descent-based function approximation algorithms, making it applicable to many gradient-descent-based reinforcement learning algorithms. Experiments show that WGD can adjust parameters in a wider space, effectively reduces the possibility of divergence, and improves convergence speed while maintaining good convergence performance.
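For context, the step-size sensitivity described above concerns the standard semi-gradient TD(0) update (textbook form, not taken from this paper), in which the step size $\alpha$ scales a value-space TD error; WGD, as described in the abstract, instead computes an error in weight space. A plausible reading, assuming $\mathbf{w}_{LS}$ denotes the least-squares weights:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ r + \gamma \hat{v}(s'; \mathbf{w}) - \hat{v}(s; \mathbf{w}) \big] \, \nabla_{\mathbf{w}} \hat{v}(s; \mathbf{w}) \quad \text{(semi-gradient TD(0))}$$

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \, (\mathbf{w}_{LS} - \mathbf{w}) \quad \text{(WGD-style weight-space update; assumed form)}$$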

Key words: function approximation, reinforcement learning, gradient descent, least squares, weight gradient descent
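To make the update concrete, below is a minimal sketch in Python of one plausible implementation for linear value approximation V(s) ≈ φ(s)ᵀw: LSTD-style statistics yield the least-squares weights, and the weight-space error drives the update. The class name WGDLinearTD, the ridge regularization, and the exact update form are illustrative assumptions, not the paper's published algorithm.

```python
import numpy as np

# Minimal sketch (an assumed reconstruction, not the paper's published code):
# linear value approximation V(s) ~= phi(s) . w, with LSTD-style least-squares
# statistics providing a target weight vector, and a "weight gradient descent"
# step that moves the current weights along the error *between weights*.

class WGDLinearTD:
    def __init__(self, n_features, gamma=0.99, alpha=0.1, reg=1e-3):
        self.gamma = gamma                 # discount factor
        self.alpha = alpha                 # step size on the weight-space error
        self.w = np.zeros(n_features)      # current weights
        # running LSTD statistics: solving A w = b gives least-squares weights
        self.A = reg * np.eye(n_features)  # small ridge term keeps A invertible
        self.b = np.zeros(n_features)

    def update(self, phi, reward, phi_next):
        """Consume one transition (phi(s), r, phi(s')) and update the weights."""
        # accumulate the least-squares (LSTD) statistics
        self.A += np.outer(phi, phi - self.gamma * phi_next)
        self.b += reward * phi
        # least-squares solution for the value function seen so far
        w_ls = np.linalg.solve(self.A, self.b)
        # error between weights (not a value-space TD error), applied directly
        self.w += self.alpha * (w_ls - self.w)

    def value(self, phi):
        return float(self.w @ phi)
```

As a usage sketch, instantiate with the feature dimension and call update(phi, r, phi_next) once per observed transition. Note that, as written, the sketch solves an n×n linear system every step; a practical variant might maintain the inverse incrementally (e.g., a Sherman-Morrison update, as in recursive LSTD) or refresh the least-squares target only periodically.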


