Research on constrained policy reinforcement learning based multi-objective optimization of computing power network

doi:10.11959/j.issn.1000-0801.2023165

Abstract

A computing power network must maximize system performance indicators while meeting users' service requirements. Existing methods mainly convert and solve this problem through multi-objective weighting, which makes the hyperparameters hard to determine and transfers poorly across scenarios. Based on an analysis of the characteristics of computing power network objectives, a method built on constrained policy reinforcement learning is proposed that treats service requirements as constraints and system performance indicators as the optimization objective. A multi-level value-policy-hyperparameter iteration provides an expectation-level deterministic guarantee of user service requirements while optimizing system performance. In addition, a multi-scale step length (MSL) method for hyperparameter optimization is studied, which further improves the stability and accuracy of the system. Simulation results show that the proposed method converges well and remains stable in single-terminal single-edge-server and multi-terminal multi-edge-server architectures and under changing system load.

Keywords: computing power network, multi-objective optimization, reinforcement learning
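The abstract describes treating service requirements as a constraint with a weight that is itself iterated. A minimal sketch of one standard way to do this, projected dual ascent on a Lagrange multiplier in the spirit of reward constrained policy optimization, is given below. It is not the paper's algorithm: the function names, the fixed step size, and the sample timeout rates are illustrative assumptions, and the MSL step schedule is not reproduced because its details are not given here.

```python
def penalized_reward(reward, cost, lam):
    """Reward seen by the policy: performance reward minus weighted constraint cost."""
    return reward - lam * cost


def update_lambda(lam, timeout_rate, threshold, eta):
    """Projected dual ascent on the constraint weight.

    lam grows while the measured timeout rate exceeds the threshold,
    shrinks once the constraint is satisfied, and never goes below zero.
    """
    return max(0.0, lam + eta * (timeout_rate - threshold))


lam = 0.0
# Hypothetical timeout rates measured after successive policy updates.
for rate in [0.30, 0.20, 0.08, 0.04]:
    lam = update_lambda(lam, rate, threshold=0.05, eta=1.0)
```

In this sketch the weight rises while the timeout rate is above the 0.05 threshold and starts to fall on the last step, once the rate drops to 0.04. A fixed step size trades off speed against oscillation around the threshold, which is the kind of issue a step-length schedule would be meant to address.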

CLC number:

TP393

Linjiang SHEN, Chang CAO, Chao CUI, Yan ZHANG. Research on constrained policy reinforcement learning based multi-objective optimization of computing power network[J]. Telecommunications Science, 2023, 39(8): 136-148.

Figures and tables (10)

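Table 1. System parameters

Parameter	Meaning	Parameter	Meaning
$R$	Number of edge servers	$k_l$	Terminal energy consumption coefficient
$S$	Edge server	$\varepsilon$	Small-scale fading channel power gain
$L$	Number of terminals	$H$	Channel gain
$M$	Terminal	$g_0$	Path loss constant
$A$	Computing task	$\theta$	Path loss exponent
$D$	Task data size	$d_0$	Reference distance
$C$	Task computation amount	$N_0$	Noise power density
$\sigma$	Task processing deadline	$d$	Distance from terminal to edge server
$Q_e$	Edge server task queue	$p$	Terminal transmit power
$N_e$	Edge server queue length	$w$	Total bandwidth
$f_n$	Number of equal frequency levels	$\alpha$	Fraction of bandwidth occupied by the terminal
$f_e$	Edge server frequency	State	Reinforcement learning system state
$T_{e}^{w}$	Edge server queue waiting time	Action	Reinforcement learning action space
$T_{e}^{c}$	Edge server task computation time	$I_l$	Whether the terminal task is completed on time
$T_{e}^{p}$	Terminal-to-edge transmission time	$I_e$	Whether the edge server task is completed on time
$T_e$	Total processing time of an edge server task	$I_p$	Task offloading location flag
$E_e$	Edge server energy consumption	$\pi$	Policy function of the agent
$E_{e}^{p}$	Terminal-to-edge-server transmission energy consumption	$V_{R}^{\pi}$	Policy value function
$k_e$	Edge server energy consumption coefficient	$\gamma$	Reward discount factor
$Q_l$	Terminal task queue	$\mu$	Initial system state distribution
$N_l$	Terminal queue length	$J_{R}^{\pi}$	System optimization objective function
$f_l$	Terminal frequency	$J_{R}^{c}$	System constraint objective function
$T_{l}^{w}$	Terminal queue waiting time	$\epsilon$	Timeout rate threshold
$T_{l}^{c}$	Terminal queue task computation time	$\lambda$	Constraint weight factor
$T_l$	Total processing time of a terminal task	${\hat{V}}_{R}^{\pi}$	Discounted policy value function
$E_l$	Terminal energy consumption	$\eta$	Iteration step size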
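The equations that tie these symbols together are not reproduced on this page. As a reading aid only, a delay and energy model commonly used in the mobile-edge computing literature, written with the symbols of Table 1, would look as follows. The uplink rate $r$ is a helper symbol introduced here, and the exact forms are assumptions rather than the paper's stated model.

$$
\begin{aligned}
H &= \varepsilon\, g_0 \left(\frac{d_0}{d}\right)^{\theta}, \\
r &= \alpha w \log_2\!\left(1 + \frac{pH}{N_0\,\alpha w}\right), \\
T_e^p &= \frac{D}{r}, \qquad E_e^p = p\,T_e^p, \\
T_e^c &= \frac{C}{f_e}, \qquad T_e = T_e^w + T_e^c + T_e^p, \\
T_l^c &= \frac{C}{f_l}, \qquad T_l = T_l^w + T_l^c, \\
E_e &= k_e f_e^2 C, \qquad E_l = k_l f_l^2 C.
\end{aligned}
$$

Under these assumptions, offloading trades local computation time and energy for transmission time and energy plus queueing at the edge server, which is the tension the offloading flag $I_p$ resolves.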

Figure 1

Table 2. Simulation system parameters

Parameter	Value or range
$\varepsilon_{l,e}$	Exp(1)
$w$	10 MHz
$N_0$	-174 dBm/Hz
$g_0$	-40 dB
$\theta$	4
$d_0$	1 m
$d_{l,e}$	[90, 150] m
$p_{l,e}$	500 mW
$k_l$	$10^{-27}$
$k_e$	$10^{-27}$
$D$	[0, 50 000] bit
$C$	[30 000, 50 000] CPU cycles
$f_{e,\max}$	2.5 GHz
$f_{l,\max}$	1 GHz
$f_n$	10
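To give a feel for the magnitudes these settings imply, the following sketch samples one task and one channel realization from the ranges in Table 2 and evaluates the transmission and local computation costs. The path-loss, Shannon-rate, and $k f^2 C$ energy formulas are standard modeling assumptions, not equations taken from this page, and all function and variable names are hypothetical.

```python
import math
import random

# Values taken from Table 2.
W_HZ = 10e6              # total bandwidth w
N0_DBM_PER_HZ = -174.0   # noise power density N0
G0_DB = -40.0            # path loss constant g0
THETA = 4.0              # path loss exponent
D0_M = 1.0               # reference distance d0
P_W = 0.5                # terminal transmit power p (500 mW)
K_L = 1e-27              # terminal energy consumption coefficient k_l
F_L_MAX_HZ = 1e9         # maximum terminal frequency


def dbm_to_watt(dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** (dbm / 10.0) / 1000.0


def db_to_linear(db):
    """Convert a ratio in dB to linear scale."""
    return 10 ** (db / 10.0)


def channel_gain(eps, d_m):
    """Assumed path-loss model: H = eps * g0 * (d0 / d)^theta."""
    return eps * db_to_linear(G0_DB) * (D0_M / d_m) ** THETA


def uplink_rate(gain, alpha):
    """Assumed Shannon rate in bit/s on a share alpha of the bandwidth."""
    band = alpha * W_HZ
    noise = dbm_to_watt(N0_DBM_PER_HZ) * band
    return band * math.log2(1.0 + P_W * gain / noise)


def local_cost(cycles, f_hz):
    """Assumed local model: time = C / f, energy = k_l * f^2 * C."""
    return cycles / f_hz, K_L * f_hz ** 2 * cycles


rng = random.Random(0)                    # fixed seed for repeatability
eps = rng.expovariate(1.0)                # Exp(1) small-scale fading
d = rng.uniform(90.0, 150.0)              # distance in metres
data_bits = rng.uniform(0.0, 50_000.0)    # task data size D
cycles = rng.uniform(30_000.0, 50_000.0)  # task computation amount C

rate = uplink_rate(channel_gain(eps, d), alpha=1.0)
t_tx = data_bits / rate                   # transmission time in seconds
e_tx = P_W * t_tx                         # transmission energy in joules
t_loc, e_loc = local_cost(cycles, F_L_MAX_HZ)
```

With unit fading at 100 m and the full 10 MHz band, this model gives an uplink rate of a few tens of Mbit/s, so a 50 000 bit task takes on the order of a millisecond to transmit, while local execution of 40 000 cycles at 1 GHz takes 40 microseconds.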

Figures 2 to 8
