Heuristic Sarsa algorithm based on value function transfer

doi:10.11959/j.issn.1000-436x.2018133

Journal on Communications, 2018, Vol. 39, Issue 8: 37-47.

• Artificial Intelligence and Network Security •

Heuristic Sarsa algorithm based on value function transfer

Jianping CHEN¹,²,³, Zhengxia YANG¹,²,³, Quan LIU⁴, Hongjie WU¹,²,³, Yang XU⁵, Qiming FU¹,²,³

¹ Institute of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
² Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215009, China
³ Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou 215009, China
⁴ School of Computer Science and Technology, Soochow University, Suzhou 215000, China
⁵ Institute of Information Engineering, Zhejiang Fashion Institute of Technology College, Ningbo 315000, China

Revised: 2018-07-13    Online: 2018-08-01    Published: 2018-09-13
Supported by:
The National Natural Science Foundation of China (61502329, 61772357, 61750110519, 61772355, 61702055, 61672371, 61602334); The Natural Science Foundation of Jiangsu Province (BK20140283); The Key Research and Development Program of Jiangsu Province (BE2017663); High School Natural Science Foundation of Jiangsu Province (13KJB520020); Suzhou Industrial Application of Basic Research Program Part (SYG201422)

Abstract

To address the slow convergence of the traditional Sarsa algorithm, an improved heuristic Sarsa algorithm based on value function transfer is proposed. The algorithm combines traditional Sarsa with the value function transfer method: it introduces a bisimulation metric to measure the similarity between a new task and historical tasks that share the same state space and action space, and uses that similarity in value function transfer to speed up convergence. In addition, drawing on heuristic exploration, the algorithm introduces Bayesian inference and uses variational inference to measure information gain. The information gain is then used to build an intrinsic reward function model that serves as an exploration factor, further speeding up convergence. The proposed algorithm is applied to the classic Grid World problem and compared with the traditional Sarsa algorithm, the Q-Learning algorithm, and the VFT-Sarsa and IGP-Sarsa algorithms, both of which have better convergence performance. The experimental results show that the proposed algorithm converges faster and more stably.
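The two ingredients named in the abstract can be illustrated with a small tabular sketch. It is not the authors' implementation. The assumptions are: value function transfer is reduced to warm-starting the Q-table from a source task (the bisimulation metric is not computed here); the variational information gain is replaced by a simpler stand-in, the KL divergence between the posterior-mean transition distributions after and before an observed transition under a Dirichlet-style count model; and the names `sarsa`, `info_gain`, `q_init`, and `bonus_weight`, the reward values, and the grid layout are all hypothetical.

```python
import math
import random
from collections import defaultdict

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up


def step(state, action, rows, cols, goal):
    """Deterministic grid move; walls keep the agent in place."""
    r = min(max(state[0] + action[0], 0), rows - 1)
    c = min(max(state[1] + action[1], 0), cols - 1)
    nxt = (r, c)
    return nxt, (1.0 if nxt == goal else -0.01), nxt == goal


def info_gain(counts, nxt, n_states, prior=0.1):
    """KL(posterior mean after || before) for one observed transition.

    Simplified stand-in for a variational information-gain estimate.
    """
    total = sum(counts.values()) + prior * n_states
    p_old = (counts[nxt] + prior) / total
    # Only the observed outcome gains mass; all others shrink by one factor.
    p_new = (counts[nxt] + prior + 1.0) / (total + 1.0)
    rest_old, rest_new = 1.0 - p_old, 1.0 - p_new
    gain = p_new * math.log(p_new / p_old)
    if rest_new > 0.0:
        gain += rest_new * math.log(rest_new / rest_old)
    return gain


def epsilon_greedy(q, state, eps, rng):
    """Pick a random action with probability eps, else a greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    values = [q[(state, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))


def sarsa(rows, cols, goal, q_init=None, episodes=200, alpha=0.5,
          gamma=0.95, eps=0.1, bonus_weight=0.5, seed=0):
    """Tabular Sarsa with an optional warm start and an intrinsic bonus."""
    rng = random.Random(seed)
    q = defaultdict(float)
    if q_init:  # assumed form of value function transfer: copy source Q
        q.update(q_init)
    counts = defaultdict(lambda: defaultdict(float))
    steps_per_episode = []
    for _ in range(episodes):
        state, steps, done = (0, 0), 0, False
        action = epsilon_greedy(q, state, eps, rng)
        while not done and steps < 500:
            nxt, reward, done = step(state, ACTIONS[action], rows, cols, goal)
            model = counts[(state, action)]
            # Intrinsic reward: information gained about the transition model.
            reward += bonus_weight * info_gain(model, nxt, rows * cols)
            model[nxt] += 1.0
            nxt_action = epsilon_greedy(q, nxt, eps, rng)
            # On-policy target: uses the action actually chosen in nxt.
            target = reward + (0.0 if done else gamma * q[(nxt, nxt_action)])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action, steps = nxt, nxt_action, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


if __name__ == "__main__":
    source_q, _ = sarsa(5, 6, goal=(4, 5))              # historical task
    _, cold = sarsa(5, 6, goal=(4, 4), seed=1)          # new task, no transfer
    _, warm = sarsa(5, 6, goal=(4, 4), q_init=source_q, seed=1)
    print(sum(cold[:20]), sum(warm[:20]))
```

Comparing the step counts of the `cold` and `warm` runs gives a rough feel for how a warm start changes early learning. A faithful version would weight or select the transferred values by the bisimulation distance between tasks rather than copying them unconditionally.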

Key words: reinforcement learning, value function transfer, bisimulation metric, variational Bayes

CLC Number: TP391

Jianping CHEN, Zhengxia YANG, Quan LIU, Hongjie WU, Yang XU, Qiming FU. Heuristic Sarsa algorithm based on value function transfer[J]. Journal on Communications, 2018, 39(8): 37-47.

Problem size	η=0.3	η=0.5	η=0.6	η=0.8	η=2	η=5	η=8
5×6	456	36	102	236	543	956	1026
10×10	612	156	84	156	456	1023	1456
