通信学报 (Journal on Communications) ›› 2016, Vol. 37 ›› Issue (6): 65-74. doi: 10.11959/j.issn.1000-436x.2016117


Online hierarchical reinforcement learning based on interrupting Option

Fei ZHU1,2, Zhi-peng XU1, Quan LIU1,2, Yu-chen FU1, Hui WANG1

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
  2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Online: 2016-06-25 Published: 2017-08-04
  • Supported by:
    The National Natural Science Foundation of China (four grants); The Natural Science Research Foundation of Jiangsu Higher Education Institutions; The Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; The Suzhou Applied Basic Research Program; The Foundation of the Provincial Key Laboratory for Computer Information Processing Technology, Soochow University; The China Scholarship Council

Abstract:

To deal with the large volume of big data, an online-updating algorithm based on Macro-Q, named Macro-Q with in-place updating (MQIU), was proposed. It updates the value function of abstract actions and the value function of primitive actions at the same time, which improves the utilization of data samples and speeds up convergence, so that larger-scale problems can be solved. Since both the conventional Markov decision process model and abstract actions have difficulty coping with variability, an interruption mechanism was introduced and a model-free Macro-Q learning algorithm with interruptible abstract actions (IMQ) was proposed, which can learn and improve control policies in a dynamic environment. Simulation results verify that MQIU accelerates convergence and that IMQ solves tasks faster while keeping the learning performance stable.

Key words: big data, reinforcement learning, hierarchical reinforcement learning, Option, online learning
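
The abstract only names the algorithms, so the following Python sketch is a rough illustration of the in-place updating idea behind MQIU, not the paper's implementation: while an Option executes, every primitive transition also feeds a one-step Q-learning update, and the Option itself receives the usual SMDP Macro-Q update on termination. The toy corridor task and all names (step, GoRightOption, run_option_mqiu) are illustrative assumptions.

from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1
N = 10                      # toy corridor: states 0..N-1, goal at N-1 (assumed example task)
ACTIONS = (-1, +1)          # primitive actions: step left / step right

Q = defaultdict(float)      # one value table for both primitive actions and Options

def step(s, a):
    """Toy deterministic transition: move along the corridor, reward 1 only at the goal."""
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)

class GoRightOption:
    """A hand-coded Option: keep stepping right until the corridor midpoint."""
    def policy(self, s):      return +1
    def terminates(self, s):  return s >= N // 2

def run_option_mqiu(state, option):
    """Execute an Option to termination. The MQIU idea: besides the usual
    SMDP Macro-Q update for the Option, do an in-place one-step Q-learning
    update for every primitive action executed inside it, so no sample is wasted."""
    s, total_reward, tau = state, 0.0, 0
    while not option.terminates(s):
        a = option.policy(s)
        s2, r = step(s, a)
        # in-place update of the primitive action's value function
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        total_reward += (GAMMA ** tau) * r
        s, tau = s2, tau + 1
    # standard Macro-Q (SMDP) update for the Option itself, discounted by its duration tau
    behaviours = list(ACTIONS) + [option]
    best_after = max(Q[(s, b)] for b in behaviours)
    Q[(state, option)] += ALPHA * (total_reward + (GAMMA ** tau) * best_after - Q[(state, option)])
    return s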

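The interruption mechanism that distinguishes IMQ can be sketched in the same style, reusing the definitions above. The rule follows the classical interrupting-Options idea: cut an executing Option short as soon as some other behaviour looks strictly better in the current state. Again, this is an assumed reconstruction for illustration, not the authors' code.

def run_option_imq(state, option):
    """Like run_option_mqiu, but the Option is interruptible (the IMQ idea):
    before each primitive step, compare the value of continuing the Option
    with the best alternative behaviour in the current state, and cut the
    Option short as soon as continuing is no longer the greedy choice."""
    behaviours = list(ACTIONS) + [option]
    s, total_reward, tau = state, 0.0, 0
    while not option.terminates(s):
        # interruption check: some other behaviour is strictly better than continuing
        if max(Q[(s, b)] for b in behaviours) > Q[(s, option)]:
            break
        a = option.policy(s)
        s2, r = step(s, a)
        total_reward += (GAMMA ** tau) * r
        s, tau = s2, tau + 1
    best_after = max(Q[(s, b)] for b in behaviours)
    Q[(state, option)] += ALPHA * (total_reward + (GAMMA ** tau) * best_after - Q[(state, option)])
    return s

Because the interruption test is evaluated against the learned value table at every step, the Option set adapts to a changing environment without a model, which is what lets IMQ keep improving its control policy online.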
