通信学报 (Journal on Communications) ›› 2013, Vol. 34 ›› Issue (10): 92-99. doi: 10.3969/j.issn.1000-436x.2013.10.011

• Academic Paper •

Temporal-difference (TD) algorithm based on double-layer fuzzy partitioning

Xiang MU 1, Quan LIU 1,2, Qi-ming FU 1, Hong-kun SUN 1, Xin ZHOU 1

  1. 1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
     2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Online: 2013-10-25 Published: 2017-08-10
  • Supported by:
    The National Natural Science Foundation of China (four grants); The Natural Science Foundation of Jiangsu Province; The Natural Science Research Foundation of Jiangsu Higher Education Institutions (two grants); The Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

TD algorithm based on double-layer fuzzy partitioning

Xiang MU 1, Quan LIU 1,2, Qi-ming FU 1, Hong-kun SUN 1, Xin ZHOU 1

  1. 1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
     2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Online: 2013-10-25 Published: 2017-08-10
  • Supported by:
    The National Natural Science Foundation of China (four grants); The Natural Science Foundation of Jiangsu Province; The Natural Science Research Foundation of Jiangsu Higher Education Institutions (two grants); The Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

Abstract:

Traditional Q-value iteration algorithms based on lookup tables or function approximation converge slowly on continuous-space problems and have difficulty deriving continuous action policies. To address this, an on-policy temporal-difference algorithm based on double-layer fuzzy partitioning, DFP-OPTD, is proposed, and its convergence is analyzed theoretically. In the algorithm, the first fuzzy partitioning layer is applied to the state space and the second to the action space, and the Q-value function is computed by combining the two layers. Based on the resulting Q-value function, the consequent parameters of the fuzzy rules are updated by gradient descent. DFP-OPTD is applied to classical reinforcement learning problems, and experimental results show that the algorithm has good convergence performance and can derive continuous action policies.
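The abstract does not reproduce the paper's formulas. As a hedged illustration only, a Q-value function built from two fuzzy layers and its gradient-descent TD update are commonly written in the following form, where φ_i(s) and ψ_j(a) denote normalized membership degrees from the state-layer and action-layer partitions and θ_ij the rule consequent parameters (all notation introduced here, not taken from the paper):

```latex
Q(s,a) = \sum_{i}\sum_{j} \phi_i(s)\,\psi_j(a)\,\theta_{ij},
\qquad
\delta_t = r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t),
\qquad
\theta_{ij} \leftarrow \theta_{ij} + \alpha\,\delta_t\,\phi_i(s_t)\,\psi_j(a_t).
```

Because ∂Q/∂θ_ij = φ_i(s)ψ_j(a), gradient descent on the squared TD error reduces to the last expression; using the next on-policy action a_{t+1} in the target is what makes the update on-policy (Sarsa-style).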

Key words: reinforcement learning, on-policy, gradient descent, double-layer fuzzy partitioning, continuous action policy

Abstract:

When dealing with continuous-space problems, traditional Q-iteration algorithms based on lookup tables or function approximation converge slowly and have difficulty deriving a continuous policy. To overcome these weaknesses, an on-policy TD algorithm named DFP-OPTD was proposed based on double-layer fuzzy partitioning, and its convergence was proved. The first layer of fuzzy partitioning was applied to the state space, the second layer to the action space, and the Q-value function was computed by combining the two layers. Based on the Q-value function, the consequent parameters of the fuzzy rules were updated by gradient descent. Applying DFP-OPTD to two classical reinforcement learning problems, experimental results show that the algorithm not only can be used to obtain a continuous action policy, but also has better convergence performance.
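The paper's pseudocode is not reproduced on this page, so the sketch below only illustrates the mechanism the abstract describes, under explicit assumptions: the class name DoubleLayerFuzzyQ, the Gaussian normalized membership functions, the scalar state and action, and the defuzzified greedy action are illustrative choices, not the authors' definitions.

```python
# Minimal sketch of an on-policy TD update over a double-layer fuzzy Q-function.
# Assumptions (not from the paper): Gaussian normalized memberships, scalar
# state/action, and center-weighted defuzzification for the greedy action.
import numpy as np

class DoubleLayerFuzzyQ:
    def __init__(self, state_centers, action_centers, sigma_s=0.5, sigma_a=0.5):
        self.state_centers = np.asarray(state_centers)    # layer 1: fuzzy partition of the state space
        self.action_centers = np.asarray(action_centers)  # layer 2: fuzzy partition of the action space
        self.sigma_s, self.sigma_a = sigma_s, sigma_a
        # consequent parameters theta[i, j], one per (state rule, action fuzzy set) pair
        self.theta = np.zeros((len(self.state_centers), len(self.action_centers)))

    def _memberships(self, x, centers, sigma):
        # Gaussian membership degrees, normalized to sum to 1 (assumed form)
        mu = np.exp(-0.5 * ((x - centers) / sigma) ** 2)
        return mu / mu.sum()

    def q_value(self, s, a):
        # Q(s, a) combines both fuzzy layers: a membership-weighted sum of consequents
        phi = self._memberships(s, self.state_centers, self.sigma_s)
        psi = self._memberships(a, self.action_centers, self.sigma_a)
        return phi @ self.theta @ psi

    def greedy_action(self, s):
        # Defuzzified continuous action: action-set centers weighted by the
        # state-activated consequents (a common fuzzy-controller choice, assumed here)
        phi = self._memberships(s, self.state_centers, self.sigma_s)
        weights = phi @ self.theta                       # one weight per action fuzzy set
        weights = np.maximum(weights - weights.min(), 1e-8)
        return float(weights @ self.action_centers / weights.sum())

    def update(self, s, a, r, s_next, a_next, alpha=0.05, gamma=0.95):
        # On-policy TD error, then gradient descent on the consequent parameters;
        # dQ/dtheta[i, j] = phi_i(s) * psi_j(a), so the gradient is an outer product.
        phi = self._memberships(s, self.state_centers, self.sigma_s)
        psi = self._memberships(a, self.action_centers, self.sigma_a)
        delta = r + gamma * self.q_value(s_next, a_next) - self.q_value(s, a)
        self.theta += alpha * delta * np.outer(phi, psi)
```

For a one-dimensional task, one might build DoubleLayerFuzzyQ(np.linspace(-1, 1, 7), np.linspace(-1, 1, 5)) and call update(s, a, r, s_next, a_next) after every transition; adding a small random perturbation to greedy_action(s) gives an exploring policy whose own actions feed the on-policy target.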

Key words: reinforcement learning, on-policy, gradient descent, double-layer fuzzy partitioning, continuous action policy
