TD algorithm based on double-layer fuzzy partitioning

doi:10.3969/j.issn.1000-436x.2013.10.011

Abstract

When dealing with continuous-space problems, traditional Q-iteration algorithms based on lookup tables or function approximation converge slowly and have difficulty producing a continuous policy. To overcome these weaknesses, an on-policy TD algorithm named DFP-OPTD was proposed based on double-layer fuzzy partitioning, and its convergence was proved. The first layer of fuzzy partitioning was applied to the state space and the second layer to the action space, and Q-value functions were computed by combining the two layers. Based on the Q-value function, the consequent parameters of the fuzzy rules were updated by gradient descent. Experimental results on two classical reinforcement learning problems show that the algorithm not only yields a continuous action policy but also has better convergence performance.

Key words: reinforcement learning, on-policy, gradient descent, double-layer fuzzy partitioning, continuous action policy
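
The abstract gives no formulas, so the following is only a minimal sketch of the idea it describes, not the paper's DFP-OPTD algorithm. It assumes a one-dimensional state and action, triangular membership functions, a Q-value that is linear in the consequent parameters, and a Sarsa-style on-policy TD error. All names (`triangular_memberships`, `DoubleLayerFuzzyQ`, `theta`, `alpha`, `gamma`) are hypothetical.

```python
def triangular_memberships(x, centers):
    """Membership degrees of x over sorted centers; the degrees sum to 1."""
    degrees = [0.0] * len(centers)
    if x <= centers[0]:
        degrees[0] = 1.0
    elif x >= centers[-1]:
        degrees[-1] = 1.0
    else:
        for k in range(len(centers) - 1):
            lo, hi = centers[k], centers[k + 1]
            if lo <= x <= hi:
                w = (x - lo) / (hi - lo)
                degrees[k] = 1.0 - w
                degrees[k + 1] = w
                break
    return degrees


class DoubleLayerFuzzyQ:
    """Assumed form: Q(s, a) = sum_i sum_j phi_i(s) * mu_j(a) * theta[i][j]."""

    def __init__(self, state_centers, action_centers, alpha=0.1, gamma=0.9):
        self.state_centers = state_centers    # first layer: state partition
        self.action_centers = action_centers  # second layer: action partition
        self.alpha = alpha                    # step size (assumed value)
        self.gamma = gamma                    # discount factor (assumed value)
        # One consequent parameter per (state fuzzy set, action fuzzy set).
        self.theta = [[0.0] * len(action_centers) for _ in state_centers]

    def features(self, s, a):
        phi = triangular_memberships(s, self.state_centers)
        mu = triangular_memberships(a, self.action_centers)
        return [[p * m for m in mu] for p in phi]

    def q(self, s, a):
        f = self.features(s, a)
        return sum(f[i][j] * self.theta[i][j]
                   for i in range(len(f)) for j in range(len(f[i])))

    def update(self, s, a, r, s_next, a_next, terminal=False):
        """One on-policy TD step; returns the TD error."""
        target = r if terminal else r + self.gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        f = self.features(s, a)
        # Q is linear in theta, so its gradient w.r.t. theta[i][j] is f[i][j].
        for i in range(len(f)):
            for j in range(len(f[i])):
                self.theta[i][j] += self.alpha * delta * f[i][j]
        return delta
```

Because the action memberships vary continuously with `a`, the sketch's Q-value interpolates between the action centers, which is one way a continuous action policy could be read off such a representation.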

Xiang MU, Quan LIU, Qi-ming FU, Hong-kun SUN, Xin ZHOU. TD algorithm based on double-layer fuzzy partitioning[J]. Journal on Communications, 2013, 34(10): 92-99.

Algorithm	Minimum episodes to converge	Average episodes to converge	Average time per iteration step
DFP-OPTD	142	155	100%
GD-Sarsa(λ)	179	204	49%
