Value iteration algorithm based on topological sequence backups

doi:10.3969/j.issn.1000-436x.2014.08.008

Journal on Communications, 2014, Vol. 35, Issue 8: 56-62.

Academic paper

Optimized algorithm for value iteration based on topological sequence backups

Wei HUANG¹, Quan LIU¹,², Hong-kun SUN¹, Qi-ming FU¹, Xiao-ke ZHOU¹

¹ School of Computer Science and Technology, Soochow University, Suzhou 215006, China
² Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

Online: 2014-08-25; Published: 2017-06-29
Supported by:
The National Natural Science Foundation of China; The Natural Science Foundation of Jiangsu Province; High School Natural Foundation of Jiangsu Province; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

Abstract

To improve the convergence performance of value iteration, an optimized algorithm based on topological sequence backups, VI-TS, is proposed. The key idea of VI-TS is to avoid unnecessary backups: after detecting the structure of the MDP, it divides the MDP into strongly connected components and solves these components in topological order. Experimental results show that, when applied to classical planning scenarios, VI-TS achieves better convergence performance and is more robust to growth of the state space.

Key words: reinforcement learning, value iteration, topological sequence, VI-TS
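The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based MDP encoding, the function names, the discount factor, the tolerance, and the toy MDP are all assumptions made for illustration, and Tarjan's algorithm is used here simply as one standard way to find strongly connected components.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
# Assumed encoding: mdp[s][a] is a list of (probability, next_state, reward).
# Terminal states map to {}. Every next_state must itself be a key of mdp.
MDP = Dict[State, Dict[str, List[Tuple[float, State, float]]]]


def strongly_connected_components(mdp: MDP) -> List[List[State]]:
    """Tarjan's algorithm. Components are emitted in reverse topological
    order: a component appears only after every component it can reach."""
    index, low, on_stack = {}, {}, set()
    stack, components = [], []

    def visit(s):
        index[s] = low[s] = len(index)
        stack.append(s)
        on_stack.add(s)
        for outcomes in mdp[s].values():
            for _, t, _ in outcomes:
                if t not in index:
                    visit(t)
                    low[s] = min(low[s], low[t])
                elif t in on_stack:
                    low[s] = min(low[s], index[t])
        if low[s] == index[s]:  # s is the root of a component
            comp = []
            while True:
                t = stack.pop()
                on_stack.discard(t)
                comp.append(t)
                if t == s:
                    break
            components.append(comp)

    for s in mdp:
        if s not in index:
            visit(s)
    return components


def backup(mdp: MDP, values, s, gamma: float) -> float:
    """One Bellman backup: best expected return over the actions of s."""
    if not mdp[s]:
        return 0.0  # terminal state
    return max(sum(p * (r + gamma * values[t]) for p, t, r in outcomes)
               for outcomes in mdp[s].values())


def topological_value_iteration(mdp: MDP, gamma: float = 0.95, eps: float = 1e-8):
    """Solve each component to convergence, successors before predecessors."""
    values = {s: 0.0 for s in mdp}
    backups = 0
    for comp in strongly_connected_components(mdp):
        while True:
            residual = 0.0
            for s in comp:
                new = backup(mdp, values, s, gamma)
                residual = max(residual, abs(new - values[s]))
                values[s] = new
                backups += 1
            if residual < eps:
                break
    return values, backups


# Hypothetical toy MDP: start -> {a, b} (a cycle) -> goal.
example_mdp: MDP = {
    "start": {"go": [(1.0, "a", 0.0)]},
    "a": {"fwd": [(0.8, "b", 0.0), (0.2, "a", 0.0)]},
    "b": {"fwd": [(0.9, "goal", 1.0), (0.1, "a", 0.0)]},
    "goal": {},
}
values, backups = topological_value_iteration(example_mdp)
```

Because each component is processed only after everything it can reach has converged, states outside a cycle (such as start and goal above) need no repeated sweeps; only the states inside the cycle {a, b} are iterated. The recursive traversal is kept for brevity and would need an iterative form for very large state spaces.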

Wei HUANG, Quan LIU, Hong-kun SUN, Qi-ming FU, Xiao-ke ZHOU. Optimized algorithm for value iteration based on topological sequence backups[J]. Journal on Communications, 2014, 35(8): 56-62.

Algorithm	Time₁	Time₂	Time₃	Time_total
VI	—	—	—	1.636
ILAO*	—	—	—	0.472
LRTDP	—	—	—	0.298
VI-TS	0.011	0.007	0.002	0.019

State space dimensions	VI	ILAO*	LRTDP	VI-TS Time₁	VI-TS Time₂	VI-TS Time₃	VI-TS Time_total
100×100	10.25	1.95	1.31	0.21	0.01	0.02	0.24
300×300	>300	12.15	233.5	2.23	0.13	0.01	2.38
700×700	>300	102.55	>300	12.32	0.75	0.18	13.25

