Chinese Journal of Intelligent Science and Technology ›› 2020, Vol. 2 ›› Issue (4): 314-326. DOI: 10.11959/j.issn.2096-6652.202034
An overview on algorithms and applications of deep reinforcement learning

Zhaoyang LIU1, Chaoxu MU1, Changyin SUN2
Revised:
2020-12-03
Online:
2020-12-15
Published:
2020-12-01
About the author:
Zhaoyang LIU (1996- ), male, is a Ph.D. candidate at the School of Electrical and Information Engineering, Tianjin University. His main research interests include reinforcement learning and multi-agent reinforcement learning.
Supported by:
Abstract:
Deep reinforcement learning (DRL) is mainly used to handle perception-decision problems and has become an important research branch in the field of artificial intelligence. This paper outlines two classes of DRL algorithms, based respectively on value functions and on policy gradients; elaborates the principles of the deep Q-network, deep policy gradient methods, and their improved variants; and surveys the progress of DRL applications in video games, navigation, multi-agent cooperation, recommender systems, and other fields. Finally, it offers an outlook on DRL algorithms and applications and gives suggestions on future research directions and hot topics.
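As a concrete illustration of the value-based family summarized above, the following is a minimal sketch of the core update of a deep Q-network in the spirit of reference [5]: transitions are sampled from a replay memory, and the Q-network is regressed toward a bootstrapped target computed with a periodically frozen target network. This is an illustrative sketch only, not code from the surveyed papers; the network sizes and hyperparameters are assumptions.

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # target network starts as a copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # experience replay of (s, a, r, s', done) tuples

def train_step(batch_size=32):
    """One temporal-difference update on a random minibatch."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s, a)
    with torch.no_grad():  # bootstrapped target uses the frozen target network
        y = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)  # squared TD error (Q(s,a) - y)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the full algorithm the target network is refreshed every fixed number of steps with `target_net.load_state_dict(q_net.state_dict())`. The policy-gradient family reviewed in the survey takes the complementary route: it parameterizes the policy directly and follows a sampled estimate of the gradient of the expected return.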
Zhaoyang LIU, Chaoxu MU, Changyin SUN. An overview on algorithms and applications of deep reinforcement learning[J]. Chinese Journal of Intelligent Science and Technology, 2020, 2(4): 314-326.
Table 1  Application domains and research significance of several classes of DRL

| Application domain | Category | References | Research significance |
| --- | --- | --- | --- |
| Video games | Atari 2600 | [5,14-17,24] | Apply DRL across a variety of game environments to improve the generality of DRL algorithms |
| Video games | ViZDoom, StarCraft II, etc. | [25-34] | Apply DRL to complex game scenarios to improve agents' decision-making ability |
| Navigation | Maze navigation | [35-40] | Design maze environments for specific application scenarios and use DRL to handle the corresponding navigation problems |
| Navigation | Indoor navigation | [41-45] | Train agents with DRL to navigate indoor environments, and attempt to transfer agents trained in virtual environments to the real world |
| Navigation | Street-view navigation | [46-49] | Use DRL for long-distance, city-to-city navigation and improve the generalization ability of DRL algorithms |
| Multi-agent cooperation | Independent learners | [50-59] | Train cooperative agents with independent DRL, which makes it easy to scale up the number of agents |
| Multi-agent cooperation | Centralized critic | [60-64] | Give cooperative agents access to other agents' information through a centralized critic during training, addressing the non-stationarity of the environment (see the sketch after this table) |
| Multi-agent cooperation | Communication | [65-69] | Use DRL in settings where agents can communicate, and exploit communication to promote cooperation among agents |
| Recommender systems | Recommendation algorithms | [70-73] | DRL-based recommendation can adjust the recommendation policy in real time to satisfy users' dynamic preferences, and the recommendation algorithm can obtain long-term returns |
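To make the "centralized critic" row concrete: in methods such as MADDPG (reference [60]), each agent's actor acts on its local observation only, while a critic used purely during training scores the joint observations and actions of all agents, so the other agents stop looking like an unpredictable part of the environment. The sketch below shows that wiring under illustrative assumptions; the sizes and module names are hypothetical, not taken from the cited papers.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 8, 2  # illustrative sizes

# Decentralized actors: each maps a local observation to that agent's action.
actors = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
    for _ in range(n_agents)
])

# Centralized critic: scores the JOINT observation-action vector during training.
critic = nn.Sequential(
    nn.Linear(n_agents * (obs_dim + act_dim), 128), nn.ReLU(), nn.Linear(128, 1)
)

obs = torch.randn(n_agents, obs_dim)                       # one observation per agent
acts = torch.stack([pi(o) for pi, o in zip(actors, obs)])  # decentralized execution
joint = torch.cat([obs.flatten(), acts.flatten()])
q_value = critic(joint)                                    # centralized Q(o_1..o_n, a_1..a_n)
```

At execution time only the actors are needed, which is why this family is often described as "centralized training with decentralized execution."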
References

[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 2018.
[2] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[3] ZHAO D B, SHAO K, ZHU Y H, et al. Review of deep reinforcement learning and discussions on the development of computer Go[J]. Control Theory & Applications, 2016, 33(6): 701-717.
[4] WAN L P, LAN X G, ZHANG H B, et al. A review of deep reinforcement learning theory and application[J]. Pattern Recognition and Artificial Intelligence, 2019, 32(1): 67-81.
[5] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[6] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[7] SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-359.
[8] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[J]. arXiv preprint, 2019, arXiv:1912.06680.
[9] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[10] LIU Q, ZHAI J W, ZHANG Z Z, et al. A survey on deep reinforcement learning[J]. Chinese Journal of Computers, 2018, 41(1): 1-27.
[11] LIU J W, GAO F, LUO X L. Survey of deep reinforcement learning based on value function and policy gradient[J]. Chinese Journal of Computers, 2019, 42(6): 1406-1438.
[12] SUTTON R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44.
[13] WATKINS C J C H, DAYAN P. Q-learning[J]. Machine Learning, 1992, 8(3-4): 279-292.
[14] VAN HASSELT H, GUEZ A, SILVER D, et al. Deep reinforcement learning with double Q-learning[C]// The 30th AAAI Conference on Artificial Intelligence. [S.l.:s.n.], 2016.
[15] SCHAUL T, QUAN J, ANTONOGLOU I, et al. Prioritized experience replay[C]// The 4th International Conference on Learning Representations. [S.l.:s.n.], 2016.
[16] WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning[C]// The 33rd International Conference on Machine Learning. New York: ACM Press, 2016.
[17] NAIR A, SRINIVASAN P, BLACKWELL S, et al. Massively parallel methods for deep reinforcement learning[J]. arXiv preprint, 2015, arXiv:1507.04296.
[18] SILVER D, LEVER G, HEESS N, et al. Deterministic policy gradient algorithms[C]// The 31st International Conference on Machine Learning. New York: ACM Press, 2014.
[19] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[C]// The 4th International Conference on Learning Representations. [S.l.:s.n.], 2016.
[20] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]// The 33rd International Conference on Machine Learning. New York: ACM Press, 2016.
[21] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv preprint, 2017, arXiv:1707.06347.
[22] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[J]. arXiv preprint, 2018, arXiv:1801.01290.
[23] SHEN Y, HAN J P, LI L X, et al. AI in game intelligence—from multi-role game to parallel game[J]. Chinese Journal of Intelligent Science and Technology, 2020, 2(3): 205-213.
[24] BADIA A P, PIOT B, KAPTUROWSKI S, et al. Agent57: outperforming the Atari human benchmark[J]. arXiv preprint, 2020, arXiv:2003.13350.
[25] KEMPKA M, WYDMUCH M, RUNC G, et al. ViZDoom: a Doom-based AI research platform for visual reinforcement learning[C]// 2016 IEEE Conference on Computational Intelligence and Games (CIG). Piscataway: IEEE Press, 2016: 1-8.
[26] LAMPLE G, CHAPLOT D S. Playing FPS games with deep reinforcement learning[C]// The 31st AAAI Conference on Artificial Intelligence. [S.l.:s.n.], 2017.
[27] DOSOVITSKIY A, KOLTUN V. Learning to act by predicting the future[J]. arXiv preprint, 2016, arXiv:1611.01779.
[28] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]// The 34th International Conference on Machine Learning. New York: ACM Press, 2017.
[29] WU Y, ZHANG W, SONG K. Master-slave curriculum design for reinforcement learning[C]// The 27th International Joint Conference on Artificial Intelligence. New York: ACM Press, 2018: 1523-1529.
[30] VINYALS O, EWALDS T, BARTUNOV S, et al. StarCraft II: a new challenge for reinforcement learning[J]. arXiv preprint, 2017, arXiv:1708.04782.
[31] ZAMBALDI V, RAPOSO D, SANTORO A, et al. Relational deep reinforcement learning[J]. arXiv preprint, 2018, arXiv:1806.01830.
[32] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems. New York: ACM Press, 2017: 5998-6008.
[33] RASHID T, SAMVELYAN M, DE WITT C S, et al. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning[J]. arXiv preprint, 2018, arXiv:1803.11485.
[34] YE D, LIU Z, SUN M, et al. Mastering complex control in MOBA games with deep reinforcement learning[C]// The 34th AAAI Conference on Artificial Intelligence. [S.l.:s.n.], 2020: 6672-6679.
[35] OH J, CHOCKALINGAM V, SINGH S, et al. Control of memory, active perception, and action in Minecraft[C]// The 33rd International Conference on Machine Learning. New York: ACM Press, 2016.
[36] JADERBERG M, MNIH V, CZARNECKI W M, et al. Reinforcement learning with unsupervised auxiliary tasks[J]. arXiv preprint, 2016, arXiv:1611.05397.
[37] MIROWSKI P, PASCANU R, VIOLA F, et al. Learning to navigate in complex environments[J]. arXiv preprint, 2016, arXiv:1611.03673.
[38] WANG Y, HE H, SUN C. Learning to navigate through complex dynamic environment with modular deep reinforcement learning[J]. IEEE Transactions on Games, 2018, 10(4): 400-412.
[39] SHI H, SHI L, XU M, et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots[J]. IEEE Transactions on Industrial Informatics, 2020, 16(4): 2393-2402.
[40] SAVINOV N, RAICHUK A, MARINIER R, et al. Episodic curiosity through reachability[C]// The 7th International Conference on Learning Representations. [S.l.:s.n.], 2019.
[41] ZHU Y, MOTTAGHI R, KOLVE E, et al. Target-driven visual navigation in indoor scenes using deep reinforcement learning[C]// 2017 IEEE International Conference on Robotics and Automation (ICRA). Piscataway: IEEE Press, 2017: 3357-3364.
[42] TAI L, LIU M. Towards cognitive exploration through deep reinforcement learning for mobile robots[J]. arXiv preprint, 2016, arXiv:1610.01733.
[43] TAI L, PAOLO G, LIU M. Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation[C]// 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Piscataway: IEEE Press, 2017: 31-36.
[44] WU Y, RAO Z, ZHANG W, et al. Exploring the task cooperation in multi-goal visual navigation[C]// The 28th International Joint Conference on Artificial Intelligence. [S.l.:s.n.], 2019: 609-615.
[45] ZHANG W, ZHANG Y, LIU N. Map-less navigation: a single DRL-based controller for robots with varied dimensions[J]. arXiv preprint, 2020, arXiv:2002.06320.
[46] MIROWSKI P, GRIMES M K, MALINOWSKI M, et al. Learning to navigate in cities without a map[C]// Advances in Neural Information Processing Systems. [S.l.:s.n.], 2018: 2419-2430.
[47] LI A, HU H, MIROWSKI P, et al. Cross-view policy learning for street navigation[C]// The IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 8100-8109.
[48] HERMANN K M, MALINOWSKI M, MIROWSKI P, et al. Learning to follow directions in Street View[C]// The 34th AAAI Conference on Artificial Intelligence. [S.l.:s.n.], 2020.
[49] CHANCÁN M, MILFORD M. CityLearn: diverse real-world environments for sample-efficient navigation policy learning[J]. arXiv preprint, 2020, arXiv:1910.04335.
[50] SUN C Y, MU C X. Important scientific problems of multi-agent deep reinforcement learning[J]. Acta Automatica Sinica, 2020, 46(7): 1301-1312.
[51] OROOJLOOYJADID A, HAJINEZHAD D. A review of cooperative multi-agent deep reinforcement learning[J]. arXiv preprint, 2019, arXiv:1908.03963.
[52] OMIDSHAFIEI S, PAZIS J, AMATO C, et al. Deep decentralized multi-task multi-agent reinforcement learning under partial observability[C]// The 34th International Conference on Machine Learning. New York: ACM Press, 2017.
[53] MATIGNON L, LAURENT G J, LE FORT-PIAT N. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams[C]// 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE Press, 2007: 64-69.
[54] FOERSTER J, NARDELLI N, FARQUHAR G, et al. Stabilising experience replay for deep multi-agent reinforcement learning[C]// The 34th International Conference on Machine Learning. New York: ACM Press, 2017.
[55] PALMER G, TUYLS K, BLOEMBERGEN D, et al. Lenient multi-agent deep reinforcement learning[C]// The 17th International Conference on Autonomous Agents and Multiagent Systems. New York: ACM Press, 2018.
[56] EVERETT R, ROBERTS S. Learning against non-stationary agents with opponent modelling and deep reinforcement learning[C]// 2018 AAAI Spring Symposium Series. [S.l.:s.n.], 2018.
[57] JIN Y, WEI S, YUAN J, et al. Stabilizing multi-agent deep reinforcement learning by implicitly estimating other agents' behaviors[C]// 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 3547-3551.
[58] LIU X, TAN Y. Attentive relational state representation in decentralized multiagent reinforcement learning[J]. IEEE Transactions on Cybernetics, 2020.
[59] GUPTA J K, EGOROV M, KOCHENDERFER M. Cooperative multi-agent control using deep reinforcement learning[C]// The 16th International Conference on Autonomous Agents and Multiagent Systems. Cham: Springer, 2017: 66-83.
[60] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]// Advances in Neural Information Processing Systems. New York: ACM Press, 2017: 6379-6390.
[61] FOERSTER J, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients[C]// The 32nd AAAI Conference on Artificial Intelligence. [S.l.:s.n.], 2018.
[62] SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning[J]. arXiv preprint, 2017, arXiv:1706.05296.
[63] MAO H, ZHANG Z, XIAO Z, et al. Modelling the dynamic joint policy of teammates with attention multi-agent DDPG[C]// The 18th International Conference on Autonomous Agents and Multiagent Systems. New York: ACM Press, 2019.
[64] IQBAL S, SHA F. Actor-attention-critic for multi-agent reinforcement learning[C]// International Conference on Machine Learning. [S.l.:s.n.], 2019: 2961-2970.
[65] FOERSTER J N, ASSAEL Y M, DE FREITAS N, et al. Learning to communicate with deep multi-agent reinforcement learning[C]// Advances in Neural Information Processing Systems. New York: ACM Press, 2016: 2137-2145.
[66] SUKHBAATAR S, SZLAM A, FERGUS R. Learning multiagent communication with backpropagation[C]// Advances in Neural Information Processing Systems. New York: ACM Press, 2016: 2244-2252.
[67] JIANG J, LU Z. Learning attentional communication for multi-agent cooperation[C]// Advances in Neural Information Processing Systems. New York: ACM Press, 2018: 7254-7264.
[68] KIM D, MOON S, HOSTALLERO D, et al. Learning to schedule communication in multi-agent reinforcement learning[C]// The 7th International Conference on Learning Representations. [S.l.:s.n.], 2019.
[69] DAS A, GERVET T, ROMOFF J, et al. TarMAC: targeted multi-agent communication[C]// The 36th International Conference on Machine Learning. [S.l.:s.n.], 2019.
[70] SHANI G, HECKERMAN D, BRAFMAN R I, et al. An MDP-based recommender system[J]. Journal of Machine Learning Research, 2005, 6(Sep): 1265-1295.
[71] ZHAO X, XIA L, TANG J, et al. Deep reinforcement learning for search, recommendation, and online advertising: a survey[J]. ACM SIGWEB Newsletter, 2019(Spring): 1-15.
[72] ZHAO X, XIA L, ZHANG L, et al. Deep reinforcement learning for page-wise recommendations[C]// The 12th ACM Conference on Recommender Systems. New York: ACM Press, 2018: 95-103.
[73] ZHENG G, ZHANG F, ZHENG Z, et al. DRN: a deep reinforcement learning framework for news recommendation[C]// The 2018 World Wide Web Conference. New York: ACM Press, 2018: 167-176.