Chinese Journal of Network and Information Security ›› 2022, Vol. 8 ›› Issue (5): 66-74. doi: 10.11959/j.issn.2096-109x.2022064

• Topic: Big Data and Artificial Intelligence Security •

Robust reinforcement learning algorithm based on pigeon-inspired optimization

Mingying ZHANG1, Bing HUA2, Yuguang ZHANG1, Haidong LI1, Mohong ZHENG3

  1. China Electronics Standardization Institute, Beijing 100007, China
    2. College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
    3. The 7th Research Institute of China Electronics Technology Group Corporation, Guangzhou 510000, China
  • Revised: 2022-07-15  Online: 2022-10-15  Published: 2022-10-01
  • About the authors: Mingying ZHANG (1985- ), male, born in Beihai, Guangxi; senior engineer at the China Electronics Standardization Institute; research interests include artificial intelligence, knowledge graphs, and big data.
    Bing HUA (1978- ), female, born in Nanjing, Jiangsu; associate researcher at Nanjing University of Aeronautics and Astronautics; research interests include aircraft navigation and intelligent data processing.
    Yuguang ZHANG (1991- ), male, born in Baotou, Inner Mongolia; engineer at the China Electronics Standardization Institute; research interests include data security, AI security, personal information protection, computer vision, and visual generation.
    Haidong LI (1992- ), male, born in Xiaogan, Hubei; engineer at the China Electronics Standardization Institute; research interests include AI security, big data security, and personal information protection.
    Mohong ZHENG (1995- ), female, born in Chaoyang, Guangdong; assistant engineer at the 7th Research Institute of China Electronics Technology Group Corporation; research interests include spacecraft attitude control and UAV networking.
  • Supported by:
    Science and Technology Innovation 2030 Major Project (2020AAA0107804)

Abstract:

Reinforcement learning (RL) is an artificial intelligence approach with clear computational logic and easily extensible models. By interacting with the environment and maximizing a value function, RL can tune a policy's performance with little or no prior information, effectively reducing the complexity introduced by physical models. Policy-gradient RL algorithms have been successfully applied in fields such as intelligent image recognition, robot control, and path planning for autonomous driving. However, RL is highly sampling-dependent: training requires a large number of samples to converge, and decision accuracy is easily degraded by slight disturbances that do not match the simulation environment. In particular, when RL is applied to control, the convergence of the algorithm cannot be guaranteed, so its stability is difficult to prove; RL therefore needs to be improved. Swarm intelligence algorithms solve complex problems through group cooperation and are self-organizing and highly stable, so using them to optimize RL is an effective way to improve the stability of an RL model. The pigeon-inspired optimization (PIO) algorithm from swarm intelligence was combined with policy-gradient RL. To address the problem that the iterative solution of the policy gradient may fail to converge, an RL algorithm based on pigeon-inspired optimization was proposed, which solves the policy gradient so as to maximize long-term future reward. The fitness function of PIO was combined with RL to estimate the quality of candidate policies, preventing the solution process from falling into an infinite loop and improving the stability of the RL algorithm. The method was verified by simulation on a nonlinear two-wheel inverted-pendulum robot control system. The results show that the PIO-based RL algorithm improves the robustness of the system, lowers the computational cost, and reduces the algorithm's dependence on the sample database.
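Note: for context, the policy-gradient objective referred to above is conventionally written in the standard REINFORCE form below. This is a textbook convention, not a formula quoted from the paper; here $\pi_{\theta}$ is the parameterized policy, $r_k$ the per-step reward, $\gamma$ the discount factor, and $G_t$ the discounted return-to-go:

% Standard REINFORCE form of the policy gradient: ascend the gradient of
% the expected long-term reward J with respect to the policy parameters theta.
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[
      \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t
    \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k .

Rather than following this gradient estimate directly, the proposed method scores candidate policies with a fitness function and lets the pigeon flock search the parameter space.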
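To make the search loop concrete, the sketch below shows one plausible way to couple PIO with episodic return as the fitness, using the commonly cited two-operator PIO formulation (a map-and-compass phase followed by a landmark phase). The paper's actual environment, operators, and hyperparameters are not reproduced here, so everything in this snippet, including the toy pendulum dynamics, the linear policy form, and all constants, is an illustrative assumption:

# Minimal sketch (illustrative assumptions, not the paper's exact method):
# pigeon-inspired optimization searches linear policy weights by maximizing
# episodic return on a crude one-pole inverted-pendulum model.
import numpy as np

rng = np.random.default_rng(0)

def episode_return(w, steps=200, dt=0.02):
    """Fitness of a linear policy u = -(w0*theta + w1*omega):
    higher return = pole kept closer to upright with less effort."""
    theta, omega, total = 0.1, 0.0, 0.0
    for _ in range(steps):
        u = -(w[0] * theta + w[1] * omega)        # linear state-feedback policy
        omega += (9.8 * np.sin(theta) + u) * dt   # toy pendulum dynamics
        theta += omega * dt
        total -= theta ** 2 + 0.01 * u ** 2       # penalize tilt and effort
    return total

def pio_search(dim=2, n=30, t1=60, t2=30, R=0.2):
    """Two-phase PIO: map-and-compass operator, then landmark operator."""
    X = rng.normal(0.0, 1.0, (n, dim))            # pigeon positions = policy weights
    V = np.zeros((n, dim))
    fit = np.array([episode_return(x) for x in X])
    best, best_fit = X[fit.argmax()].copy(), fit.max()

    # Phase 1: map-and-compass -- velocities decay while pigeons drift
    # toward the current global-best position.
    for t in range(1, t1 + 1):
        V = V * np.exp(-R * t) + rng.random((n, dim)) * (best - X)
        X = X + V
        fit = np.array([episode_return(x) for x in X])
        if fit.max() > best_fit:
            best, best_fit = X[fit.argmax()].copy(), fit.max()

    # Phase 2: landmark -- halve the flock each round and pull survivors
    # toward a fitness-weighted center (weights shifted to stay positive,
    # since returns here can be negative).
    for _ in range(t2):
        order = fit.argsort()[::-1]
        keep = max(2, len(X) // 2)
        X, fit = X[order[:keep]], fit[order[:keep]]
        wts = fit - fit.min() + 1e-9
        center = (X * wts[:, None]).sum(axis=0) / wts.sum()
        X = X + rng.random((len(X), 1)) * (center - X)
        fit = np.array([episode_return(x) for x in X])
        if fit.max() > best_fit:
            best, best_fit = X[fit.argmax()].copy(), fit.max()
    return best, best_fit

if __name__ == "__main__":
    w, r = pio_search()
    print("best policy weights:", w, " return:", r)

Under these assumptions the flock converges toward feedback gains that hold the toy pendulum near upright, mirroring the abstract's claim that fitness-based selection sidesteps gradient iterations that may fail to converge.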

Key words: pigeon-inspired optimization algorithm, reinforcement learning, policy gradient, robustness

