Journal on Communications ›› 2019, Vol. 40 ›› Issue (5): 108-116. doi: 10.11959/j.issn.1000-436x.2019122


Advantage estimator based on importance sampling

Quan LIU1,2,3,4, Yubin JIANG1, Zhihui HU1

  1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China
  • Revised: 2019-04-25  Online: 2019-05-25  Published: 2019-05-30
  • Supported by:
    The National Natural Science Foundation of China (61772355); The National Natural Science Foundation of China (61702055); The National Natural Science Foundation of China (61472262); The National Natural Science Foundation of China (61502323); The National Natural Science Foundation of China (61502329); Jiangsu Province Natural Science Research University Major Projects (18KJA520011); Jiangsu Province Natural Science Research University Major Projects (17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K18); Suzhou Industrial Application of Basic Research Program Part (SYG201422)

Abstract:

In continuous action tasks, deep reinforcement learning usually uses a Gaussian distribution as the policy function. Aiming at the problem that convergence slows down when actions sampled from the Gaussian policy are clipped to the legal action range, an importance sampling advantage estimator (ISAE) was proposed. Based on the generalized advantage estimator, an importance sampling mechanism was introduced by computing the ratio of the target policy to the behavior policy for boundary actions, which improves the convergence speed of the algorithm and corrects the bias of the value function caused by action clipping. In addition, the L parameter was introduced by ISAE, which improves the reliability of samples and keeps the network parameters stable by limiting the range of the importance sampling ratio. To verify the effectiveness of ISAE, it was applied to proximal policy optimization (PPO) and compared with other algorithms on the MuJoCo platform. Experimental results show that ISAE achieves a faster convergence rate.
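
The abstract describes ISAE as a generalized advantage estimator whose terms are weighted by a clipped importance sampling ratio between the target policy and the behavior policy, with the L parameter bounding that ratio. The Python sketch below illustrates one way such an estimator could be computed; the function name compute_isae, the placement of the ratio inside the GAE recursion, and the clipping interval [1/L, L] are assumptions made for illustration and are not taken from the paper.

    import numpy as np

    def compute_isae(rewards, values, log_prob_target, log_prob_behavior,
                     gamma=0.99, lam=0.95, L=2.0):
        """Illustrative importance-sampling advantage estimator (not the paper's exact formula).

        rewards           : array of length T
        values            : array of length T + 1 (includes bootstrap value of the final state)
        log_prob_target   : log pi_target(a_t | s_t), length T
        log_prob_behavior : log pi_behavior(a_t | s_t), length T
        L                 : assumed bound on the importance sampling ratio, clipped to [1/L, L]
        """
        T = len(rewards)
        # Importance sampling ratio between target and behavior policies, clipped by L.
        rho = np.clip(np.exp(np.asarray(log_prob_target) - np.asarray(log_prob_behavior)),
                      1.0 / L, L)
        advantages = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            # One-step TD error, as in generalized advantage estimation.
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            # Weight the TD error by the clipped ratio before accumulating the GAE trace.
            gae = rho[t] * delta + gamma * lam * gae
            advantages[t] = gae
        return advantages

The clipping step mirrors the role the abstract assigns to the L parameter: samples whose target/behavior probability ratio drifts too far are down-weighted rather than allowed to destabilize the update.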

Key words: reinforcement learning, importance sampling, deep reinforcement learning, advantage function

