A STOCHASTIC TRUST-REGION FRAMEWORK FOR POLICY OPTIMIZATION (Cited by: 1)

Abstract: In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization method for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to that policy. The trust region subproblem is constructed from a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme, which ensures that each step improves the model function and stays within the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and that global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game-playing tasks from OpenAI Gym.
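The acceptance test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `trust_region_step`, the thresholds `eta` and `eta_good`, the factors `shrink` and `expand`, and the choice to take the standard deviation over returns sampled under the current policy are all assumptions made for illustration. Only the idea of adding an empirical standard deviation to the predicted increase in the ratio comes from the abstract.

```python
import statistics


def trust_region_step(returns_old, returns_new, predicted_increase, radius,
                      eta=0.1, eta_good=0.75, shrink=0.5, expand=2.0):
    """Decide whether to accept a trial policy and update the radius.

    returns_old / returns_new: sampled total rewards under the current and
    the trial policy. All thresholds and factors are hypothetical values.
    """
    if predicted_increase <= 0:
        raise ValueError("predicted_increase must be positive")

    # Estimated actual improvement of the total expected reward.
    actual_increase = statistics.fmean(returns_new) - statistics.fmean(returns_old)

    # Assumption: the empirical standard deviation is taken over the
    # returns sampled under the current policy.
    sigma = statistics.stdev(returns_old)

    # The standard deviation is added to the predicted increase, which
    # makes the ratio more conservative when the estimates are noisy.
    ratio = actual_increase / (predicted_increase + sigma)

    accepted = ratio >= eta
    if not accepted:
        radius *= shrink      # poor agreement: shrink the trust region
    elif ratio >= eta_good:
        radius *= expand      # very good agreement: enlarge it
    return accepted, radius, ratio


accepted, new_radius, ratio = trust_region_step(
    returns_old=[1.0, 2.0, 3.0],
    returns_new=[3.0, 4.0, 5.0],
    predicted_increase=1.0,
    radius=0.1,
)
```

In this sketch a rejected trial point leaves the policy unchanged and only shrinks the radius, which is the usual pattern in trust-region methods; the rule used in the paper may differ in its details.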

Authors: Mingming Zhao, Yongfeng Li, Zaiwen Wen

Affiliation: Beijing International Center for Mathematical Research

Source: Journal of Computational Mathematics (Chinese title: 计算数学（英文）), indexed in SCIE and CSCD, 2022, Issue 6, pp. 1004-1030 (27 pages)

Funding: The computational results were obtained on GPUs supported by the National Engineering Laboratory for Big Data Analysis and Applications and the High-performance Computing Platform of Peking University.

Keywords: Deep reinforcement learning; Stochastic trust region method; Policy optimization; Global convergence; Entropy control

Classification: O24 [Science: Computational Mathematics]

Citation network

Co-cited documents (1)

1. Xiaoyu Wang, Ya-xiang Yuan. STOCHASTIC TRUST-REGION METHODS WITH TRUST-REGION RADIUS DEPENDING ON PROBABILISTIC MODELS [J]. Journal of Computational Mathematics, 2022, 40(2): 294-334. (Cited by: 2)

Citing documents (1)

1. Rulei Qi, Dan Xue, Jing Li, Yujia Zhai. AN ACCELERATED STOCHASTIC TRUST REGION METHOD FOR STOCHASTIC OPTIMIZATION [J]. Journal of Computational Mathematics, 2025, 43(5): 1169-1193.

