Abstract
Although policy gradient reinforcement learning (PGRL) algorithms have good convergence properties, the large variance of the gradient estimate is a major weakness of the approach in both theory and practice. To reduce this variance, this paper proposes a new policy gradient algorithm with a reward baseline, Istate Grbp, which extends the Istate GPOMDP algorithm by introducing a reward baseline into the gradient estimate. It is proved that adding a reward baseline to Istate GPOMDP does not change the expectation of the policy gradient estimate, and the reward baseline that minimizes the variance is derived: it is the average of the observed rewards. Experimental results on a typical POMDP problem show that, compared with the existing algorithm, the proposed algorithm reduces the variance of the gradient estimate, improves learning efficiency, and speeds up the convergence of learning.
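To make the idea concrete, the following is a minimal Python sketch of a GPOMDP-style policy gradient estimator with a scalar reward baseline. It is not the authors' implementation; the linear softmax policy, the function names, and the toy trajectories are assumptions made only for illustration. Subtracting a constant baseline leaves the expected gradient unchanged (the expectation of the score function is zero under the policy), while a well-chosen baseline, such as the average observed reward, can reduce the variance of the estimate.

import numpy as np

# Minimal sketch (not the authors' code) of a GPOMDP-style gradient
# estimate with a scalar reward baseline, for a linear softmax policy.

def softmax_probs(theta, phi):
    # Action probabilities of a softmax policy with parameters
    # theta (num_actions x num_features) over feature vector phi.
    logits = theta @ phi
    logits = logits - logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def gpomdp_gradient(episodes, theta, beta=0.95, baseline=0.0):
    # episodes: list of trajectories, each a list of (phi, action, reward).
    # beta:     discount factor for the eligibility trace.
    # baseline: constant subtracted from every reward; it does not change
    #           the expected gradient, but a good choice (e.g. the average
    #           observed reward) reduces the variance of the estimate.
    grad = np.zeros_like(theta)
    for traj in episodes:
        z = np.zeros_like(theta)                  # eligibility trace
        for phi, a, r in traj:
            p = softmax_probs(theta, phi)
            dlogp = -np.outer(p, phi)             # grad of log pi(a | phi)
            dlogp[a] += phi
            z = beta * z + dlogp
            grad += (r - baseline) * z            # baseline-corrected update
    return grad / len(episodes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(2, 3))               # 2 actions, 3 features
    episodes = [[(rng.normal(size=3), int(rng.integers(2)), float(rng.normal()) + 1.0)
                 for _ in range(20)] for _ in range(50)]
    mean_reward = np.mean([r for traj in episodes for (_, _, r) in traj])
    print("no baseline:  ", gpomdp_gradient(episodes, theta))
    print("mean baseline:", gpomdp_gradient(episodes, theta, baseline=mean_reward))

Running the script prints two finite-sample estimates of the same expected gradient; the baseline only affects how much the estimate fluctuates across different samples of trajectories, which is the quantity the paper's optimal baseline is derived to minimize.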
Source
《计算机学报》
EI
CSCD
Peking University Core Journals (北大核心)
2005, No. 6, pp. 1021-1026 (6 pages)
Chinese Journal of Computers
Funding
Supported by the Key Program of the National Natural Science Foundation of China (60234030) and the NSFC Young Scientists Fund (60303012).
Keywords
reinforcement learning
policy gradient
partially observable Markov decision process
reward baseline