
The Optimal Reward Baseline for Policy-Gradient Reinforcement Learning
(策略梯度强化学习中的最优回报基线)
Cited by: 6
Abstract: Although policy-gradient reinforcement learning (PGRL) has good convergence properties, the variance of the gradient estimates produced by existing PGRL algorithms is usually large, which is a major weakness of the approach both in theory and in practice. To reduce this variance, the paper proposes a new policy-gradient algorithm, Istate Grbp, which extends the Istate GPOMDP algorithm by introducing a reward baseline into the gradient estimate. It is proved that adding a reward baseline to Istate GPOMDP does not change the expected value of the gradient estimate (i.e., it introduces no bias), and the optimal reward baseline that minimizes the variance is derived; it turns out to be the average of the observed rewards. Experimental results on a typical POMDP problem show that Istate Grbp yields much smaller gradient-estimation variance than Istate GPOMDP and improves both learning efficiency and convergence speed.
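For orientation, the identity behind the unbiasedness claim is that the expectation of ∇_θ log π_θ(a | y) over actions is zero for any observation y, so subtracting a constant baseline b from the rewards leaves the expected gradient estimate unchanged and only affects its variance. The sketch below is a minimal, assumption-laden illustration of that idea, not the paper's implementation: it shows a GPOMDP-style gradient estimate with a constant reward baseline, omits the internal-state (I-state) machinery of Istate GPOMDP, and the function name, trajectory format, and default baseline (the mean observed reward, as the abstract indicates) are choices made here for the example.

```python
import numpy as np

def gpomdp_gradient(trajectories, beta=0.95, baseline=None):
    """Illustrative GPOMDP-style policy-gradient estimate with a reward baseline.

    trajectories : list of trajectories; each trajectory is a list of
        (grad_log_prob, reward) pairs, where grad_log_prob is the gradient
        of log pi_theta(a_t | y_t) as a 1-D numpy array.
    beta : discount factor for the eligibility trace, as in GPOMDP.
    baseline : constant b subtracted from every reward; if None, the average
        of all observed rewards is used.
    """
    rewards = [r for traj in trajectories for (_, r) in traj]
    if baseline is None:
        baseline = float(np.mean(rewards))

    dim = len(trajectories[0][0][0])      # dimensionality of the policy parameters
    grad = np.zeros(dim)
    steps = 0
    for traj in trajectories:
        z = np.zeros(dim)                 # eligibility trace z_t
        for grad_log_prob, reward in traj:
            z = beta * z + np.asarray(grad_log_prob)
            # Subtracting a constant baseline does not change the expectation
            # of this term (E[z_t] = 0 because E[grad log pi] = 0 for every
            # observation), but it can substantially reduce the variance.
            grad += (reward - baseline) * z
            steps += 1
    return grad / steps
```

With baseline=None the estimator falls back to the average observed reward, which the abstract identifies as the variance-minimizing constant baseline for this class of estimators.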
Source: Chinese Journal of Computers (《计算机学报》), 2005, No. 6, pp. 1021-1026 (6 pages); indexed in EI and CSCD, Peking University core journal list.
Funding: Supported by a Key Program grant of the National Natural Science Foundation of China (60234030) and a Young Scientists Fund grant (60303012).
Keywords: reinforcement learning; policy gradient; partially observable Markov decision process; reward baseline

