
Off-policy linear temporal difference learning algorithms with a generalized oblique projection (cited by: 4)
Abstract  In reinforcement learning with linear value-function approximation, both the temporal difference (TD) fixed-point solution and the Bellman residual solution can be viewed as oblique projections of the true value function, and neither projection has been shown to be optimal. Off-policy algorithms of this kind are worth studying because they allow a better balance in the exploration-exploitation trade-off; in recent years, Sutton et al. proposed off-policy gradient temporal difference learning algorithms with good speed and convergence properties. The main contribution of this paper is a generalized oblique projection operator, obtained as a weighted average of the two projections, from which off-policy temporal difference learning algorithms with a generalized oblique projection are derived. Starting from a projection-based view of existing algorithms, the paper generalizes the oblique projection as the weighted sum of the TD fixed-point and Bellman residual projections, and, to obtain convergent algorithms in the off-policy setting, extends the generalized projection with objective functions based on the norm of the expected TD update, yielding two kinds of objective functions. Using stochastic gradient methods, two convergent off-policy linear residual temporal difference learning algorithms are derived. Their convergence is proven with the ordinary-differential-equation approach, which views the iterations as ordinary differential equations and establishes their stability. Experimental results on Baird's well-known off-policy counterexample, compared against related algorithms, demonstrate the correctness and effectiveness of the proposed algorithms; the influence of different weight parameters on performance is discussed at the end.
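The construction described in the abstract can be summarized as follows. This is a minimal sketch, not the paper's own notation: the symbols Φ, Π_TD, Π_BR, η, δ_t and φ_t are introduced here for illustration, and the NEU-style objective is one plausible form of an "expected TD update" criterion.

% Sketch (assumed notation): linear estimate V_theta = Phi * theta;
% Pi_TD and Pi_BR are the oblique projections associated with the
% TD fixed-point and Bellman-residual solutions; eta in [0,1] is the
% mixing weight of the generalized oblique projection.
\[
  \Pi_{\eta} \;=\; \eta\,\Pi_{\mathrm{TD}} \;+\; (1-\eta)\,\Pi_{\mathrm{BR}},
  \qquad
  V_{\eta} \;=\; \Pi_{\eta}\,V^{*}.
\]
% One objective of the "norm of the expected TD update" type, whose
% stochastic-gradient minimization gives a convergent off-policy
% residual-style algorithm (illustrative, not the paper's exact form):
\[
  J(\theta) \;=\;
  \mathbb{E}\!\left[\delta_t(\theta)\,\phi_t\right]^{\!\top}
  \mathbb{E}\!\left[\delta_t(\theta)\,\phi_t\right],
  \qquad
  \delta_t(\theta) \;=\; r_{t+1} + \gamma\,\phi_{t+1}^{\top}\theta - \phi_t^{\top}\theta .
\]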
Source  Journal of Nanjing University (Natural Science), 2017, Issue 6, pp. 1052-1062 (11 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding  National Natural Science Foundation of China (61403208); Open Project of the State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2016B04); Scientific Research Start-up Fund for Introduced Talents of Nanjing University of Posts and Telecommunications (NY214014).
Keywords  reinforcement learning, linear function approximation, oblique projection, off-policy, temporal difference learning