Expectation-maximization Policy Search with Parameter-based Exploration
(基于参数探索的期望最大化策略搜索)

Cited by: 4
Abstract: To reduce the large variance of gradient estimates caused by stochastic exploration, an expectation-maximization (EM) policy search method with parameter-based exploration is proposed. First, the policy is defined as a probability distribution over the parameters of a controller. Samples are then collected by sampling repeatedly and directly in the controller parameter space according to this distribution. Because the actions selected within each episode are deterministic, sampling from the defined policy yields low-variance samples, which in turn reduces the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound of the expected return. To reduce the time and cost of sampling, an importance sampling technique is used to reuse samples collected during earlier policy updates. Simulation results on two continuous-space control problems show that, compared with policy search reinforcement learning methods based on stochastic action exploration, the proposed method not only learns the optimal policy but also converges faster, and thus achieves better learning performance.
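
The abstract walks through the core loop of the method: draw controller parameters from a Gaussian, run the resulting deterministic controller for one episode, and update the Gaussian by maximizing a lower bound of the expected return. Below is a minimal Python sketch of that loop, assuming a diagonal Gaussian over the parameters of a linear controller and a reward-weighted closed-form M-step; the environment interface (`env.reset`, `env.step`), the `features` map, and the return-shifting step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def run_episode(theta, env, features, horizon=200):
    """Roll out one episode with a deterministic linear controller u = theta @ phi(s).

    The env/features interface and the linear controller form are illustrative
    assumptions; the method only requires that the controller be deterministic
    once theta is fixed.
    """
    s = env.reset()
    total_return = 0.0
    for _ in range(horizon):
        u = theta @ features(s)      # deterministic action: no per-step noise
        s, r, done = env.step(u)
        total_return += r
        if done:
            break
    return total_return


def em_policy_search(env, features, dim, n_iters=100, n_episodes=20, seed=0):
    """EM policy search with parameter-based exploration.

    The policy is a diagonal Gaussian over controller parameters; exploration
    happens once per episode when theta is drawn, and the hyperparameters
    (mu, sigma) are updated with a reward-weighted closed-form M-step.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)

    for _ in range(n_iters):
        # Sample controller parameters directly in parameter space.
        thetas = mu + sigma * rng.standard_normal((n_episodes, dim))
        returns = np.array([run_episode(th, env, features) for th in thetas])

        # E-step: turn returns into non-negative weights (shifting by the
        # minimum is one common choice; any monotone non-negative transform works).
        w = returns - returns.min() + 1e-8
        w /= w.sum()

        # M-step: maximize the lower bound of the expected return in closed
        # form -- reward-weighted mean and spread of the sampled parameters.
        mu = w @ thetas
        sigma = np.sqrt(w @ (thetas - mu) ** 2) + 1e-8

    return mu, sigma
```

Because exploration noise enters only once per episode, when theta is drawn, every rollout is generated by a deterministic controller; this is the source of the variance reduction claimed in the abstract.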
Source: Acta Automatica Sinica (《自动化学报》), 2012, No. 1, pp. 38-45 (8 pages). Indexed in EI, CSCD, and the Peking University Core Journal list.
Funding: Supported by the National Natural Science Foundation of China (60804022, 60974050, 61072094), the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0836, NCET-10-0765), and the Fok Ying Tung Education Foundation Fund for Young Teachers (121066).
Keywords: policy search, reinforcement learning, parameter space, exploration, expectation-maximization (EM), importance sampling
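
The abstract also mentions reusing samples collected under earlier policy parameters via importance sampling. A hedged sketch of how those weights could be computed for the diagonal-Gaussian parameter distribution assumed above is shown below; the helper name `importance_weights` and the way the weights are folded into the reward-weighted M-step are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def importance_weights(thetas, mu_old, sigma_old, mu_new, sigma_new):
    """Importance weights for parameter samples drawn under the previous
    Gaussian hyperparameters (mu_old, sigma_old) when the current ones are
    (mu_new, sigma_new). The density factorizes over dimensions because the
    parameter distribution is assumed to be a diagonal Gaussian."""
    log_new = norm.logpdf(thetas, loc=mu_new, scale=sigma_new).sum(axis=1)
    log_old = norm.logpdf(thetas, loc=mu_old, scale=sigma_old).sum(axis=1)
    return np.exp(log_new - log_old)

# Reusing old samples in the reward-weighted M-step: weight each old (shifted,
# non-negative) return by its importance weight, renormalize, and update
# (mu, sigma) as before, so no new rollouts are needed for this update:
#
#   w = importance_weights(thetas_old, mu_old, sigma_old, mu, sigma) * shifted_returns_old
#   w /= w.sum()
#   mu = w @ thetas_old
#   sigma = np.sqrt(w @ (thetas_old - mu) ** 2)
```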