Abstract
For Markov decision problems with continuous state space and discrete action space, a new gradient-descent reinforcement-learning algorithm is proposed that uses multi-layer feed-forward neural networks for value-function approximation. The algorithm adopts a nearly greedy, continuously differentiable action-selection policy based on the Boltzmann distribution, and approximates the optimal value function of the Markov decision process by minimizing the sum of squared Bellman residuals under a non-stationary action policy. The convergence of the algorithm and the performance of the resulting near-optimal policy are analyzed theoretically. Simulation studies on the Mountain Car learning-control problem further verify the learning efficiency and generalization performance of the algorithm.
To solve Markov decision problems with continuous state space and discrete action space, neural networks are commonly used as value function approximators. Since there are no teacher signals in reinforcement learning, gradient algorithms for neural networks in supervised learning cannot be applied directly. The existing direct reinforcement-learning algorithms based on neural networks are not gradient descent algorithms of any objective function, so their convergence is difficult to analyze and some divergence examples have been found. In previous work on residual gradient algorithms, the action policy is assumed to be stationary, so convergence cannot be guaranteed when the action policy is greedy with respect to the estimated value function, as it usually is. In this paper, a new gradient descent reinforcement-learning algorithm is proposed, in which multi-layer feed-forward neural networks are used as value function approximators. A nearly greedy and differentiable action policy with a Boltzmann probability distribution is employed in the new algorithm. The optimal value functions of Markov decision processes are approximated by minimizing Bellman residuals with non-stationary action policies. To derive incremental gradient learning rules, an upper-bound function of the Bellman residuals is employed as the objective function. The convergence of the proposed algorithm and the performance of the approximated optimal policy are analyzed theoretically. Simulation results on the learning control of the Mountain-Car problem illustrate the learning efficiency and generalization ability of the proposed algorithm.
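For clarity, the Boltzmann action-selection policy and the squared Bellman-residual criterion referred to in the abstract take the following standard forms; this is a generic sketch in which the temperature T, weights w, and notation are illustrative rather than the paper's own, and the paper's actual training objective is an upper-bound function of this residual:

\[
\pi_w(a \mid s) = \frac{\exp\bigl(Q_w(s,a)/T\bigr)}{\sum_{b}\exp\bigl(Q_w(s,b)/T\bigr)},
\qquad
J(w) = \sum_{s,a}\Bigl[\,r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\sum_{a'}\pi_w(a' \mid s')\,Q_w(s',a') - Q_w(s,a)\Bigr]^{2},
\]

where the near-greedy policy \(\pi_w\) approaches the greedy policy as \(T \to 0\), and minimizing \(J(w)\) by gradient descent drives the Bellman residual of the approximated value function toward zero.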
Source
Chinese Journal of Computers (《计算机学报》), 2003, No. 2, pp. 227-233 (7 pages)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60075020).