Abstract
For Markov decision problems with continuous state space and discrete action space, a new gradient-descent reinforcement-learning algorithm is proposed that uses multi-layer feed-forward neural networks for value-function approximation. The algorithm adopts a nearly greedy, continuously differentiable action-selection policy based on the Boltzmann distribution, and approximates the optimal value function of the Markov decision process by minimizing the sum of squared Bellman residuals under a non-stationary action policy. The convergence of the algorithm and the performance of the resulting near-optimal policy are analyzed theoretically. Simulation studies on the Mountain Car learning-control problem further verify the learning efficiency and generalization performance of the algorithm.
To solve Markov decision problems with continuous state space and discrete action space, neural networks are commonly used as value function approximators. Since there are no teacher signals in reinforcement learning, gradient algorithms for neural networks in supervised learning cannot be applied directly. The existing direct reinforcement-learning algorithms based on neural networks are not gradient descent algorithms of any objective function, so their convergence is difficult to analyze and some divergence examples have been found. In previous work on residual gradient algorithms, the action policy is assumed to be stationary, so convergence cannot be guaranteed when the action policy is greedy with respect to the estimated value function, as it usually is. In this paper, a new gradient descent reinforcement-learning algorithm is proposed, in which multi-layer feed-forward neural networks are used as value function approximators. A nearly greedy and differentiable action policy with a Boltzmann probability distribution is employed in the new algorithm. The optimal value functions of Markov decision processes are approximated by minimizing Bellman residuals with non-stationary action policies. To derive incremental gradient learning rules, an upper-bound function of the Bellman residuals is employed as the objective function. The convergence of the proposed algorithm and the performance of the approximated optimal policy are analyzed theoretically. Simulation results on the learning control of the Mountain-Car problem illustrate the learning efficiency and generalization ability of the proposed algorithm.
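For clarity, the Boltzmann action-selection policy and the squared Bellman-residual criterion referred to in the abstract take the following standard forms; this is a generic sketch in which the temperature T, weights w, and notation are illustrative rather than the paper's own, and the paper's actual training objective is an upper-bound function of this residual:

\[
\pi_w(a \mid s) = \frac{\exp\bigl(Q_w(s,a)/T\bigr)}{\sum_{b}\exp\bigl(Q_w(s,b)/T\bigr)},
\qquad
J(w) = \sum_{s,a}\Bigl[\,r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\sum_{a'}\pi_w(a' \mid s')\,Q_w(s',a') - Q_w(s,a)\Bigr]^{2},
\]

where the near-greedy policy \(\pi_w\) approaches the greedy policy as \(T \to 0\), and minimizing \(J(w)\) by gradient descent drives the Bellman residual of the approximated value function toward zero.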
Source
Chinese Journal of Computers (《计算机学报》), 2003, No. 2, pp. 227-233 (7 pages)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60075020).