Abstract
Addressing the path-planning challenges faced by autonomous vehicles under sparse rewards and in high-dimensional continuous action spaces, this paper proposes TD3D, a deep reinforcement learning algorithm that integrates imitation learning. The method uses TD3 as its backbone and introduces a "student-teacher" training mechanism within the Actor-Critic framework: an independent expert experience pool is built from expert demonstration data, and the Critic and Actor networks receive mixed supervision from high-quality Q-values and action labels computed offline. A demonstration weight that adaptively decays over the course of training enables a smooth transition from imitation-dominated learning early on to exploration-dominated learning later. Experimental results on highway-merge and lane-change scenarios show that TD3D significantly outperforms both the original TD3 and a Behavior Cloning (BC) baseline in success rate, average episode reward, convergence speed, and policy stability, and that it is robust to the scale and source of the expert data. This work provides a practical engineering approach for training end-to-end autonomous driving policies under sparse reward conditions.
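The abstract describes a demonstration weight that decays as training progresses, blending a behavior-cloning term with the standard TD3 actor objective. The exact schedule and loss form are not given here; the following is a minimal sketch assuming an exponential decay schedule and a convex combination of the two actor losses (the function names `demo_weight` and `mixed_actor_loss` are illustrative, not from the paper).

```python
import math

def demo_weight(step, total_steps, lam0=1.0, lam_min=0.0):
    """Adaptively decaying demonstration weight: close to lam0 early in
    training (imitation-dominated), approaching lam_min later
    (exploration-dominated). Exponential decay is an assumed schedule."""
    decay = math.exp(-5.0 * step / total_steps)
    return lam_min + (lam0 - lam_min) * decay

def mixed_actor_loss(bc_loss, td3_actor_loss, step, total_steps):
    """Convex combination of the behavior-cloning loss (supervised on
    expert action labels) and the TD3 deterministic policy-gradient loss,
    weighted by the current demonstration weight."""
    lam = demo_weight(step, total_steps)
    return lam * bc_loss + (1.0 - lam) * td3_actor_loss
```

Under this sketch, training starts with the actor fully supervised by the expert pool and smoothly hands control over to the reinforcement-learning objective as the weight decays.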
Authors
YANG Lieben (Chongqing Electric Power College, Chongqing 400053, China)
Fund
Science and Technology Research Program of Chongqing Electric Power College, "End-to-End Autonomous Driving Based on Safety Constraints in Dynamic Environments" (D-KY202515).
Keywords
autonomous driving
path planning
reinforcement learning
imitation learning