Abstract
To address the problem that the mean-field-based multi-agent reinforcement learning algorithm M³-UCRL has insufficient awareness of other agents' dynamics when the number of agents is large, which leads to low exploration efficiency, this paper introduces a multi-head attention mechanism into M³-UCRL and uses a Decision Transformer to reduce the error accumulation of long-horizon reward propagation, yielding a multi-agent reinforcement learning algorithm based on multi-head attention (M⁴-UCRL). The algorithm formulates the reinforcement learning decision problem as a sequence modeling problem and uses the autoregressive generation capability of the Transformer to predict the optimal action sequence, improving decision accuracy. The algorithm is validated in an unmanned-swarm cooperative navigation experiment. The experimental results show that M⁴-UCRL effectively enhances each agent's awareness of other agents' dynamics and improves exploration efficiency; compared with M³-UCRL, which uses a multilayer perceptron as the policy network, M⁴-UCRL achieves higher performance.
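The mechanism named in the abstract lets each agent weigh the states of the other agents through several attention heads in parallel. The following NumPy sketch is purely illustrative and not the paper's implementation: the embedding size, head count, and random projection matrices are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n_agents, d_model) -- each row embeds one agent's state.
    Splits d_model into num_heads subspaces, attends within each,
    then concatenates the heads and projects back."""
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (num_heads, n_agents, dh) so each head attends independently.
    Q = Q.reshape(n, num_heads, dh).transpose(1, 0, 2)
    K = K.reshape(n, num_heads, dh).transpose(1, 0, 2)
    V = V.reshape(n, num_heads, dh).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    A = softmax(scores, axis=-1)                       # attention weights
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)     # concatenate heads
    return out @ Wo, A

# Hypothetical sizes: 5 agents, 8-dim state embeddings, 2 heads.
rng = np.random.default_rng(0)
d, heads, n_agents = 8, 2, 5
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
X = rng.standard_normal((n_agents, d))
out, A = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=heads)
print(out.shape)  # (5, 8): one attention-mixed embedding per agent
```

Each row of `A[h]` sums to 1, so head `h` gives agent `i` a normalized weighting over all agents' states; this per-agent mixing is what lets attention scale the "awareness of other agents" that the abstract contrasts with a plain multilayer perceptron.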
Authors
SHAN Guoqiang; LI Dapeng (Portland Institute, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source
Wireless Communication Technology, 2025, No. 4, pp. 44-50 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 62371245).