Abstract
To address the problem that the mean-field-based multi-agent reinforcement learning algorithm M³-UCRL has insufficient awareness of other agents' dynamics when the number of agents is large, which leads to low exploration efficiency, this paper introduces a multi-head attention mechanism into M³-UCRL and uses a Decision Transformer to reduce the error accumulation of long-horizon reward propagation, yielding a multi-agent reinforcement learning algorithm based on multi-head attention (M⁴-UCRL). The algorithm formulates the reinforcement learning decision problem as a sequence modeling problem and uses the autoregressive generation capability of the Transformer to predict the optimal action sequence, improving decision accuracy. The algorithm is validated in an unmanned-swarm cooperative navigation experiment. The experimental results show that M⁴-UCRL effectively enhances each agent's awareness of other agents' dynamics and improves exploration efficiency; compared with M³-UCRL, which uses a multilayer perceptron as the policy network, M⁴-UCRL achieves higher performance.
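The mechanism named in the abstract lets each agent weigh the states of the other agents through several attention heads in parallel. The following NumPy sketch is purely illustrative and not the paper's implementation: the embedding size, head count, and random projection matrices are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n_agents, d_model) -- each row embeds one agent's state.
    Splits d_model into num_heads subspaces, attends within each,
    then concatenates the heads and projects back."""
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (num_heads, n_agents, dh) so each head attends independently.
    Q = Q.reshape(n, num_heads, dh).transpose(1, 0, 2)
    K = K.reshape(n, num_heads, dh).transpose(1, 0, 2)
    V = V.reshape(n, num_heads, dh).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    A = softmax(scores, axis=-1)                       # attention weights
    out = (A @ V).transpose(1, 0, 2).reshape(n, d)     # concatenate heads
    return out @ Wo, A

# Hypothetical sizes: 5 agents, 8-dim state embeddings, 2 heads.
rng = np.random.default_rng(0)
d, heads, n_agents = 8, 2, 5
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
X = rng.standard_normal((n_agents, d))
out, A = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=heads)
print(out.shape)  # (5, 8): one attention-mixed embedding per agent
```

Each row of `A[h]` sums to 1, so head `h` gives agent `i` a normalized weighting over all agents' states; this per-agent mixing is what lets attention scale the "awareness of other agents" that the abstract contrasts with a plain multilayer perceptron.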
Authors
SHAN Guoqiang; LI Dapeng (Portland Institute, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source
Wireless Communication Technology, 2025, No. 4, pp. 44-50 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 62371245).