Funding: supported by the Key Project of the National Language Commission (No. ZDI145-110), the Key Laboratory Project (No. YYZN-2024-6), the China Disabled Persons' Federation Project (No. 2024CDPFAT-22), the National Natural Science Foundation of China (Nos. 62171042, 62102033, and U24A20331), the Project for the Construction and Support of High-Level Innovative Teams in Beijing Municipal Institutions (No. BPHR20220121), the Beijing Natural Science Foundation (Nos. 4232026 and 4242020), and the Projects of Beijing Union University (Nos. ZKZD202302 and ZK20202403).
Abstract: Multiagent reinforcement learning (MARL) has emerged as a prominent direction within reinforcement learning in recent years, demonstrating immense potential across many application scenarios. The reward function directs agents to explore their environments and make optimal decisions by establishing evaluation criteria and feedback mechanisms. Concurrently, cooperative objectives at the macro level provide a trajectory for agents' learning, ensuring alignment between individual behavioral strategies and overarching system goals. The interplay between reward structures and cooperative objectives not only bolsters the effectiveness of individual agents but also fosters interagent collaboration, offering both momentum and direction for the development of swarm intelligence and the harmonious operation of multiagent systems. This review examines methods for designing reward structures and optimizing cooperative objectives in MARL, along with the most recent advances in the field. It also reviews the simulation environments used in cooperative scenarios and discusses future trends and potential research directions, providing a forward-looking perspective and inspiration for subsequent research.
Funding: supported in part by the National Natural Science Foundation of China (62373380).
Abstract: The application of multiple unmanned aerial vehicles (UAVs) to pursue and capture unauthorized UAVs has emerged as a novel approach to ensuring the safety of urban airspace. However, pursuit UAVs must rely on their own sensors to proactively gather information about the unauthorized UAV. Considering the restricted sensing range of such sensors, this paper formulates a multi-UAV with limited visual field pursuit-evasion (MUV-PE) problem. Each pursuer has a visual field characterized by limited perception distance and viewing angle, potentially obstructed by buildings. Only when the unauthorized UAV, i.e., the evader, enters the visual field of a pursuer can its position be acquired. The objective of the pursuers is to capture the evader as quickly as possible without collision. To address this problem, we propose the normalizing flow actor with graph attention critic (NAGC) algorithm, a multi-agent reinforcement learning (MARL) approach. NAGC applies normalizing flows to increase the flexibility of the policy network, enabling each agent to sample actions from more intricate distributions rather than common parametric ones. To comprehend the spatial relationships among multiple UAVs and environmental obstacles simultaneously, NAGC integrates "obstacle-target" graph attention networks, which significantly aid pursuers in search and pursuit activities. Extensive experiments conducted in a high-precision simulator validate the promising performance of the NAGC algorithm.
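To make the normalizing-flow actor concrete, the sketch below (an illustrative assumption, not the authors' NAGC implementation) pushes a Gaussian base action distribution through a stack of planar flows (Rezende and Mohamed, 2015), so the policy can represent multi-modal, non-Gaussian action densities while remaining differentiable; all layer sizes and names are assumed for illustration.

```python
# Minimal sketch of a normalizing-flow policy head (assumed architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanarFlow(nn.Module):
    """One planar flow layer: z' = z + u * tanh(w.z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # Re-parameterize u so the transform stays invertible (u.w >= -1).
        wu = (self.w * self.u).sum()
        u_hat = self.u + (F.softplus(wu) - 1 - wu) * self.w / (self.w.norm() ** 2 + 1e-8)
        lin = z @ self.w + self.b                       # (batch,)
        f = z + u_hat * torch.tanh(lin).unsqueeze(-1)   # transformed sample
        # log|det Jacobian| = log|1 + u_hat . psi|, psi = (1 - tanh^2) * w
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ u_hat) + 1e-8)
        return f, log_det

class FlowPolicy(nn.Module):
    """Gaussian base action distribution refined by a stack of planar flows."""
    def __init__(self, obs_dim, act_dim, n_flows=4):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, 2 * act_dim))
        self.flows = nn.ModuleList(PlanarFlow(act_dim) for _ in range(n_flows))

    def forward(self, obs):
        mu, log_std = self.base(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())
        z = dist.rsample()
        log_prob = dist.log_prob(z).sum(-1)
        for flow in self.flows:
            z, log_det = flow(z)
            log_prob = log_prob - log_det  # change-of-variables correction
        return z, log_prob                 # action and its log-density

policy = FlowPolicy(obs_dim=12, act_dim=3)
action, logp = policy(torch.randn(8, 12))  # batch of 8 observations
```

The log-density returned alongside each action is what a policy-gradient or soft actor-critic style objective would consume, which is why the flow layers must track their Jacobian corrections.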
Funding: supported in part by the Guangdong Provincial Key R&D Program under Grant No. 2019B111109002.
Abstract: Price-based and incentive-based demand response (DR) are both recognized as promising solutions to the increasing uncertainty of renewable energy sources (RES) in microgrids. However, because the optimization horizons of price-based and incentive-based DR differ in time scale, few existing methods consider their coordination. In this paper, a multi-agent deep reinforcement learning (MA-DRL) approach is proposed for temporally coordinated DR in microgrids. The proposed method enhances microgrid operation revenue by coordinating day-ahead price-based demand response (PBDR) and hourly direct load control (DLC). Operation at each time scale is decided by a separate DRL agent, and the agents are optimized by a multiagent deep deterministic policy gradient (MA-DDPG) with a shared critic that guides them toward a global objective. The effectiveness of the proposed approach is validated on a modified IEEE 33-bus distribution system and a modified heavily loaded 69-bus distribution system.
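The shared-critic coupling across time scales can be illustrated with a brief sketch (assumed network sizes and names, not the paper's exact architecture): a day-ahead PBDR actor and an hourly DLC actor are both updated against one centralized critic, so each actor's gradient reflects the common revenue objective.

```python
# Hedged sketch of two-timescale actors trained against one shared critic.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class SharedCritic(nn.Module):
    """Scores the joint state together with the joint action of both agents."""
    def __init__(self, state_dim, act_dims):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + sum(act_dims), 64),
                                 nn.ReLU(), nn.Linear(64, 1))
    def forward(self, state, acts):
        return self.net(torch.cat([state, *acts], dim=-1))

pbdr_actor, dlc_actor = Actor(10, 2), Actor(6, 1)   # day-ahead / hourly agents
critic = SharedCritic(state_dim=16, act_dims=[2, 1])
actor_opt = torch.optim.Adam([*pbdr_actor.parameters(),
                              *dlc_actor.parameters()], lr=1e-3)

# One deterministic-policy-gradient actor update on a sampled batch:
state = torch.randn(32, 16)
obs_pbdr, obs_dlc = state[:, :10], state[:, 10:]
acts = [pbdr_actor(obs_pbdr), dlc_actor(obs_dlc)]
actor_loss = -critic(state, acts).mean()  # ascend shared Q => global objective
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```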
Funding: supported by the Hubei Provincial Technology Innovation Special Project and the Natural Science Foundation of Hubei Province under Grants 2023BEB024 and 2024AFC066, respectively.
Abstract: The increasing adoption of unmanned aerial vehicles (UAVs) in urban low-altitude logistics systems, particularly for time-sensitive applications such as parcel delivery and supply distribution, necessitates sophisticated coordination mechanisms to optimize operational efficiency. However, the limited capability of UAVs to extract state-action information in complex environments poses significant challenges to effective cooperation in dynamic and uncertain scenarios. To address this, we present an Improved Multi-Agent Hybrid Attention Critic (IMAHAC) framework that advances multi-agent deep reinforcement learning (MADRL) through two key innovations. First, a Temporal Difference Error and Time-based Prioritized Experience Replay (TT-PER) mechanism dynamically adjusts sample weights based on temporal relevance and prediction-error magnitude, effectively reducing interference from obsolete collaborative experiences while maintaining training stability. Second, a hybrid attention mechanism is developed, integrating a sensor fusion layer, which aggregates features from multi-sensor data to enhance decision-making, and a dissimilarity layer, which evaluates the similarity between key-value pairs and query values. By combining this hybrid attention mechanism with the Multi-Actor Attention Critic (MAAC) framework, our approach strengthens UAVs' capability to extract critical state-action features in diverse environments. Comprehensive simulations in urban air mobility scenarios demonstrate IMAHAC's superiority over conventional MADRL baselines and MAAC, achieving higher cumulative rewards, fewer collisions, and enhanced cooperative capabilities. This work provides both algorithmic advances and empirical validation for developing robust autonomous aerial systems in smart-city infrastructures.
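A minimal sketch of the TT-PER idea follows: priority rises with TD-error magnitude but decays with the age of a transition, so stale collaborative experience is replayed less often. The specific weighting below (exponential age decay, alpha/beta exponents) is an illustrative assumption, not the paper's formula.

```python
# Hedged sketch of TD-error- and time-weighted prioritized replay.
import numpy as np

def tt_per_priority(td_errors, insert_steps, current_step,
                    alpha=0.6, decay=1e-4, eps=1e-6):
    """Priority grows with |TD error| and decays exponentially with sample age."""
    age = current_step - np.asarray(insert_steps, dtype=float)
    return (np.abs(td_errors) + eps) ** alpha * np.exp(-decay * age)

def sample_batch(priorities, batch_size, beta=0.4,
                 rng=np.random.default_rng(0)):
    probs = priorities / priorities.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform replay.
    weights = (len(priorities) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()

td = np.array([0.5, 0.1, 2.0, 0.05])
prio = tt_per_priority(td, insert_steps=[0, 900, 950, 999], current_step=1000)
idx, w = sample_batch(prio, batch_size=2)
```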
Funding: This work was partially supported by the National Key Research and Development Program of China (No. 2020YFB1005900), the National Natural Science Foundation of China (Nos. 62102232, 62122042, and 61971269), and the Natural Science Foundation of Shandong Province (No. ZR2021QF064).
Abstract: Most blockchain systems currently adopt resource-consuming protocols to achieve consensus among miners, such as the Proof-of-Work (PoW) and Practical Byzantine Fault Tolerant (PBFT) schemes, which consume substantial computing/communication resources and usually require reliable communication with bounded delay. However, these protocols may be unsuitable for Internet of Things (IoT) networks because IoT devices are usually lightweight, battery-operated, and deployed in unreliable wireless environments. Therefore, this paper studies an efficient consensus protocol for blockchain in IoT networks via reinforcement learning. Specifically, the consensus protocol is designed on the basis of the Proof-of-Communication (PoC) scheme directly in a single-hop wireless network with unreliable communication. A distributed Multi-Agent Reinforcement Learning (MARL) algorithm is proposed to improve the efficiency and fairness of consensus for miners in the blockchain system. In this algorithm, each agent uses a matrix to track the efficiency and fairness of recent consensus rounds and tunes its actions and rewards carefully in an actor-critic framework to seek effective performance. Simulation results show that the fairness of consensus in the proposed algorithm is guaranteed and that its efficiency nearly reaches that of a centralized optimal solution.
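The abstract does not specify how the efficiency/fairness bookkeeping is encoded, so the sketch below is one plausible reading under stated assumptions: each agent keeps a sliding window of round outcomes, derives an efficiency term (fraction of rounds reaching consensus) and a fairness term (Jain's index over per-miner wins), and mixes them into a shaped reward. The window length and mixing weight are hypothetical.

```python
# Hedged sketch of per-agent consensus bookkeeping and reward shaping.
import numpy as np
from collections import deque

class ConsensusTracker:
    def __init__(self, n_miners, window=100, fairness_weight=0.5):
        self.history = deque(maxlen=window)  # winner id per round, None = failed
        self.n_miners = n_miners
        self.w = fairness_weight

    def record(self, winner):
        self.history.append(winner)

    def efficiency(self):
        """Fraction of recent rounds that actually reached consensus."""
        return sum(x is not None for x in self.history) / max(len(self.history), 1)

    def fairness(self):
        """Jain's fairness index: 1.0 when all miners win equally often."""
        wins = np.bincount(np.array([x for x in self.history if x is not None],
                                    dtype=int), minlength=self.n_miners).astype(float)
        if wins.sum() == 0:
            return 1.0
        return wins.sum() ** 2 / (self.n_miners * (wins ** 2).sum())

    def reward(self):
        return (1 - self.w) * self.efficiency() + self.w * self.fairness()

tracker = ConsensusTracker(n_miners=4)
for winner in [0, 1, None, 2, 0, 3, 1]:
    tracker.record(winner)
print(tracker.efficiency(), tracker.fairness(), tracker.reward())
```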
Abstract: Traditional multi-agent deep reinforcement learning suffers from difficulty in obtaining rewards, slow convergence, and poor cooperation among agents in the pretraining period because of the large joint state space and the sparsity of action rewards. This paper therefore examines the role of demonstration data in multiagent systems and proposes a multi-agent deep reinforcement learning algorithm with adaptive-weight fusion of demonstration data. The algorithm sets the fusion weights according to demonstration performance and uses importance sampling to correct the bias in the mixed sampled data, combining expert data obtained in the simulation environment with a distributed multi-agent reinforcement learning algorithm. This eases the difficulty of global exploration and improves the convergence speed of the algorithm. Results in the RoboCup2D soccer simulation environment show that the algorithm improves the agents' ability to hold and shoot the ball, achieving a higher goal-scoring rate and faster convergence than demonstration policies and mainstream multi-agent reinforcement learning algorithms.
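One way to picture the fusion step is sketched below (assumptions throughout, not the paper's implementation): each training batch mixes the demonstration buffer and the agent's own buffer with a ratio tied to how much the demonstrations currently outperform the learner, and per-sample importance weights down-weight the off-policy demonstration transitions in the loss.

```python
# Hedged sketch of adaptive-weight fusion of demonstration data.
import numpy as np

rng = np.random.default_rng(0)

def fusion_ratio(demo_return, agent_return, temperature=1.0):
    """Share of the batch drawn from demonstrations; shrinks as the agent catches up."""
    gap = max(demo_return - agent_return, 0.0)
    return 1.0 - np.exp(-gap / temperature)  # in [0, 1)

def sample_fused_batch(demo_buffer, agent_buffer, batch_size, ratio):
    n_demo = int(round(batch_size * ratio))
    demo_idx = rng.choice(len(demo_buffer), size=n_demo)
    agent_idx = rng.choice(len(agent_buffer), size=batch_size - n_demo)
    batch = ([demo_buffer[i] for i in demo_idx]
             + [agent_buffer[i] for i in agent_idx])
    # Demonstration samples are off-policy; a proper implementation would
    # reweight them by pi(a|s)/pi_demo(a|s). A fixed stand-in constant is
    # used here purely for illustration.
    weights = np.array([0.5] * n_demo + [1.0] * (batch_size - n_demo))
    return batch, weights

demo_buffer = [("demo_transition", i) for i in range(500)]
agent_buffer = [("agent_transition", i) for i in range(2000)]
ratio = fusion_ratio(demo_return=12.0, agent_return=7.5)
batch, w = sample_fused_batch(demo_buffer, agent_buffer, batch_size=64, ratio=ratio)
```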
Funding: supported by the National Natural Science Foundation of China (No. 52131201) and the Key R&D Program of Shandong Province, China (No. 2023CXGC010111).
Abstract: The intelligence levels of autonomous vehicles should be thoroughly evaluated before deployment, but vehicle tests are difficult because of the heavy experimental resources and large number of cases they require, especially tests that include safety-critical scenarios. In this study, a new scenario generation method is proposed to accelerate testing, based on a multiagent reinforcement learning (MARL) framework incorporating the driving potential field (DPF). The framework trains some background vehicles to create high-risk and marginal scenes, with the DPF used to define the rewards of the adversarial background agents. Other background vehicles follow reasonable driving policies and serve as naturalistic agents that increase scenario diversity. The coexistence of naturalistic and adversarial agents enriches the experiences learned by the background vehicles, providing more marginal and risky scenarios for accelerating the test. The experimental results demonstrate that high-risk and marginal scenes are generated efficiently, with comprehensive assessment via a novel field-based dynamic risk evaluation method.
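The abstract does not give the DPF formula, so the sketch below is only a generic potential-field-style adversarial reward under stated assumptions: a background vehicle earns reward proportional to the risk "field" it induces at the vehicle under test, here modeled as a Gaussian of distance scaled by closing speed, with a penalty for unrecoverable collisions so generated scenes stay marginal rather than trivially crashing.

```python
# Hedged sketch of a potential-field-style adversarial reward (assumed form).
import numpy as np

def risk_field(bg_pos, bg_vel, ego_pos, ego_vel, amp=1.0, sigma=8.0):
    rel_pos = np.asarray(ego_pos, float) - np.asarray(bg_pos, float)
    d = np.linalg.norm(rel_pos)
    # Closing speed: positive when the background vehicle approaches the ego.
    rel_vel = np.asarray(ego_vel, float) - np.asarray(bg_vel, float)
    closing = max(-np.dot(rel_vel, rel_pos / (d + 1e-8)), 0.0)
    return amp * (1.0 + closing) * np.exp(-d ** 2 / (2 * sigma ** 2))

def adversarial_reward(bg_pos, bg_vel, ego_pos, ego_vel,
                       collided=False, collision_penalty=-10.0):
    if collided:
        return collision_penalty  # discourage outright crashes
    return risk_field(bg_pos, bg_vel, ego_pos, ego_vel)

r = adversarial_reward(bg_pos=[0, 0], bg_vel=[5, 0],
                       ego_pos=[10, 1], ego_vel=[3, 0])
```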
Funding: supported by the Young Elite Scientists Sponsorship Program by CAST (grant number: 2021QNRC001), the Natural Science Foundation of Heilongjiang Province of China (grant number: YQ2022F012), the Siyuan Alliance Open-ended Fund (grant number: HTKJ2022KL012003), and the Fundamental Research Funds for the Central Universities (grant number: HIT.OCEF.2023010).
Abstract: Space manipulators play an important role in on-orbit servicing and planetary surface operation. In the extreme environment of space, they are susceptible to a variety of unknown disturbances, and guaranteeing resilience under failure or disturbance is a core capability for their future development. Compared with traditional motion planning, learning-based motion planning has gradually become a research hotspot. However, regardless of the research approach, the robotic manipulator has so far been studied as a single independent agent, which limits its flexibility under conditions such as external force disturbance, observation noise, and mechanical failure. This paper therefore puts forward the idea of "discretizing the traditional single manipulator". Through an analysis of the joint relationships of a multi-degree-of-freedom single manipulator, different discretization forms are given, yielding a single-manipulator representation composed of multiple new subagents. To verify the ability of this multiagent representation to cope with interference, we adopt a centralized multiagent reinforcement learning framework and analyze in detail how the number of agents and the communication distance affect the learning-based planning results. In addition, by imposing joint-locking failures on the manipulator and introducing observation and action interference, we verify that the "multiagent robotic manipulator" obtained after discretization has stronger anti-disturbance resilience than the traditional single manipulator.
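The discretization idea can be pictured with a brief sketch (an illustrative assumption, not the paper's implementation): the joint indices of a single arm are partitioned into groups, each group becomes one subagent with its own action slice, and a locked (failed) joint is simply masked out of its owning agent's command rather than disabling the whole arm.

```python
# Hedged sketch of partitioning one manipulator into joint-group subagents.
import numpy as np

def partition_joints(n_joints, n_agents):
    """Split joint indices [0..n_joints) into n_agents contiguous groups."""
    return np.array_split(np.arange(n_joints), n_agents)

def apply_joint_actions(joint_angles, groups, agent_actions, locked=()):
    angles = joint_angles.copy()
    for group, action in zip(groups, agent_actions):
        for j, dq in zip(group, action):
            if j not in locked:       # failed joints ignore commands
                angles[j] += dq
    return angles

n_joints = 7
groups = partition_joints(n_joints, n_agents=3)    # [0,1,2], [3,4], [5,6]
angles = np.zeros(n_joints)
actions = [np.full(len(g), 0.05) for g in groups]  # each subagent's joint deltas
angles = apply_joint_actions(angles, groups, actions, locked={3})
```

Under this reading, a locking failure degrades only one subagent's action space, which is consistent with the stronger anti-disturbance resilience the abstract reports for the discretized representation.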