Opportunistic mobile crowdsensing(MCS)non-intrusively exploits human mobility trajectories,and the participants’smart devices as sensors have become promising paradigms for various urban data acquisition tasks.Howeve...Opportunistic mobile crowdsensing(MCS)non-intrusively exploits human mobility trajectories,and the participants’smart devices as sensors have become promising paradigms for various urban data acquisition tasks.However,in practice,opportunistic MCS has several challenges from both the perspectives of MCS participants and the data platform.On the one hand,participants face uncertainties in conducting MCS tasks,including their mobility and implicit interactions among participants,and participants’economic returns given by the MCS data platform are determined by not only their own actions but also other participants’strategic actions.On the other hand,the platform can only observe the participants’uploaded sensing data that depends on the unknown effort/action exerted by participants to the platform,while,for optimizing its overall objective,the platform needs to properly reward certain participants for incentivizing them to provide high-quality data.To address the challenge of balancing individual incentives and platform objectives in MCS,this paper proposes MARCS,an online sensing policy based on multi-agent deep reinforcement learning(MADRL)with centralized training and decentralized execution(CTDE).Specifically,the interactions between MCS participants and the data platform are modeled as a partially observable Markov game,where participants,acting as agents,use DRL-based policies to make decisions based on local observations,such as task trajectories and platform payments.To align individual and platform goals effectively,the platform leverages Shapley value to estimate the contribution of each participant’s sensed data,using these estimates as immediate rewards to guide agent training.The experimental results on real mobility trajectory datasets indicate that the revenue of MARCS reaches almost 35%,53%,and 100%higher than DDPG,Actor-Critic,and model predictive control(MPC)respectively on the participant side and similar results on the platform side,which show superior performance compared to baselines.展开更多
Cybertwin-enabled 6th Generation(6G)network is envisioned to support artificial intelligence-native management to meet changing demands of 6G applications.Multi-Agent Deep Reinforcement Learning(MADRL)technologies dri...Cybertwin-enabled 6th Generation(6G)network is envisioned to support artificial intelligence-native management to meet changing demands of 6G applications.Multi-Agent Deep Reinforcement Learning(MADRL)technologies driven by Cybertwins have been proposed for adaptive task offloading strategies.However,the existence of random transmission delay between Cybertwin-driven agents and underlying networks is not considered in related works,which destroys the standard Markov property and increases the decision reaction time to reduce the task offloading strategy performance.In order to address this problem,we propose a pipelining task offloading method to lower the decision reaction time and model it as a delay-aware Markov Decision Process(MDP).Then,we design a delay-aware MADRL algorithm to minimize the weighted sum of task execution latency and energy consumption.Firstly,the state space is augmented using the lastly-received state and historical actions to rebuild the Markov property.Secondly,Gate Transformer-XL is introduced to capture historical actions'importance and maintain the consistent input dimension dynamically changed due to random transmission delays.Thirdly,a sampling method and a new loss function with the difference between the current and target state value and the difference between real state-action value and augmented state-action value are designed to obtain state transition trajectories close to the real ones.Numerical results demonstrate that the proposed methods are effective in reducing reaction time and improving the task offloading performance in the random-delay Cybertwin-enabled 6G networks.展开更多
Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metavers...Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.展开更多
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem(DHHBFSP)designed to minimize the total tardiness and total energy consumption simultaneously,and proposes an improved pr...This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem(DHHBFSP)designed to minimize the total tardiness and total energy consumption simultaneously,and proposes an improved proximal policy optimization(IPPO)method to make real-time decisions for the DHHBFSP.A multi-objective Markov decision process is modeled for the DHHBFSP,where the reward function is represented by a vector with dynamic weights instead of the common objectiverelated scalar value.A factory agent(FA)is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve the decision quality.Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop.A two-stage training strategy is introduced in the IPPO,which learns from both single-and dual-policy data for better data utilization.The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization(PPO),dispatch rules,multi-objective metaheuristics,and multi-agent reinforcement learning methods.Extensive experimental results suggest that the proposed strategies offer significant improvements to the basic PPO,and the proposed IPPO outperforms the state-of-the-art scheduling methods in both convergence and solution quality.展开更多
Multi-agent reinforcement learning has recently been applied to solve pursuit problems.However,it suffers from a large number of time steps per training episode,thus always struggling to converge effectively,resulting...Multi-agent reinforcement learning has recently been applied to solve pursuit problems.However,it suffers from a large number of time steps per training episode,thus always struggling to converge effectively,resulting in low rewards and an inability for agents to learn strategies.This paper proposes a deep reinforcement learning(DRL)training method that employs an ensemble segmented multi-reward function design approach to address the convergence problem mentioned before.The ensemble reward function combines the advantages of two reward functions,which enhances the training effect of agents in long episode.Then,we eliminate the non-monotonic behavior in reward function introduced by the trigonometric functions in the traditional 2D polar coordinates observation representation.Experimental results demonstrate that this method outperforms the traditional single reward function mechanism in the pursuit scenario by enhancing agents’policy scores of the task.These ideas offer a solution to the convergence challenges faced by DRL models in long episode pursuit problems,leading to an improved model training performance.展开更多
Future unmanned battles desperately require intelli-gent combat policies,and multi-agent reinforcement learning offers a promising solution.However,due to the complexity of combat operations and large size of the comb...Future unmanned battles desperately require intelli-gent combat policies,and multi-agent reinforcement learning offers a promising solution.However,due to the complexity of combat operations and large size of the combat group,this task suffers from credit assignment problem more than other rein-forcement learning tasks.This study uses reward shaping to relieve the credit assignment problem and improve policy train-ing for the new generation of large-scale unmanned combat operations.We first prove that multiple reward shaping func-tions would not change the Nash Equilibrium in stochastic games,providing theoretical support for their use.According to the characteristics of combat operations,we propose tactical reward shaping(TRS)that comprises maneuver shaping advice and threat assessment-based attack shaping advice.Then,we investigate the effects of different types and combinations of shaping advice on combat policies through experiments.The results show that TRS improves both the efficiency and attack accuracy of combat policies,with the combination of maneuver reward shaping advice and ally-focused attack shaping advice achieving the best performance compared with that of the base-line strategy.展开更多
In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep rei...In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep reinforcement learning(MARL)method to automate the depth matching of multi-well logs.This method defines multiple top-down dual sliding windows based on the convolutional neural network(CNN)to extract and capture similar feature sequences on well logs,and it establishes an interaction mechanism between agents and the environment to control the depth matching process.Specifically,the agent selects an action to translate or scale the feature sequence based on the double deep Q-network(DDQN).Through the feedback of the reward signal,it evaluates the effectiveness of each action,aiming to obtain the optimal strategy and improve the accuracy of the matching task.Our experiments show that MARL can automatically perform depth matches for well-logs in multiple wells,and reduce manual intervention.In the application to the oil field,a comparative analysis of dynamic time warping(DTW),deep Q-learning network(DQN),and DDQN methods revealed that the DDQN algorithm,with its dual-network evaluation mechanism,significantly improves performance by identifying and aligning more details in the well log feature sequences,thus achieving higher depth matching accuracy.展开更多
To solve the problems of difficult control law design,poor portability,and poor stability of traditional multi-agent formation obstacle avoidance algorithms,a multi-agent formation obstacle avoidance method based on d...To solve the problems of difficult control law design,poor portability,and poor stability of traditional multi-agent formation obstacle avoidance algorithms,a multi-agent formation obstacle avoidance method based on deep reinforcement learning(DRL)is proposed.This method combines the perception ability of convolutional neural networks(CNNs)with the decision-making ability of reinforcement learning in a general form and realizes direct output control from the visual perception input of the environment to the action through an end-to-end learning method.The multi-agent system(MAS)model of the follow-leader formation method was designed with the wheelbarrow as the control object.An improved deep Q netwrok(DQN)algorithm(we improved its discount factor and learning efficiency and designed a reward value function that considers the distance relationship between the agent and the obstacle and the coordination factor between the multi-agents)was designed to achieve obstacle avoidance and collision avoidance in the process of multi-agent formation into the desired formation.The simulation results show that the proposed method achieves the expected goal of multi-agent formation obstacle avoidance and has stronger portability compared with the traditional algorithm.展开更多
Mobile-edge computing(MEC)is a promising technology for the fifth-generation(5G)and sixth-generation(6G)architectures,which provides resourceful computing capabilities for Internet of Things(IoT)devices,such as virtua...Mobile-edge computing(MEC)is a promising technology for the fifth-generation(5G)and sixth-generation(6G)architectures,which provides resourceful computing capabilities for Internet of Things(IoT)devices,such as virtual reality,mobile devices,and smart cities.In general,these IoT applications always bring higher energy consumption than traditional applications,which are usually energy-constrained.To provide persistent energy,many references have studied the offloading problem to save energy consumption.However,the dynamic environment dramatically increases the optimization difficulty of the offloading decision.In this paper,we aim to minimize the energy consumption of the entireMECsystemunder the latency constraint by fully considering the dynamic environment.UnderMarkov games,we propose amulti-agent deep reinforcement learning approach based on the bi-level actorcritic learning structure to jointly optimize the offloading decision and resource allocation,which can solve the combinatorial optimization problem using an asymmetric method and compute the Stackelberg equilibrium as a better convergence point than Nash equilibrium in terms of Pareto superiority.Our method can better adapt to a dynamic environment during the data transmission than the single-agent strategy and can effectively tackle the coordination problem in the multi-agent environment.The simulation results show that the proposed method could decrease the total computational overhead by 17.8%compared to the actor-critic-based method and reduce the total computational overhead by 31.3%,36.5%,and 44.7%compared with randomoffloading,all local execution,and all offloading execution,respectively.展开更多
Mobile CrowdSensing(MCS)is a promising sensing paradigm that recruits users to cooperatively perform sensing tasks.Recently,unmanned aerial vehicles(UAVs)as the powerful sensing devices are used to replace user partic...Mobile CrowdSensing(MCS)is a promising sensing paradigm that recruits users to cooperatively perform sensing tasks.Recently,unmanned aerial vehicles(UAVs)as the powerful sensing devices are used to replace user participation and carry out some special tasks,such as epidemic monitoring and earthquakes rescue.In this paper,we focus on scheduling UAVs to sense the task Point-of-Interests(PoIs)with different frequency coverage requirements.To accomplish the sensing task,the scheduling strategy needs to consider the coverage requirement,geographic fairness and energy charging simultaneously.We consider the complex interaction among UAVs and propose a grouping multi-agent deep reinforcement learning approach(G-MADDPG)to schedule UAVs distributively.G-MADDPG groups all UAVs into some teams by a distance-based clustering algorithm(DCA),then it regards each team as an agent.In this way,G-MADDPG solves the problem that the training time of traditional MADDPG is too long to converge when the number of UAVs is large,and the trade-off between training time and result accuracy could be controlled flexibly by adjusting the number of teams.Extensive simulation results show that our scheduling strategy has better performance compared with three baselines and is flexible in balancing training time and result accuracy.展开更多
To solve the problem of multi-target hunting by an unmanned surface vehicle(USV)fleet,a hunting algorithm based on multi-agent reinforcement learning is proposed.Firstly,the hunting environment and kinematic model wit...To solve the problem of multi-target hunting by an unmanned surface vehicle(USV)fleet,a hunting algorithm based on multi-agent reinforcement learning is proposed.Firstly,the hunting environment and kinematic model without boundary constraints are built,and the criteria for successful target capture are given.Then,the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process(Dec-POMDP),and a distributed partially observable multitarget hunting Proximal Policy Optimization(DPOMH-PPO)algorithm applicable to USVs is proposed.In addition,an observation model,a reward function and the action space applicable to multi-target hunting tasks are designed.To deal with the dynamic change of observational feature dimension input by partially observable systems,a feature embedding block is proposed.By combining the two feature compression methods of column-wise max pooling(CMP)and column-wise average-pooling(CAP),observational feature encoding is established.Finally,the centralized training and decentralized execution framework is adopted to complete the training of hunting strategy.Each USV in the fleet shares the same policy and perform actions independently.Simulation experiments have verified the effectiveness of the DPOMH-PPO algorithm in the test scenarios with different numbers of USVs.Moreover,the advantages of the proposed model are comprehensively analyzed from the aspects of algorithm performance,migration effect in task scenarios and self-organization capability after being damaged,the potential deployment and application of DPOMH-PPO in the real environment is verified.展开更多
Reinforcement Learning(RL)techniques are being studied to solve the Demand and Capacity Balancing(DCB)problems to fully exploit their computational performance.A locally gen-eralised Multi-Agent Reinforcement Learning...Reinforcement Learning(RL)techniques are being studied to solve the Demand and Capacity Balancing(DCB)problems to fully exploit their computational performance.A locally gen-eralised Multi-Agent Reinforcement Learning(MARL)for real-world DCB problems is proposed.The proposed method can deploy trained agents directly to unseen scenarios in a specific Air Traffic Flow Management(ATFM)region to quickly obtain a satisfactory solution.In this method,agents of all flights in a scenario form a multi-agent decision-making system based on partial observation.The trained agent with the customised neural network can be deployed directly on the corresponding flight,allowing it to solve the DCB problem jointly.A cooperation coefficient is introduced in the reward function,which is used to adjust the agent’s cooperation preference in a multi-agent system,thereby controlling the distribution of flight delay time allocation.A multi-iteration mechanism is designed for the DCB decision-making framework to deal with problems arising from non-stationarity in MARL and to ensure that all hotspots are eliminated.Experiments based on large-scale high-complexity real-world scenarios are conducted to verify the effectiveness and efficiency of the method.From a statis-tical point of view,it is proven that the proposed method is generalised within the scope of the flights and sectors of interest,and its optimisation performance outperforms the standard computer-assisted slot allocation and state-of-the-art RL-based DCB methods.The sensitivity analysis preliminarily reveals the effect of the cooperation coefficient on delay time allocation.展开更多
Hybrid electric vehicles(HEVs)have the advantages of lower emissions and less noise pollution than traditional fuel vehicles.Developing reasonable energy management strategies(EMSs)can effectively reduce fuel consumpt...Hybrid electric vehicles(HEVs)have the advantages of lower emissions and less noise pollution than traditional fuel vehicles.Developing reasonable energy management strategies(EMSs)can effectively reduce fuel consumption and improve the fuel economy of HEVs.However,current EMSs still have problems,such as complex multi-objective optimization and poor algorithm robustness.Herein,a multi-agent reinforcement learning(MADRL)framework is proposed based on Multi-Agent Deep Deterministic Policy Gradient(MADDPG)algorithm to solve such problems.Specifically,a vehicle model and dynamics model are established,and based on this,a multi-objective EMS is developed by considering fuel economy,maintaining the battery State of Charge(SOC),and reducing battery degradation.Secondly,the proposed strategy regards the engine and battery as two agents,and the agents cooperate with each other to realize optimal power distribution and achieve the optimal control strategy.Finally,the WLTC and HWFET driving cycles are employed to verify the performances of the proposed method,the fuel consumption decreases by 26.91%and 8.41%on average compared to the other strategies.The simulation results demonstrate that the proposed strategy has remarkable superiority in multi-objective optimization.展开更多
Unmanned Aerial Vehicles(UAVs)are useful in dangerous and dynamic tasks such as search-and-rescue,forest surveillance,and anti-terrorist operations.These tasks can be solved better through the collaboration of multipl...Unmanned Aerial Vehicles(UAVs)are useful in dangerous and dynamic tasks such as search-and-rescue,forest surveillance,and anti-terrorist operations.These tasks can be solved better through the collaboration of multiple UAVs under human supervision.However,it is still difficult for human to monitor,understand,predict and control the behaviors of the UAVs due to the task complexity as well as the black-box machine learning and planning algorithms being used.In this paper,the coactive design method is adopted to analyze the cognitive capabilities required for the tasks and design the interdependencies among the heterogeneous teammates of UAVs or human for coherent collaboration.Then,an agent-based task planner is proposed to automatically decompose a complex task into a sequence of explainable subtasks under constrains of resources,execution time,social rules and costs.Besides,a deep reinforcement learning approach is designed for the UAVs to learn optimal policies of a flocking behavior and a path planner that are easy for the human operator to understand and control.Finally,a mixed-initiative action selection mechanism is used to evaluate the learned policies as well as the human’s decisions.Experimental results demonstrate the effectiveness of the proposed methods.展开更多
With the construction of the power Internet of Things(IoT),communication between smart devices in urban distribution networks has been gradually moving towards high speed,high compatibility,and low latency,which provi...With the construction of the power Internet of Things(IoT),communication between smart devices in urban distribution networks has been gradually moving towards high speed,high compatibility,and low latency,which provides reliable support for reconfiguration optimization in urban distribution networks.Thus,this study proposed a deep reinforcement learning based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture to obtain a real-time optimal multi-level dynamic reconfiguration solution.First,the multi-level dynamic reconfiguration method was discussed,which included feeder-,transformer-,and substation-levels.Subsequently,the multi-agent system was combined with the cloud-edge collaboration architecture to build a deep reinforcement learning model for multi-level dynamic reconfiguration in an urban distribution network.The cloud-edge collaboration architecture can effectively support the multi-agent system to conduct“centralized training and decentralized execution”operation modes and improve the learning efficiency of the model.Thereafter,for a multi-agent system,this study adopted a combination of offline and online learning to endow the model with the ability to realize automatic optimization and updation of the strategy.In the offline learning phase,a Q-learning-based multi-agent conservative Q-learning(MACQL)algorithm was proposed to stabilize the learning results and reduce the risk of the next online learning phase.In the online learning phase,a multi-agent deep deterministic policy gradient(MADDPG)algorithm based on policy gradients was proposed to explore the action space and update the experience pool.Finally,the effectiveness of the proposed method was verified through a simulation analysis of a real-world 445-node system.展开更多
With the rapid advancement of deep reinforcement learning(DRL)in multi-agent systems,a variety of practical application challenges and solutions in the direction of multi-agent deep reinforcement learning(MADRL)are su...With the rapid advancement of deep reinforcement learning(DRL)in multi-agent systems,a variety of practical application challenges and solutions in the direction of multi-agent deep reinforcement learning(MADRL)are surfacing.Path planning in a collision-free environment is essential for many robots to do tasks quickly and efficiently,and path planning for multiple robots using deep reinforcement learning is a new research area in the field of robotics and artificial intelligence.In this paper,we sort out the training methods for multi-robot path planning,as well as summarize the practical applications in the field of DRL-based multi-robot path planning based on the methods;finally,we suggest possible research directions for researchers.展开更多
This paper investigates the use of multi-agent deep Q-network(MADQN)to address the curse of dimensionality issue occurred in the traditional multi-agent reinforcement learning(MARL)approach.The proposed MADQN is appli...This paper investigates the use of multi-agent deep Q-network(MADQN)to address the curse of dimensionality issue occurred in the traditional multi-agent reinforcement learning(MARL)approach.The proposed MADQN is applied to traffic light controllers at multiple intersections with busy traffic and traffic disruptions,particularly rainfall.MADQN is based on deep Q-network(DQN),which is an integration of the traditional reinforcement learning(RL)and the newly emerging deep learning(DL)approaches.MADQN enables traffic light controllers to learn,exchange knowledge with neighboring agents,and select optimal joint actions in a collaborative manner.A case study based on a real traffic network is conducted as part of a sustainable urban city project in the Sunway City of Kuala Lumpur in Malaysia.Investigation is also performed using a grid traffic network(GTN)to understand that the proposed scheme is effective in a traditional traffic network.Our proposed scheme is evaluated using two simulation tools,namely Matlab and Simulation of Urban Mobility(SUMO).Our proposed scheme has shown that the cumulative delay of vehicles can be reduced by up to 30%in the simulations.展开更多
The multi-agent path planning problem presents significant challenges in dynamic environments,primarily due to the ever-changing positions of obstacles and the complex interactions between agents’actions.These factor...The multi-agent path planning problem presents significant challenges in dynamic environments,primarily due to the ever-changing positions of obstacles and the complex interactions between agents’actions.These factors contribute to a tendency for the solution to converge slowly,and in some cases,diverge altogether.In addressing this issue,this paper introduces a novel approach utilizing a double dueling deep Q-network(D3QN),tailored for dynamic multi-agent environments.A novel reward function based on multi-agent positional constraints is designed,and a training strategy based on incremental learning is performed to achieve collaborative path planning of multiple agents.Moreover,the greedy and Boltzmann probability selection policy is introduced for action selection and avoiding convergence to local extremum.To match radar and image sensors,a convolutional neural network-long short-term memory(CNN-LSTM)architecture is constructed to extract the feature of multi-source measurement as the input of the D3QN.The algorithm’s efficacy and reliability are validated in a simulated environment,utilizing robot operating system and Gazebo.The simulation results show that the proposed algorithm provides a real-time solution for path planning tasks in dynamic scenarios.In terms of the average success rate and accuracy,the proposed method is superior to other deep learning algorithms,and the convergence speed is also improved.展开更多
Heating,ventilation,and air conditioning(HVAC)systems consume a significant amount of energy to maintain thermal comfort and indoor air quality in buildings,which results in high operational costs.Reinforcement learni...Heating,ventilation,and air conditioning(HVAC)systems consume a significant amount of energy to maintain thermal comfort and indoor air quality in buildings,which results in high operational costs.Reinforcement learning is an effective method for controlling HVAC systems.However,in large and complex HVAC systems,traditional reinforcement learning algorithms often face the challenges of slow training speed and poor convergence performance.This paper proposes a multi-objective optimization control method based on the multi-agent deep deterministic policy gradient(MADDPG)algorithm,which aims to minimize HVAC energy consumption while ensuring optimal thermal comfort and indoor air quality in each zone.Using a multi-zone office building with fan coil units and a dedicated outdoor air system as a case study,we developed an EnergyPlus-Python co-simulation platform.The proposed control method was employed during both the heating and cooling seasons to independently control the temperature setpoints and fresh airflow in different zones of the office building.The simulation results from both the heating and cooling seasons demonstrate that the MADDPG control method exhibits faster convergence during training and excellent learning capabilities,allowing it to adapt effectively to changes in environmental conditions and implement appropriate control actions.Under similar indoor thermal comfort and air quality conditions,the MADDPG control method consumes less energy than the traditional reinforcement learning method,it saves 24.1%of energy during the heating season and 8.9%during the cooling season compared to the rule-based control method.Additionally,by adjusting the reward function in the MADDPG algorithm,it is possible to flexibly balance energy consumption,thermal comfort,and air quality preferences,demonstrating the algorithm’s strong applicability.展开更多
Unmanned Aerial Vehicles(UAvs)as aerial base stations to provide communication services for ground users is a flexible and cost-effective paradigm in B5G.Besides,dynamic resource allocation and multi-connectivity can ...Unmanned Aerial Vehicles(UAvs)as aerial base stations to provide communication services for ground users is a flexible and cost-effective paradigm in B5G.Besides,dynamic resource allocation and multi-connectivity can be adopted to further harness the potentials of UAVs in improving communication capacity,in such situations such that the interference among users becomes a pivotal disincentive requiring effective solutions.To this end,we investigate the Joint UAV-User Association,Channel Allocation,and transmission Power Control(J-UACAPC)problem in a multi-connectivity-enabled UAV network with constrained backhaul links,where each UAV can determine the reusable channels and transmission power to serve the selected ground users.The goal was to mitigate co-channel interference while maximizing long-term system utility.The problem was modeled as a cooperative stochastic game with hybrid discrete-continuous action space.A Multi-Agent Hybrid Deep Reinforcement Learning(MAHDRL)algorithm was proposed to address this problem.Extensive simulation results demonstrated the effectiveness of the proposed algorithm and showed that it has a higher system utility than the baseline methods.展开更多
基金sponsored by Qinglan Project of Jiangsu Province,and Jiangsu Provincial Key Research and Development Program(No.BE2020084-1).
文摘Opportunistic mobile crowdsensing(MCS)non-intrusively exploits human mobility trajectories,and the participants’smart devices as sensors have become promising paradigms for various urban data acquisition tasks.However,in practice,opportunistic MCS has several challenges from both the perspectives of MCS participants and the data platform.On the one hand,participants face uncertainties in conducting MCS tasks,including their mobility and implicit interactions among participants,and participants’economic returns given by the MCS data platform are determined by not only their own actions but also other participants’strategic actions.On the other hand,the platform can only observe the participants’uploaded sensing data that depends on the unknown effort/action exerted by participants to the platform,while,for optimizing its overall objective,the platform needs to properly reward certain participants for incentivizing them to provide high-quality data.To address the challenge of balancing individual incentives and platform objectives in MCS,this paper proposes MARCS,an online sensing policy based on multi-agent deep reinforcement learning(MADRL)with centralized training and decentralized execution(CTDE).Specifically,the interactions between MCS participants and the data platform are modeled as a partially observable Markov game,where participants,acting as agents,use DRL-based policies to make decisions based on local observations,such as task trajectories and platform payments.To align individual and platform goals effectively,the platform leverages Shapley value to estimate the contribution of each participant’s sensed data,using these estimates as immediate rewards to guide agent training.The experimental results on real mobility trajectory datasets indicate that the revenue of MARCS reaches almost 35%,53%,and 100%higher than DDPG,Actor-Critic,and model predictive control(MPC)respectively on the participant side and similar results on the platform side,which show superior performance compared to baselines.
基金funded by the National Key Research and Development Program of China under Grant 2019YFB1803301Beijing Natural Science Foundation (L202002)。
文摘Cybertwin-enabled 6th Generation(6G)network is envisioned to support artificial intelligence-native management to meet changing demands of 6G applications.Multi-Agent Deep Reinforcement Learning(MADRL)technologies driven by Cybertwins have been proposed for adaptive task offloading strategies.However,the existence of random transmission delay between Cybertwin-driven agents and underlying networks is not considered in related works,which destroys the standard Markov property and increases the decision reaction time to reduce the task offloading strategy performance.In order to address this problem,we propose a pipelining task offloading method to lower the decision reaction time and model it as a delay-aware Markov Decision Process(MDP).Then,we design a delay-aware MADRL algorithm to minimize the weighted sum of task execution latency and energy consumption.Firstly,the state space is augmented using the lastly-received state and historical actions to rebuild the Markov property.Secondly,Gate Transformer-XL is introduced to capture historical actions'importance and maintain the consistent input dimension dynamically changed due to random transmission delays.Thirdly,a sampling method and a new loss function with the difference between the current and target state value and the difference between real state-action value and augmented state-action value are designed to obtain state transition trajectories close to the real ones.Numerical results demonstrate that the proposed methods are effective in reducing reaction time and improving the task offloading performance in the random-delay Cybertwin-enabled 6G networks.
基金supported in part by NSFC (62102099, U22A2054, 62101594)in part by the Pearl River Talent Recruitment Program (2021QN02S643)+9 种基金Guangzhou Basic Research Program (2023A04J1699)in part by the National Research Foundation, SingaporeInfocomm Media Development Authority under its Future Communications Research Development ProgrammeDSO National Laboratories under the AI Singapore Programme under AISG Award No AISG2-RP-2020-019Energy Research Test-Bed and Industry Partnership Funding Initiative, Energy Grid (EG) 2.0 programmeDesCartes and the Campus for Research Excellence and Technological Enterprise (CREATE) programmeMOE Tier 1 under Grant RG87/22in part by the Singapore University of Technology and Design (SUTD) (SRG-ISTD-2021- 165)in part by the SUTD-ZJU IDEA Grant SUTD-ZJU (VP) 202102in part by the Ministry of Education, Singapore, through its SUTD Kickstarter Initiative (SKI 20210204)。
文摘Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.
基金partially supported by the National Key Research and Development Program of the Ministry of Science and Technology of China(2022YFE0114200)the National Natural Science Foundation of China(U20A6004).
文摘This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem(DHHBFSP)designed to minimize the total tardiness and total energy consumption simultaneously,and proposes an improved proximal policy optimization(IPPO)method to make real-time decisions for the DHHBFSP.A multi-objective Markov decision process is modeled for the DHHBFSP,where the reward function is represented by a vector with dynamic weights instead of the common objectiverelated scalar value.A factory agent(FA)is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve the decision quality.Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop.A two-stage training strategy is introduced in the IPPO,which learns from both single-and dual-policy data for better data utilization.The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization(PPO),dispatch rules,multi-objective metaheuristics,and multi-agent reinforcement learning methods.Extensive experimental results suggest that the proposed strategies offer significant improvements to the basic PPO,and the proposed IPPO outperforms the state-of-the-art scheduling methods in both convergence and solution quality.
基金National Natural Science Foundation of China(Nos.61803260,61673262 and 61175028)。
文摘Multi-agent reinforcement learning has recently been applied to solve pursuit problems.However,it suffers from a large number of time steps per training episode,thus always struggling to converge effectively,resulting in low rewards and an inability for agents to learn strategies.This paper proposes a deep reinforcement learning(DRL)training method that employs an ensemble segmented multi-reward function design approach to address the convergence problem mentioned before.The ensemble reward function combines the advantages of two reward functions,which enhances the training effect of agents in long episode.Then,we eliminate the non-monotonic behavior in reward function introduced by the trigonometric functions in the traditional 2D polar coordinates observation representation.Experimental results demonstrate that this method outperforms the traditional single reward function mechanism in the pursuit scenario by enhancing agents’policy scores of the task.These ideas offer a solution to the convergence challenges faced by DRL models in long episode pursuit problems,leading to an improved model training performance.
文摘Future unmanned battles desperately require intelli-gent combat policies,and multi-agent reinforcement learning offers a promising solution.However,due to the complexity of combat operations and large size of the combat group,this task suffers from credit assignment problem more than other rein-forcement learning tasks.This study uses reward shaping to relieve the credit assignment problem and improve policy train-ing for the new generation of large-scale unmanned combat operations.We first prove that multiple reward shaping func-tions would not change the Nash Equilibrium in stochastic games,providing theoretical support for their use.According to the characteristics of combat operations,we propose tactical reward shaping(TRS)that comprises maneuver shaping advice and threat assessment-based attack shaping advice.Then,we investigate the effects of different types and combinations of shaping advice on combat policies through experiments.The results show that TRS improves both the efficiency and attack accuracy of combat policies,with the combination of maneuver reward shaping advice and ally-focused attack shaping advice achieving the best performance compared with that of the base-line strategy.
基金Supported by the China National Petroleum Corporation Limited-China University of Petroleum(Beijing)Strategic Cooperation Science and Technology Project(ZLZX2020-03).
文摘In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep reinforcement learning(MARL)method to automate the depth matching of multi-well logs.This method defines multiple top-down dual sliding windows based on the convolutional neural network(CNN)to extract and capture similar feature sequences on well logs,and it establishes an interaction mechanism between agents and the environment to control the depth matching process.Specifically,the agent selects an action to translate or scale the feature sequence based on the double deep Q-network(DDQN).Through the feedback of the reward signal,it evaluates the effectiveness of each action,aiming to obtain the optimal strategy and improve the accuracy of the matching task.Our experiments show that MARL can automatically perform depth matches for well-logs in multiple wells,and reduce manual intervention.In the application to the oil field,a comparative analysis of dynamic time warping(DTW),deep Q-learning network(DQN),and DDQN methods revealed that the DDQN algorithm,with its dual-network evaluation mechanism,significantly improves performance by identifying and aligning more details in the well log feature sequences,thus achieving higher depth matching accuracy.
基金the National Natural Science Foun-dation of China(No.61963006)the Nat-ural Science Foundation of Guangxi Province(Nos.2020GXNSFDA238011,2018GXNSFAA050029,and 2018GXNSFAA294085)。
文摘To solve the problems of difficult control law design,poor portability,and poor stability of traditional multi-agent formation obstacle avoidance algorithms,a multi-agent formation obstacle avoidance method based on deep reinforcement learning(DRL)is proposed.This method combines the perception ability of convolutional neural networks(CNNs)with the decision-making ability of reinforcement learning in a general form and realizes direct output control from the visual perception input of the environment to the action through an end-to-end learning method.The multi-agent system(MAS)model of the follow-leader formation method was designed with the wheelbarrow as the control object.An improved deep Q netwrok(DQN)algorithm(we improved its discount factor and learning efficiency and designed a reward value function that considers the distance relationship between the agent and the obstacle and the coordination factor between the multi-agents)was designed to achieve obstacle avoidance and collision avoidance in the process of multi-agent formation into the desired formation.The simulation results show that the proposed method achieves the expected goal of multi-agent formation obstacle avoidance and has stronger portability compared with the traditional algorithm.
基金supported by the National Natural Science Foundation of China(62162050)the Fundamental Research Funds for the Central Universities(No.N2217002)the Natural Science Foundation of Liaoning ProvincialDepartment of Science and Technology(No.2022-KF-11-04).
文摘Mobile-edge computing(MEC)is a promising technology for the fifth-generation(5G)and sixth-generation(6G)architectures,which provides resourceful computing capabilities for Internet of Things(IoT)devices,such as virtual reality,mobile devices,and smart cities.In general,these IoT applications always bring higher energy consumption than traditional applications,which are usually energy-constrained.To provide persistent energy,many references have studied the offloading problem to save energy consumption.However,the dynamic environment dramatically increases the optimization difficulty of the offloading decision.In this paper,we aim to minimize the energy consumption of the entireMECsystemunder the latency constraint by fully considering the dynamic environment.UnderMarkov games,we propose amulti-agent deep reinforcement learning approach based on the bi-level actorcritic learning structure to jointly optimize the offloading decision and resource allocation,which can solve the combinatorial optimization problem using an asymmetric method and compute the Stackelberg equilibrium as a better convergence point than Nash equilibrium in terms of Pareto superiority.Our method can better adapt to a dynamic environment during the data transmission than the single-agent strategy and can effectively tackle the coordination problem in the multi-agent environment.The simulation results show that the proposed method could decrease the total computational overhead by 17.8%compared to the actor-critic-based method and reduce the total computational overhead by 31.3%,36.5%,and 44.7%compared with randomoffloading,all local execution,and all offloading execution,respectively.
基金supported by the Innovation Capacity Construction Project of Jilin Development and Reform Commission(2020C017-2)Science and Technology Development Plan Project of Jilin Province(20210201082GX)。
文摘Mobile CrowdSensing(MCS)is a promising sensing paradigm that recruits users to cooperatively perform sensing tasks.Recently,unmanned aerial vehicles(UAVs)as the powerful sensing devices are used to replace user participation and carry out some special tasks,such as epidemic monitoring and earthquakes rescue.In this paper,we focus on scheduling UAVs to sense the task Point-of-Interests(PoIs)with different frequency coverage requirements.To accomplish the sensing task,the scheduling strategy needs to consider the coverage requirement,geographic fairness and energy charging simultaneously.We consider the complex interaction among UAVs and propose a grouping multi-agent deep reinforcement learning approach(G-MADDPG)to schedule UAVs distributively.G-MADDPG groups all UAVs into some teams by a distance-based clustering algorithm(DCA),then it regards each team as an agent.In this way,G-MADDPG solves the problem that the training time of traditional MADDPG is too long to converge when the number of UAVs is large,and the trade-off between training time and result accuracy could be controlled flexibly by adjusting the number of teams.Extensive simulation results show that our scheduling strategy has better performance compared with three baselines and is flexible in balancing training time and result accuracy.
基金financial support from National Natural Science Foundation of China(Grant No.61601491)Natural Science Foundation of Hubei Province,China(Grant No.2018CFC865)Military Research Project of China(-Grant No.YJ2020B117)。
文摘To solve the problem of multi-target hunting by an unmanned surface vehicle(USV)fleet,a hunting algorithm based on multi-agent reinforcement learning is proposed.Firstly,the hunting environment and kinematic model without boundary constraints are built,and the criteria for successful target capture are given.Then,the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process(Dec-POMDP),and a distributed partially observable multitarget hunting Proximal Policy Optimization(DPOMH-PPO)algorithm applicable to USVs is proposed.In addition,an observation model,a reward function and the action space applicable to multi-target hunting tasks are designed.To deal with the dynamic change of observational feature dimension input by partially observable systems,a feature embedding block is proposed.By combining the two feature compression methods of column-wise max pooling(CMP)and column-wise average-pooling(CAP),observational feature encoding is established.Finally,the centralized training and decentralized execution framework is adopted to complete the training of hunting strategy.Each USV in the fleet shares the same policy and perform actions independently.Simulation experiments have verified the effectiveness of the DPOMH-PPO algorithm in the test scenarios with different numbers of USVs.Moreover,the advantages of the proposed model are comprehensively analyzed from the aspects of algorithm performance,migration effect in task scenarios and self-organization capability after being damaged,the potential deployment and application of DPOMH-PPO in the real environment is verified.
基金co-funded by the National Natural Science Foundation of China(No.61903187)the National Key R&D Program of China(No.2021YFB1600500)+2 种基金the China Scholarship Council(No.202006830095)the Natural Science Foundation of Jiangsu Province(No.BK20190414)the Jiangsu Province Postgraduate Innovation Fund(No.KYCX20_0213).
文摘Reinforcement Learning(RL)techniques are being studied to solve the Demand and Capacity Balancing(DCB)problems to fully exploit their computational performance.A locally gen-eralised Multi-Agent Reinforcement Learning(MARL)for real-world DCB problems is proposed.The proposed method can deploy trained agents directly to unseen scenarios in a specific Air Traffic Flow Management(ATFM)region to quickly obtain a satisfactory solution.In this method,agents of all flights in a scenario form a multi-agent decision-making system based on partial observation.The trained agent with the customised neural network can be deployed directly on the corresponding flight,allowing it to solve the DCB problem jointly.A cooperation coefficient is introduced in the reward function,which is used to adjust the agent’s cooperation preference in a multi-agent system,thereby controlling the distribution of flight delay time allocation.A multi-iteration mechanism is designed for the DCB decision-making framework to deal with problems arising from non-stationarity in MARL and to ensure that all hotspots are eliminated.Experiments based on large-scale high-complexity real-world scenarios are conducted to verify the effectiveness and efficiency of the method.From a statis-tical point of view,it is proven that the proposed method is generalised within the scope of the flights and sectors of interest,and its optimisation performance outperforms the standard computer-assisted slot allocation and state-of-the-art RL-based DCB methods.The sensitivity analysis preliminarily reveals the effect of the cooperation coefficient on delay time allocation.
基金support from the National Natural Science Foundation of China under Grant No.52202465 and No.52202452Hebei Provincial Natural Science Foundation for Excellent Young Scholars No.E2023202007+2 种基金China Postdoctoral Science Foundation Funded Project No.2023M741452Hebei Province Introduced Overseas Talents Funding Project No.C20220312supported by State Key Laboratory of Explosion Science and Safety Protection(Beijing Institute of Technology,No.KFJJ24–11 M).
文摘Hybrid electric vehicles(HEVs)have the advantages of lower emissions and less noise pollution than traditional fuel vehicles.Developing reasonable energy management strategies(EMSs)can effectively reduce fuel consumption and improve the fuel economy of HEVs.However,current EMSs still have problems,such as complex multi-objective optimization and poor algorithm robustness.Herein,a multi-agent reinforcement learning(MADRL)framework is proposed based on Multi-Agent Deep Deterministic Policy Gradient(MADDPG)algorithm to solve such problems.Specifically,a vehicle model and dynamics model are established,and based on this,a multi-objective EMS is developed by considering fuel economy,maintaining the battery State of Charge(SOC),and reducing battery degradation.Secondly,the proposed strategy regards the engine and battery as two agents,and the agents cooperate with each other to realize optimal power distribution and achieve the optimal control strategy.Finally,the WLTC and HWFET driving cycles are employed to verify the performances of the proposed method,the fuel consumption decreases by 26.91%and 8.41%on average compared to the other strategies.The simulation results demonstrate that the proposed strategy has remarkable superiority in multi-objective optimization.
基金co-supported by the National Natural Science Foundation of China(Nos.61906203,61876187)the National Key Laboratory of Science and Technology on UAV,Northwestern Polytechnical University,China(No.614230110080817)。
文摘Unmanned Aerial Vehicles(UAVs)are useful in dangerous and dynamic tasks such as search-and-rescue,forest surveillance,and anti-terrorist operations.These tasks can be solved better through the collaboration of multiple UAVs under human supervision.However,it is still difficult for human to monitor,understand,predict and control the behaviors of the UAVs due to the task complexity as well as the black-box machine learning and planning algorithms being used.In this paper,the coactive design method is adopted to analyze the cognitive capabilities required for the tasks and design the interdependencies among the heterogeneous teammates of UAVs or human for coherent collaboration.Then,an agent-based task planner is proposed to automatically decompose a complex task into a sequence of explainable subtasks under constrains of resources,execution time,social rules and costs.Besides,a deep reinforcement learning approach is designed for the UAVs to learn optimal policies of a flocking behavior and a path planner that are easy for the human operator to understand and control.Finally,a mixed-initiative action selection mechanism is used to evaluate the learned policies as well as the human’s decisions.Experimental results demonstrate the effectiveness of the proposed methods.
基金supported by the National Natural Science Foundation of China under Grant 52077146.
文摘With the construction of the power Internet of Things(IoT),communication between smart devices in urban distribution networks has been gradually moving towards high speed,high compatibility,and low latency,which provides reliable support for reconfiguration optimization in urban distribution networks.Thus,this study proposed a deep reinforcement learning based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture to obtain a real-time optimal multi-level dynamic reconfiguration solution.First,the multi-level dynamic reconfiguration method was discussed,which included feeder-,transformer-,and substation-levels.Subsequently,the multi-agent system was combined with the cloud-edge collaboration architecture to build a deep reinforcement learning model for multi-level dynamic reconfiguration in an urban distribution network.The cloud-edge collaboration architecture can effectively support the multi-agent system to conduct“centralized training and decentralized execution”operation modes and improve the learning efficiency of the model.Thereafter,for a multi-agent system,this study adopted a combination of offline and online learning to endow the model with the ability to realize automatic optimization and updation of the strategy.In the offline learning phase,a Q-learning-based multi-agent conservative Q-learning(MACQL)algorithm was proposed to stabilize the learning results and reduce the risk of the next online learning phase.In the online learning phase,a multi-agent deep deterministic policy gradient(MADDPG)algorithm based on policy gradients was proposed to explore the action space and update the experience pool.Finally,the effectiveness of the proposed method was verified through a simulation analysis of a real-world 445-node system.
文摘With the rapid advancement of deep reinforcement learning(DRL)in multi-agent systems,a variety of practical application challenges and solutions in the direction of multi-agent deep reinforcement learning(MADRL)are surfacing.Path planning in a collision-free environment is essential for many robots to do tasks quickly and efficiently,and path planning for multiple robots using deep reinforcement learning is a new research area in the field of robotics and artificial intelligence.In this paper,we sort out the training methods for multi-robot path planning,as well as summarize the practical applications in the field of DRL-based multi-robot path planning based on the methods;finally,we suggest possible research directions for researchers.
文摘This paper investigates the use of multi-agent deep Q-network(MADQN)to address the curse of dimensionality issue occurred in the traditional multi-agent reinforcement learning(MARL)approach.The proposed MADQN is applied to traffic light controllers at multiple intersections with busy traffic and traffic disruptions,particularly rainfall.MADQN is based on deep Q-network(DQN),which is an integration of the traditional reinforcement learning(RL)and the newly emerging deep learning(DL)approaches.MADQN enables traffic light controllers to learn,exchange knowledge with neighboring agents,and select optimal joint actions in a collaborative manner.A case study based on a real traffic network is conducted as part of a sustainable urban city project in the Sunway City of Kuala Lumpur in Malaysia.Investigation is also performed using a grid traffic network(GTN)to understand that the proposed scheme is effective in a traditional traffic network.Our proposed scheme is evaluated using two simulation tools,namely Matlab and Simulation of Urban Mobility(SUMO).Our proposed scheme has shown that the cumulative delay of vehicles can be reduced by up to 30%in the simulations.
基金National Natural Science Foundation of China(Nos.61673262 and 50779033)National GF Basic Research Program(No.JCKY2021110B134)Fundamental Research Funds for the Central Universities。
文摘The multi-agent path planning problem presents significant challenges in dynamic environments,primarily due to the ever-changing positions of obstacles and the complex interactions between agents’actions.These factors contribute to a tendency for the solution to converge slowly,and in some cases,diverge altogether.In addressing this issue,this paper introduces a novel approach utilizing a double dueling deep Q-network(D3QN),tailored for dynamic multi-agent environments.A novel reward function based on multi-agent positional constraints is designed,and a training strategy based on incremental learning is performed to achieve collaborative path planning of multiple agents.Moreover,the greedy and Boltzmann probability selection policy is introduced for action selection and avoiding convergence to local extremum.To match radar and image sensors,a convolutional neural network-long short-term memory(CNN-LSTM)architecture is constructed to extract the feature of multi-source measurement as the input of the D3QN.The algorithm’s efficacy and reliability are validated in a simulated environment,utilizing robot operating system and Gazebo.The simulation results show that the proposed algorithm provides a real-time solution for path planning tasks in dynamic scenarios.In terms of the average success rate and accuracy,the proposed method is superior to other deep learning algorithms,and the convergence speed is also improved.
基金this study was sponsored by the National Natural Science Foundation of China(Grant Number:52278103)the Natural Science Foundation-Departmental Joint Fund of Hunan Province,China(Grant Number:2023JJ60570).
文摘Heating,ventilation,and air conditioning(HVAC)systems consume a significant amount of energy to maintain thermal comfort and indoor air quality in buildings,which results in high operational costs.Reinforcement learning is an effective method for controlling HVAC systems.However,in large and complex HVAC systems,traditional reinforcement learning algorithms often face the challenges of slow training speed and poor convergence performance.This paper proposes a multi-objective optimization control method based on the multi-agent deep deterministic policy gradient(MADDPG)algorithm,which aims to minimize HVAC energy consumption while ensuring optimal thermal comfort and indoor air quality in each zone.Using a multi-zone office building with fan coil units and a dedicated outdoor air system as a case study,we developed an EnergyPlus-Python co-simulation platform.The proposed control method was employed during both the heating and cooling seasons to independently control the temperature setpoints and fresh airflow in different zones of the office building.The simulation results from both the heating and cooling seasons demonstrate that the MADDPG control method exhibits faster convergence during training and excellent learning capabilities,allowing it to adapt effectively to changes in environmental conditions and implement appropriate control actions.Under similar indoor thermal comfort and air quality conditions,the MADDPG control method consumes less energy than the traditional reinforcement learning method,it saves 24.1%of energy during the heating season and 8.9%during the cooling season compared to the rule-based control method.Additionally,by adjusting the reward function in the MADDPG algorithm,it is possible to flexibly balance energy consumption,thermal comfort,and air quality preferences,demonstrating the algorithm’s strong applicability.
基金supported in part by the National Natural Science Foundation of China(grant nos.61971365,61871339,62171392)Digital Fujian Province Key Laboratory of IoT Communication,Architecture and Safety Technology(grant no.2010499)+1 种基金the State Key Program of the National Natural Science Foundation of China(grant no.61731012)the Natural Science Foundation of Fujian Province of China No.2021J01004.
文摘Unmanned Aerial Vehicles(UAvs)as aerial base stations to provide communication services for ground users is a flexible and cost-effective paradigm in B5G.Besides,dynamic resource allocation and multi-connectivity can be adopted to further harness the potentials of UAVs in improving communication capacity,in such situations such that the interference among users becomes a pivotal disincentive requiring effective solutions.To this end,we investigate the Joint UAV-User Association,Channel Allocation,and transmission Power Control(J-UACAPC)problem in a multi-connectivity-enabled UAV network with constrained backhaul links,where each UAV can determine the reusable channels and transmission power to serve the selected ground users.The goal was to mitigate co-channel interference while maximizing long-term system utility.The problem was modeled as a cooperative stochastic game with hybrid discrete-continuous action space.A Multi-Agent Hybrid Deep Reinforcement Learning(MAHDRL)algorithm was proposed to address this problem.Extensive simulation results demonstrated the effectiveness of the proposed algorithm and showed that it has a higher system utility than the baseline methods.