This paper employs the Proximal Policy Optimization (PPO) algorithm to study the risk hedging problem of Shanghai Stock Exchange (SSE) 50ETF options. First, the action and state spaces were designed based on the characteristics of the hedging task, and a reward function was developed according to the cost function of the options. Second, drawing on the concept of curriculum learning, the agent was guided through a simulation-to-real learning approach for dynamic hedging tasks, reducing the learning difficulty and addressing the issue of insufficient option data. A dynamic hedging strategy for 50ETF options was constructed. Finally, numerical experiments demonstrate the superiority of the designed algorithm over traditional hedging strategies in terms of hedging effectiveness.
Funding: Supported by the Foundation of the Key Laboratory of System Control and Information Processing, Ministry of Education, China (Scip20240111); the Aeronautical Science Foundation of China (Grant 2024Z071108001); and the Foundation of the Key Laboratory of Traffic Information and Safety of Anhui Higher Education Institutes, Anhui Sanlian University (KLAHEI18018).
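A cost-based hedging reward of this kind typically penalizes both the replication error of the hedged position and the transaction cost of rebalancing. A minimal sketch in Python, where the proportional cost rate kappa and the absolute-P&L penalty are illustrative assumptions rather than the paper's exact cost function:

```python
import numpy as np

def hedging_reward(option_pnl, hedge_pnl, d_shares, spot, kappa=1e-3):
    """One-step reward for delta hedging: penalize the net P&L of the
    hedged position plus a proportional transaction cost (illustrative)."""
    net_pnl = option_pnl + hedge_pnl              # replication error this step
    cost = kappa * abs(d_shares) * spot           # proportional trading cost
    return -abs(net_pnl) - cost

# Example: option leg lost 0.8, hedge leg gained 0.7, 2000 shares traded at 2.5
r = hedging_reward(-0.8, 0.7, 2000, 2.5)
```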
Bionic gait learning of quadruped robots based on reinforcement learning has become a hot research topic. The proximal policy optimization (PPO) algorithm has a low probability of learning a successful gait from scratch due to problems such as reward sparsity. To solve this problem, we propose an experience evolution proximal policy optimization (EEPPO) algorithm, which integrates PPO with prior knowledge highlighted by an evolutionary strategy. We use successfully trained samples as prior knowledge to guide the learning direction and thereby increase the success probability of the learning algorithm. To verify the effectiveness of the proposed EEPPO algorithm, we conducted simulation experiments on the quadruped robot gait learning task in PyBullet. Experimental results show that the central pattern generator based radial basis function (CPG-RBF) network and the policy network are updated simultaneously to achieve the quadruped robot's bionic diagonal trot gait learning task, using key information such as the robot's speed, posture, and joint states. Comparison with the traditional soft actor-critic (SAC) algorithm validates the superiority of the proposed EEPPO algorithm, which learns a more stable diagonal trot gait on flat terrain.
Funding: Supported by the National Natural Science Foundation of China (No. 62103009).
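One common way to let successful samples guide a PPO update is to add an imitation term on an elite buffer to the clipped surrogate loss. The sketch below assumes that form; the weight beta, the Gaussian policy, and the additive loss structure are illustrative, not EEPPO's exact design:

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return Normal(self.mu(obs), self.log_std.exp())

def eeppo_loss(policy, obs, act, old_logp, adv, elite_obs, elite_act,
               clip=0.2, beta=0.1):
    """PPO clipped surrogate plus an imitation term on elite (successful)
    rollouts; the additive form and the weight beta are assumptions."""
    logp = policy(obs).log_prob(act).sum(-1)
    ratio = torch.exp(logp - old_logp)
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1 - clip, 1 + clip) * adv)
    imitation = -policy(elite_obs).log_prob(elite_act).sum(-1).mean()
    return -surr.mean() + beta * imitation
```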
We use the advanced proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy for speed control of a "model-free" quadrotor. The model is controlled by four learned neural networks, which directly map the system states to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, the speed tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising both offline and online learning is developed for practical use. A model with strong generalization ability is learned in the offline phase; the flight policy is then continuously optimized in the online learning phase. Finally, the performance of the proposed algorithm is compared with that of the traditional PID algorithm.
Funding: Supported by the National Key R&D Program of China (No. 2018AAA0101400); the National Natural Science Foundation of China (Nos. 61973074, U1713209, 61520106009, and 61533008); the Science and Technology on Information System Engineering Laboratory (No. 05201902); and the Fundamental Research Funds for the Central Universities, China.
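The integral compensator can be read as augmenting the policy's input with the accumulated velocity-tracking error, giving the actor the information it needs to remove steady-state error, much like the integral term in a PID loop. A minimal sketch under that reading (the anti-windup clip is an added assumption):

```python
import numpy as np

class IntegralCompensator:
    """Augments the state with the accumulated velocity-tracking error so the
    policy can cancel steady-state error (the exact form is an assumption)."""
    def __init__(self, dim, dt=0.02, limit=5.0):
        self.integral = np.zeros(dim)
        self.dt, self.limit = dt, limit

    def augment(self, state, v_ref, v_actual):
        self.integral += (v_ref - v_actual) * self.dt
        self.integral = np.clip(self.integral, -self.limit, self.limit)  # anti-windup
        return np.concatenate([state, self.integral])

comp = IntegralCompensator(dim=3)
aug = comp.augment(np.zeros(9), v_ref=np.ones(3), v_actual=np.zeros(3))
```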
The Unmanned Aerial Vehicle (UAV) stands as a burgeoning electric transportation carrier, holding substantial promise for the logistics sector. A reinforcement learning framework, Centralized-S Proximal Policy Optimization (C-SPPO), based on a centralized decision process and considering policy entropy (S), is proposed. The framework aims to plan the best scheduling scheme, with the objective of minimizing both the timeout of order requests and the flight impact of UAVs that may lead to conflicts. In this framework, matching intents are generated from the observations of UAV agents, and the final conflict-free matching results are output under the guidance of a centralized decision maker. A pre-activation operation is also introduced to further enhance cooperation among UAV agents. Simulation experiments based on real-world data from New York City show that C-SPPO outperforms the baseline algorithms in Average Delay Time (ADT), Maximum Delay Time (MDT), Order Delay Rate (ODR), Average Flight Distance (AFD), and Flight Impact Ratio (FIR). Furthermore, the framework scales to scenarios of different sizes without requiring additional training.
Funding: Supported by the Chinese Special Research Project for Civil Aircraft (No. MJZ17N22); the National Natural Science Foundation of China (Nos. U2133207 and U2333214); the China Postdoctoral Science Foundation (No. 2023M741687); and the National Social Science Fund of China (No. 22&ZD169).
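The centralized decision maker's job of turning per-UAV matching intents into a conflict-free assignment can be illustrated with a simple greedy resolution over intent scores; the paper's actual mechanism, which also involves policy entropy and pre-activation, is richer than this sketch:

```python
import numpy as np

def resolve_intents(scores):
    """Greedy centralized matching: each UAV proposes scores over orders;
    the decision maker assigns each order to at most one UAV (illustrative)."""
    n_uav, n_order = scores.shape
    pairs = sorted(((scores[i, j], i, j) for i in range(n_uav)
                    for j in range(n_order)), reverse=True)
    used_uav, used_order, match = set(), set(), {}
    for s, i, j in pairs:
        if i not in used_uav and j not in used_order:
            match[i] = j
            used_uav.add(i)
            used_order.add(j)
    return match  # conflict-free UAV -> order assignment

match = resolve_intents(np.random.rand(4, 6))
```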
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) that minimizes total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies significantly improve on the basic PPO, and that the IPPO outperforms state-of-the-art scheduling methods in both convergence and solution quality.
Funding: Partially supported by the National Key Research and Development Program of the Ministry of Science and Technology of China (2022YFE0114200) and the National Natural Science Foundation of China (U20A6004).
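A vector reward with dynamic weights means the agent sees a scalarized reward whose weights shift over training rather than a fixed objective-related scalar. A hypothetical sketch of that idea (the normalization and the gap-driven weight update are assumptions):

```python
import numpy as np

def scalarize(reward_vec, weights):
    """Scalarize the vector reward, e.g. [-tardiness, -energy], with the
    current dynamic weights."""
    return float(np.dot(weights, reward_vec))

def update_weights(weights, objective_gaps, lr=0.05):
    """Shift weight toward whichever objective is currently lagging;
    the update rule here is only an illustration."""
    w = np.clip(weights + lr * objective_gaps, 1e-3, None)
    return w / w.sum()

w = np.array([0.5, 0.5])
r = scalarize(np.array([-3.0, -1.2]), w)
w = update_weights(w, objective_gaps=np.array([0.8, 0.1]))
```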
Dynamic soaring, inspired by the wind-riding flight of birds such as albatrosses, is a biomimetic technique that leverages wind fields to enhance the endurance of unmanned aerial vehicles (UAVs). Achieving a precise soaring trajectory is crucial for maximizing energy efficiency during flight. Existing nonlinear programming methods depend heavily on the choice of initial values, which are hard to determine. This paper therefore introduces a deep reinforcement learning method based on a differentially flat model for dynamic soaring trajectory planning and optimization. First, the gliding trajectory is parameterized with Fourier basis functions, achieving a flexible trajectory representation with a minimal number of hyperparameters. The trajectory optimization problem is then formulated as a dynamic interactive Markov decision process. The hyperparameters of the trajectory are optimized using the Proximal Policy Optimization (PPO2) algorithm from deep reinforcement learning (DRL), reducing the strong reliance on initial value settings. Finally, a comparison with the nonlinear programming method reveals that the trajectory generated by the proposed approach is smoother while meeting the same performance requirements: a 34% reduction in maximum thrust, a 39.4% decrease in maximum thrust difference, and a 33% reduction in maximum airspeed difference.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 52372398 and 62003272).
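Parameterizing the trajectory with Fourier basis functions keeps the search space small: PPO2 only has to tune the series coefficients rather than a full waypoint sequence. A sketch of evaluating one trajectory coordinate from such coefficients (the dimensions and period are illustrative):

```python
import numpy as np

def fourier_traj(t, a0, a, b, T):
    """Evaluate one trajectory coordinate parameterized by a truncated
    Fourier series; the coefficients are the hyperparameters tuned by PPO2."""
    k = np.arange(1, len(a) + 1)
    w = 2 * np.pi * k[:, None] * t[None, :] / T
    return a0 + a @ np.cos(w) + b @ np.sin(w)

t = np.linspace(0.0, 10.0, 200)
x = fourier_traj(t, 0.0, np.array([3.0, 0.5]), np.array([1.0, 0.2]), T=10.0)
```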
In this paper, we investigate the problem of fast spectrum sharing in vehicle-to-everything communication. To improve the spectrum efficiency of the whole system, the spectrum of vehicle-to-infrastructure links is reused by vehicle-to-vehicle links. To this end, we model it as a deep reinforcement learning problem and tackle it with proximal policy optimization. A considerable number of interactions are often required to train an agent with good performance, so simulation-based training is commonly used in communication networks. Nevertheless, severe performance degradation may occur when the agent is deployed directly in the real world, even if it performs well on the simulator, due to the reality gap between the simulated and real environments. To address this issue, we make preliminary efforts by proposing an algorithm based on meta reinforcement learning. This algorithm enables the agent to adapt rapidly to a new task using knowledge extracted from similar tasks, leading to fewer interactions and less training time. Numerical results show that our method achieves near-optimal performance and exhibits rapid convergence.
Funding: L. Liang was supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK20220810 and in part by the National Natural Science Foundation of China under Grants 62201145 and 62231019. S. Jin was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62261160576, 62341107, and 61921004.
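The abstract does not name the meta reinforcement learning algorithm used; a first-order, Reptile-style scheme is one simple way to realize rapid adaptation from similar tasks, and is sketched below purely as an illustration:

```python
import copy
import torch

def meta_outer_step(meta_policy, task_envs, inner_update, eps=0.1):
    """First-order meta-update (Reptile-style): adapt a copy of the shared
    initialization to each task, then move the initialization toward the
    adapted weights. The paper's exact meta-RL algorithm is not specified."""
    meta_state = {k: v.clone() for k, v in meta_policy.state_dict().items()}
    for env in task_envs:
        task_policy = copy.deepcopy(meta_policy)
        inner_update(task_policy, env)   # e.g., a few PPO steps on this task
        for k, v in task_policy.state_dict().items():
            meta_state[k] += (eps / len(task_envs)) * (v - meta_policy.state_dict()[k])
    meta_policy.load_state_dict(meta_state)
```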
SATech-01 is an experimental satellite for space science exploration and on-orbit demonstration of advanced technologies. The satellite is equipped with 16 experimental payloads and supports multiple working modes to meet the observation requirements of the various payloads. Because of the limitations of the platform's power supply and data storage systems, proposing reasonable mission planning schemes to improve the scientific return of the payloads is a critical issue. In this article, we formulate the integrated task scheduling of SATech-01 as a multi-objective optimization problem and propose a novel Fair Integrated Scheduling with Proximal Policy Optimization (FIS-PPO) algorithm to solve it. We use multiple decision heads to generate decisions for each task and design an action mask to ensure that the schedule meets the platform constraints. Experimental results show that FIS-PPO pushes the capability of the platform to its limit, improving overall observation efficiency by 31.5% compared with the rule-based plans currently used. Moreover, fairness is considered in the reward design, and our method achieves much better performance in terms of equal task opportunities. Because of its low computational complexity, the task scheduling algorithm has the potential to be deployed directly on board for real-time task scheduling in future space projects.
Funding: Supported by the Strategic Priority Program on Space Science, Chinese Academy of Sciences.
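Action masking of the kind FIS-PPO uses can be implemented by driving the logits of constraint-violating actions to negative infinity before building the categorical distribution, so infeasible tasks receive zero probability. A generic sketch (the random mask stands in for the real platform constraints):

```python
import torch

def masked_logits(logits, feasible):
    """Invalidate actions that violate platform constraints (power, storage)
    by pushing their logits to -inf before sampling."""
    return logits.masked_fill(~feasible, float("-inf"))

logits = torch.randn(16)                 # one decision head over 16 tasks
feasible = torch.rand(16) > 0.3          # mask derived from platform state
dist = torch.distributions.Categorical(logits=masked_logits(logits, feasible))
action = dist.sample()                   # never picks an infeasible task
```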
In modern Beyond-Visual-Range (BVR) aerial combat, unmanned loyal wingmen are pivotal, yet their autonomous capabilities are limited. Our study introduces an advanced control algorithm based on hierarchical reinforcement learning to enhance these capabilities for critical missions such as target search, positioning, and relay guidance. Structured as a dual-layer model, the algorithm's lower layer manages basic aircraft maneuvers for optimal flight, while the upper layer processes battlefield dynamics and issues precise navigational commands. This approach enables accurate navigation and effective reconnaissance for the lead aircraft. Notably, our Hierarchical Prior-augmented Proximal Policy Optimization (HPE-PPO) algorithm employs a prior-based training, prior-free execution method, accelerating target positioning training and ensuring robust target reacquisition. The paper also improves missile relay guidance, promoting effective guidance. By integrating this system with a human-piloted lead aircraft, this paper proposes a potent solution for cooperative aerial warfare. Rigorous experiments demonstrate enhanced survivability and efficiency of loyal wingmen, marking a significant contribution to Unmanned Aerial Vehicle (UAV) formation control research. This advancement is poised to drive substantial interest and progress in related technological fields.
Funding: Co-supported by the Natural Science Basic Research Program of Shaanxi, China (No. 2022JQ-593); the Key R&D Program of the Shaanxi Provincial Department of Science and Technology, China (No. 2022GY-089); and the Aeronautical Science Foundation of China (No. 20220013053005).
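The dual-layer structure can be pictured as an upper policy that turns battlefield observations into a navigational command, which the lower policy then tracks with basic maneuvers. A schematic sketch with assumed interfaces:

```python
import numpy as np

def hierarchical_step(upper_policy, lower_policy, battlefield_obs, flight_obs):
    """Two-layer control: the upper layer maps battlefield dynamics to a
    navigational command; the lower layer maps that command plus the flight
    state to maneuver controls. All interfaces here are assumptions."""
    nav_cmd = upper_policy(battlefield_obs)             # e.g., heading, altitude
    controls = lower_policy(np.concatenate([flight_obs, nav_cmd]))
    return controls

# Example with stub policies standing in for trained networks
ctrl = hierarchical_step(lambda o: o[:2], lambda x: np.tanh(x[:4]),
                         np.random.rand(8), np.random.rand(6))
```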
Federated learning enables data owners in the Internet of Things (IoT) to collaborate in training models without sharing private data, creating new business opportunities for building a data market. In practical operation, however, federated learning applications still face problems. Blockchain offers decentralization, distribution, and security; blockchain-enabled federated learning further improves the security and performance of model training while expanding the application scope of federated learning, and blockchain's natural financial attributes help establish a federated learning data market. However, the data for federated learning tasks may be distributed across a large number of resource-constrained IoT devices with different computing, communication, and storage resources, and the data quality of each device may also vary. How to effectively select clients with the data required for a federated learning task is therefore a research hotspot. In this paper, a two-stage client selection scheme for blockchain-enabled federated learning is proposed: it first selects clients that satisfy the federated learning task through attribute-based encryption, protecting the attribute privacy of clients; blockchain nodes then select some of these clients for local model aggregation via a proximal policy optimization algorithm. Experiments show that the model performance of the two-stage client selection scheme exceeds that of other client selection algorithms when some clients are offline and data quality is poor.
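The two-stage selection can be pictured as an eligibility filter followed by a learned ranking. In the sketch below the attribute check is done in the clear for brevity, whereas the paper performs it privately via attribute-based encryption, and the PPO-learned scores are assumed given:

```python
def select_clients(clients, task_attrs, ppo_scores, k):
    # Stage 1: keep clients whose attribute set covers the task's needs
    # (done privately via attribute-based encryption in the paper).
    eligible = [c for c in clients if task_attrs <= c["attrs"]]
    # Stage 2: blockchain nodes pick the top-k by a PPO-learned score.
    return sorted(eligible, key=lambda c: ppo_scores[c["id"]], reverse=True)[:k]

clients = [{"id": 0, "attrs": {"gpu", "img"}}, {"id": 1, "attrs": {"img"}}]
chosen = select_clients(clients, {"img"}, ppo_scores={0: 0.9, 1: 0.4}, k=1)
```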
In this paper, we investigate a Reconfigurable Intelligent Surface (RIS)-assisted secure Symbiosis Radio (SR) network to address information leakage from the primary transmitter (PTx) to potential eavesdroppers. Specifically, the RIS serves as a secondary transmitter in the SR network to secure the communication between the PTx and the Primary Receiver (PRx), while simultaneously transmitting its own information to the PTx by configuring its phase shifts. Considering the presence of multiple eavesdroppers and uncertain channels in practical scenarios, we jointly optimize the active beamforming of the PTx and the phase shifts of the RIS to maximize the secrecy energy efficiency of the RIS-supported SR network while satisfying the quality-of-service requirement and the secure communication rate. To solve this complicated non-convex stochastic optimization problem, we propose a secure beamforming method based on Proximal Policy Optimization (PPO), an efficient deep reinforcement learning algorithm, to find the optimal beamforming strategy against eavesdroppers. Simulation results show that the proposed PPO-based method converges quickly and achieves a secrecy energy efficiency gain of up to 22% over the considered benchmarks.
Funding: Supported by the National Natural Science Foundation of China under Grant 62101277.
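The secrecy objective behind such an optimization is usually the worst-case secrecy rate, i.e. the legitimate rate minus the strongest eavesdropper's rate, divided by the power consumption to obtain secrecy energy efficiency. A sketch using that standard definition (assumed here, not quoted from the paper):

```python
import numpy as np

def secrecy_energy_efficiency(sinr_prx, sinr_eves, power_total):
    """Worst-case secrecy rate over all eavesdroppers per unit of total
    power consumption (a standard textbook definition)."""
    r_leg = np.log2(1 + sinr_prx)
    r_eve = max(np.log2(1 + s) for s in sinr_eves)
    return max(r_leg - r_eve, 0.0) / power_total

see = secrecy_energy_efficiency(12.0, [1.5, 2.2, 0.8], power_total=1.6)
```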
Real-time gait switching of a quadruped robot under speed changes is a difficult problem in robotics research, and applying reinforcement learning to it is a novel solution. In this paper, a quadruped robot simulation platform is built based on the Robot Operating System (ROS), OpenAI Gym is used as the RL framework, and the Proximal Policy Optimization (PPO) algorithm is used for quadruped robot gait switching. The training task is to learn different gait parameters for different speed inputs, including gait type, gait cycle, gait offset, and gait interval. The trained gait parameters are then fed to a Model Predictive Control (MPC) controller, which computes the joint forces/torques. The computed joint forces are transmitted to the robot's joint motors to control joint rotation, realizing gait switching at different speeds. The robot can thus imitate animal gait transitions more realistically: walking at very low speed, trotting at medium speed, and galloping at high speed. This paper integrates a variety of factors affecting gait training and uses several reward constraints, including velocity, time, energy, and balance rewards. Each reward is given its own weight, and the instant reward at each training step is the weighted sum of these terms, which ensures the reliability of the training results. Multiple groups of comparative simulation experiments show that the priority of the balance, velocity, energy, and time rewards decreases in that order, and that no reward weight should exceed 0.5. Training works best when the policy and value networks are three-layer neural networks with 64 neurons per layer and the discount factor is 0.99.
Funding: Supported by the Science and Technology Development Program of Jilin Province, China (Grant No. 20230101117JC) and the National Natural Science Foundation of China (Grant No. 51305157).
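The weighted-sum reward described above can be written down directly; the specific weights below merely respect the reported findings (balance has the highest priority and no weight exceeds 0.5) and are otherwise illustrative:

```python
def gait_reward(r_balance, r_velocity, r_energy, r_time,
                w=(0.5, 0.3, 0.15, 0.05)):
    """Instant reward as a weighted sum with priority
    balance > velocity > energy > time and all weights <= 0.5,
    consistent with the paper's findings; the numbers are illustrative."""
    return (w[0] * r_balance + w[1] * r_velocity
            + w[2] * r_energy + w[3] * r_time)

r = gait_reward(r_balance=0.9, r_velocity=0.6, r_energy=-0.2, r_time=-0.1)
```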
Hydrogen energy is a crucial support for China's low-carbon energy transition. With the large-scale integration of renewable energy, combining hydrogen with integrated energy systems has become one of the most promising directions of development. This paper proposes an optimized scheduling model for a hydrogen-coupled electro-heat-gas integrated energy system (HCEHG-IES) using generative adversarial imitation learning (GAIL). The model aims to enhance renewable-energy absorption, reduce carbon emissions, and improve grid-regulation flexibility. First, the optimal scheduling problem of the HCEHG-IES under uncertainty is modeled as a Markov decision process (MDP). To overcome the limitations of conventional deep reinforcement learning algorithms, including long optimization time, slow convergence, and subjective reward design, this study augments the PPO algorithm with a discriminator network and expert data; the resulting algorithm, termed GAIL, enables the agent to perform imitation learning from expert data. Based on this model, dynamic scheduling decisions are made in continuous state and action spaces, generating optimal energy-allocation and management schemes. Simulation results indicate that, compared with traditional reinforcement learning algorithms, the proposed algorithm offers better economic performance. Guided by expert data, the agent avoids blind optimization, shortens the offline training time, and improves convergence. In the online phase, the algorithm enables flexible energy utilization, thereby promoting renewable-energy absorption and reducing carbon emissions.
Funding: Supported by the State Grid Corporation Technology Project (No. 522437250003).
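In standard GAIL, the discriminator is trained to separate expert (state, action) pairs from the agent's, and its output is converted into a surrogate reward for the policy update. A minimal sketch of that standard machinery, with illustrative network sizes:

```python
import torch
from torch import nn

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert vs. agent; its output
    drives the imitation reward for the PPO agent (standard GAIL form)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def gail_reward(disc, s, a):
    # r = -log(1 - D(s, a)): high when the agent's behavior looks expert-like
    return -torch.log(1 - torch.sigmoid(disc(s, a)) + 1e-8)

disc = Discriminator(s_dim=10, a_dim=3)
r = gail_reward(disc, torch.randn(32, 10), torch.randn(32, 3))
```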
Ground-to-air confrontation task assignment is large in scale and must handle many concurrent assignments and random events. When existing task assignment methods are applied to ground-to-air confrontation, they deal with complex tasks inefficiently and suffer from interaction conflicts in multiagent systems. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, this paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Based on the idea of the optimal assignment strategy and combined with a deep reinforcement learning (DRL) training framework, the algorithm adds a multi-head attention mechanism and a stage reward mechanism to the bilaterally clipped PPO algorithm to address low training efficiency. Finally, simulation experiments are carried out in a digital battlefield. The OGMN architecture combined with the PPO-TAGNA algorithm obtains higher rewards faster and achieves a higher win ratio. Analysis of agent behavior verifies the efficiency, superiority, and rational resource utilization of this method.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62106283 and 72001214) and the Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-484).
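The multi-head attention mechanism lets each agent weigh battlefield entities before acting; a minimal sketch with PyTorch's built-in module (the dimensions are illustrative):

```python
import torch
from torch import nn

# Each agent attends over battlefield entity features before its policy head.
entities = torch.randn(1, 20, 64)          # (batch, n_entities, feature_dim)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
context, weights = attn(entities, entities, entities)   # self-attention
summary = context.mean(dim=1)              # pooled input to the policy head
```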
To guarantee the heterogeneous delay requirements of diverse vehicular services, it is necessary to design a fully cooperative policy for both Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) links. This paper investigates reducing the delay of edge information sharing for V2V links while satisfying the delay requirements of the V2I links. Specifically, a mean delay minimization problem and a maximum individual delay minimization problem are formulated to improve global network performance and to ensure the fairness of a single user, respectively. A multi-agent reinforcement learning framework is designed to solve these two problems, in which a new reward function evaluates the utilities of the two optimization objectives in a unified framework. A proximal policy optimization approach then enables each V2V user to learn its policy using the shared global network reward. The effectiveness of the proposed approach is validated against baseline approaches through extensive simulation experiments.
Funding: Supported in part by the National Natural Science Foundation of China under Grants 61901078, 61771082, 61871062, and U20A20157; in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN201900609; in part by the Natural Science Foundation of Chongqing under Grant cstc2020jcyj-zdxmX0024; in part by the University Innovation Research Group of Chongqing under Grant CXQT20017; and in part by the China University Industry-University-Research Collaborative Innovation Fund (Future Network Innovation Research and Application Project) under Grant 2021FNA04008.
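A reward that evaluates both objectives, mean delay and worst individual delay, in one framework can be sketched as a weighted combination with a penalty for violating the V2I requirement; the weighting below is an assumption, not the paper's exact function:

```python
def unified_reward(v2v_delays, v2i_delay_ok, lam=0.5, penalty=10.0):
    """Combine the global mean delay (network performance) with the worst
    single-user delay (fairness), and penalize violating the V2I delay
    requirement. The exact weighting in the paper is not specified here."""
    mean_d = sum(v2v_delays) / len(v2v_delays)
    max_d = max(v2v_delays)
    r = -(lam * mean_d + (1 - lam) * max_d)
    return r if v2i_delay_ok else r - penalty

r = unified_reward([0.8, 1.2, 2.5], v2i_delay_ok=True)
```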
To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. First, the hunting environment and a kinematic model without boundary constraints are built, and the criteria for successful target capture are given. The cooperative hunting problem of a USV fleet is then modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting Proximal Policy Optimization (DPOMH-PPO) algorithm applicable to USVs is proposed. An observation model, a reward function, and an action space suited to multi-target hunting tasks are designed. To deal with the dynamically changing dimension of the observational features in a partially observable system, a feature embedding block is proposed: by combining two feature compression methods, column-wise max pooling (CMP) and column-wise average pooling (CAP), a fixed-size observational feature encoding is established. Finally, the centralized training and decentralized execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and performs actions independently. Simulation experiments verify the effectiveness of the DPOMH-PPO algorithm in test scenarios with different numbers of USVs. Moreover, the advantages of the proposed model are analyzed comprehensively in terms of algorithm performance, transfer across task scenarios, and self-organization capability after damage, verifying the potential deployment and application of DPOMH-PPO in real environments.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 61601491); the Natural Science Foundation of Hubei Province, China (Grant No. 2018CFC865); and a Military Research Project of China (Grant No. YJ2020B117).
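The feature embedding block can be sketched directly: encode each observed entity, then compress across the entity dimension with column-wise max pooling (CMP) and column-wise average pooling (CAP), so the output size no longer depends on how many targets are visible. Sizes below are illustrative:

```python
import torch
from torch import nn

class FeatureEmbedding(nn.Module):
    """Encode a variable number of observed entities into a fixed vector by
    concatenating column-wise max pooling (CMP) and average pooling (CAP)."""
    def __init__(self, feat_dim, embed_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU())

    def forward(self, entities):              # (n_entities, feat_dim), n varies
        h = self.enc(entities)
        cmp = h.max(dim=0).values             # column-wise max pooling
        cap = h.mean(dim=0)                   # column-wise average pooling
        return torch.cat([cmp, cap], dim=-1)  # fixed-size observation code

emb = FeatureEmbedding(6)
obs5 = emb(torch.randn(5, 6))   # 5 targets observed -> shape (64,)
obs2 = emb(torch.randn(2, 6))   # 2 targets observed -> same shape (64,)
```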
Recent years have seen a significant increase in the adoption of electric vehicles and in investments in electric vehicle charging infrastructure and rooftop photovoltaic installations. The ability to delay electric vehicle charging provides inherent flexibility that can be used to compensate for the intermittency of photovoltaic generation and to optimize against fluctuating electricity prices. Exploiting this flexibility, however, requires smart control algorithms capable of handling uncertainty in photovoltaic generation, electric vehicle energy demand, and user behaviour. This paper proposes a control framework combining the advantages of reinforcement learning and rule-based control to coordinate the charging of a fleet of electric vehicles in an office building. The control objective is to maximize self-consumption of locally generated electricity and, consequently, to minimize the electricity cost of electric vehicle charging. The performance of the proposed framework is evaluated on a real-world data set from EnergyVille, a Belgian research institute. Simulation results show that the proposed control framework achieves a 62.5% electricity cost reduction compared with a business-as-usual or passive charging strategy, and comes within 5% of a theoretical near-optimal strategy that assumes perfect knowledge of the required energy and user behaviour of each electric vehicle.
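Combining reinforcement learning with rule-based control often takes the form of a feasibility layer that projects the learned charging action onto the set of actions that still guarantee each vehicle its required energy by departure. A sketch of one such rule (the projection form is an assumption):

```python
def safe_charge_power(rl_power, e_needed, t_left, p_max, dt=0.25):
    """Project the RL agent's charging action onto the feasible set: never
    exceed the charger limit, and never charge so little that the vehicle
    cannot reach its required energy before departure (illustrative rule).
    Units: energies in kWh, powers in kW, times in hours."""
    # Minimum power now so the remaining time at full power still suffices
    p_min = max(0.0, e_needed - (t_left - dt) * p_max) / dt
    return min(max(rl_power, p_min), p_max)

p = safe_charge_power(rl_power=2.0, e_needed=20.0, t_left=3.0, p_max=11.0)
```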
This paper addresses the challenge of sample efficiency in reinforcement learning (RL) for autonomous driving, a domain characterized by long-term dependencies and complex environments. While RL has succeeded in various fields, its application to autonomous driving is hindered by the large number of samples needed to learn effective policies. We propose a novel, lightweight reward-shaping method called room-of-adjust to maximize learning progress. The approach separates rewards into continuous tendency rewards for long-term guidance and discrete milestone rewards for short-term exploration, and is designed to integrate easily with other approaches such as efficient representation, imitation learning, and transfer learning. We evaluate the approach on a hill-climbing task with uneven surfaces, which simulates the spatial-temporal reasoning required in autonomous driving. Results show that room-of-adjust reward shaping achieves near-human performance (81.93%), whereas other reward-shaping and progress-maximization methods struggle; combined with imitation learning, performance matches human levels (97.00%). The study also explores the method's effectiveness in formulating control theory, such as four-wheel independent drive (4WID) systems: with reduced spatial-temporal reasoning, reward shaping can match human performance (89.7%). However, the control theory cannot be trained together with complicated spatial-temporal progress maximization.
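The separation into continuous tendency rewards and discrete milestone rewards can be sketched as follows; the coefficients and the progress signal are illustrative stand-ins for the paper's room-of-adjust design:

```python
def shaped_reward(progress, prev_progress, milestones_hit):
    """Split the shaped reward into a continuous tendency term (long-term
    guidance along the progress slope) and sparse milestone bonuses
    (short-term exploration targets); coefficients are illustrative."""
    tendency = 1.0 * (progress - prev_progress)   # dense, follows the slope
    milestone = 10.0 * milestones_hit             # discrete checkpoint bonuses
    return tendency + milestone

r = shaped_reward(progress=0.42, prev_progress=0.40, milestones_hit=1)
```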
With the increasing penetration of renewable energy, power grid operators are observing fast and large fluctuations in power and voltage profiles on a daily basis. Fast and accurate control actions derived in real time are vital to ensure system security and economy. To this end, solving alternating current (AC) optimal power flow (OPF) with operational constraints remains an important yet challenging optimization problem for secure and economic operation of the power grid. This paper adopts a novel method to derive fast OPF solutions using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can greatly assist power grid operators in making rapid and effective decisions. The presented method uses imitation learning to generate initial weights for the neural network (NN) and a proximal policy optimization algorithm to train and test stable and robust artificial intelligence (AI) agents. Training and testing are conducted on the IEEE 14-bus and the Illinois 200-bus systems. The results show the effectiveness of the method, with significant potential for assisting power grid operators in real-time operations.
Funding: Supported by the State Grid Science and Technology Program "Research on Real-time Autonomous Control Strategies for Power Grid Based on AI Technologies" (No. 5700-201958523A-0-0-00).
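Using imitation learning to generate initial NN weights typically amounts to supervised pretraining on solved OPF cases before PPO fine-tuning. A sketch under that reading (the regression loss and the mapping from grid state to dispatch are assumptions):

```python
import torch
from torch import nn

def pretrain_with_imitation(policy_net, expert_states, expert_actions,
                            epochs=50, lr=1e-3):
    """Warm-start the policy network on solved OPF examples
    (grid state -> dispatch) before PPO fine-tuning; the regression
    formulation here is an assumption."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy_net(expert_states)
        loss = nn.functional.mse_loss(pred, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy_net

net = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 5))
net = pretrain_with_imitation(net, torch.randn(256, 28), torch.randn(256, 5))
```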