Funding: supported by the Foundation of Key Laboratory of System Control and Information Processing, Ministry of Education, China (Scip20240111); the Aeronautical Science Foundation of China (Grant 2024Z071108001); and the Foundation of Key Laboratory of Traffic Information and Safety of Anhui Higher Education Institutes, Anhui Sanlian University (KLAHEI18018).
Abstract: This paper employs the PPO (Proximal Policy Optimization) algorithm to study the risk-hedging problem of Shanghai Stock Exchange (SSE) 50ETF options. First, the action and state spaces were designed based on the characteristics of the hedging task, and a reward function was developed from the cost function of the options. Second, drawing on the concept of curriculum learning, the agent was guided through a simulated-to-real learning approach to the dynamic hedging task, reducing the learning difficulty and addressing the shortage of option data. On this basis, a dynamic hedging strategy for 50ETF options was constructed. Finally, numerical experiments demonstrate that the designed algorithm outperforms traditional hedging strategies in hedging effectiveness.
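The abstract does not spell out the reward, but a cost-based hedging reward of the kind it describes typically combines step profit-and-loss with transaction-cost and risk penalties. The minimal Python sketch below illustrates one such form; the function name and the coefficients `kappa` and `risk_aversion` are hypothetical placeholders, not the paper's actual cost function.

```python
def hedging_reward(pnl_step: float, trade_size: float,
                   kappa: float = 0.01, risk_aversion: float = 0.5) -> float:
    """Illustrative per-step reward for a hedging agent (assumed form):
    step P&L, minus a proportional transaction cost on the rebalancing
    trade, minus a quadratic penalty that discourages volatile P&L."""
    transaction_cost = kappa * abs(trade_size)
    return pnl_step - transaction_cost - risk_aversion * pnl_step ** 2
```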
Funding: the National Natural Science Foundation of China (No. 62103009).
Abstract: Bionic gait learning of quadruped robots based on reinforcement learning has become a hot research topic. The proximal policy optimization (PPO) algorithm has a low probability of learning a successful gait from scratch due to problems such as reward sparsity. To address this, we propose an experience evolution proximal policy optimization (EEPPO) algorithm, which integrates PPO with prior knowledge highlighted by an evolutionary strategy. We use successfully trained samples as prior knowledge to guide the learning direction and thereby increase the success probability of the learning algorithm. To verify the effectiveness of the proposed EEPPO algorithm, we conducted simulation experiments on the quadruped robot gait-learning task in PyBullet. Experimental results show that the central pattern generator based radial basis function (CPG-RBF) network and the policy network are updated simultaneously to accomplish the quadruped robot's bionic diagonal trot gait-learning task, using key information such as the robot's speed, posture, and joint states. Comparison with the traditional soft actor-critic (SAC) algorithm validates the superiority of the proposed EEPPO algorithm, which learns a more stable diagonal trot gait on flat terrain.
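One way to read EEPPO's use of "successfully trained samples as prior knowledge" is an elite buffer that retains the best-returning rollouts for reuse during updates. The sketch below shows such a buffer under that assumption; `EliteBuffer` and its parameters are illustrative names, not the paper's implementation.

```python
import random

class EliteBuffer:
    """Keeps the highest-return rollouts as prior knowledge (illustrative).
    Sampled elites can be mixed into PPO updates to guide learning."""
    def __init__(self, capacity: int = 32):
        self.buffer = []          # list of (return, rollout) pairs
        self.capacity = capacity

    def add(self, rollout, ret: float) -> None:
        self.buffer.append((ret, rollout))
        self.buffer.sort(key=lambda pair: pair[0], reverse=True)
        del self.buffer[self.capacity:]   # evict the weakest rollouts

    def sample(self, k: int = 4):
        return random.sample(self.buffer, min(k, len(self.buffer)))
```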
Funding: supported by the National Key R&D Program of China (2020YFB0905900): Research on artificial intelligence application of power internet of things.
Abstract: The coordinated optimization problem of the electricity-gas-heat integrated energy system (IES) is strongly coupled, non-convex, and nonlinear. Centralized optimization incurs a high communication cost and complex modeling, while traditional numerical iterative solutions handle neither uncertainty nor solution efficiency well, making them difficult to apply online. For the coordinated optimization of the electricity-gas-heat IES, this study constructs a model of the distributed IES with a dynamic distribution factor and transforms the centralized optimization problem into a distributed one in a multi-agent reinforcement learning environment using the multi-agent deep deterministic policy gradient. Introducing the dynamic distribution factor allows the system to account for the impact of real-time supply-and-demand changes on system optimization, dynamically coordinating different energy sources for complementary utilization and effectively improving system economy. Compared with centralized optimization, the distributed model with multiple decision centers achieves similar results while easing the pressure on system communication. The proposed method considers the dual uncertainty of renewable energy and load during training. Compared with the traditional iterative solution method, it copes better with uncertainty and realizes real-time decision making, which is conducive to online application. Finally, we verify the effectiveness of the proposed method on an example of an IES coupling three energy hub agents.
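As an illustration of what a dynamic distribution factor might look like, the sketch below splits a demand signal across energy sources in proportion to their real-time availability. This is an assumed proportional form for intuition only; the paper's factor may be defined differently.

```python
import numpy as np

def dynamic_distribution_factor(available: np.ndarray, demand: float) -> np.ndarray:
    """Allocate `demand` across sources proportionally to real-time
    availability (assumed form; requires available.sum() > 0)."""
    share = available / available.sum()   # per-source distribution factor
    return share * demand                 # per-source power assignment
```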
Funding: the National Key Research and Development Program of Science and Technology, China (No. 2018YFB1305304); the Shanghai Science and Technology Pilot Project, China (No. 19511132100); the National Natural Science Foundation, China (No. 51475334); the Shanghai Sailing Program, China (No. 20YF1453000); and the Fundamental Research Funds for the Central Universities, China (No. 22120200048).
Abstract: This paper studies distributed policy evaluation in multi-agent reinforcement learning. Under cooperative settings, each agent obtains only a local reward, while all agents share a common environmental state. To optimize the global return as the sum of the local returns, the agents exchange information with their neighbors through a communication network. The mean squared projected Bellman error minimization problem is reformulated as a constrained convex optimization problem with a consensus constraint; then, a distributed alternating direction method of multipliers (ADMM) algorithm is proposed to solve it. Furthermore, an inexact step for ADMM is used to achieve efficient computation at each iteration. The convergence of the proposed algorithm is established.
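For intuition, the consensus-ADMM structure the abstract describes can be sketched on the simpler problem of minimizing a sum of local least-squares objectives under a consensus constraint (a common stand-in once the projected Bellman error is written in least-squares form). The code below is a generic consensus ADMM, with an exact rather than inexact local step, and is not the paper's algorithm.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho: float = 1.0, iters: int = 100):
    """Solve min_x sum_i ||A_i x - b_i||^2 via ADMM with local copies x_i
    and a consensus variable z (generic sketch)."""
    n, d = len(A_list), A_list[0].shape[1]
    x = np.zeros((n, d)); u = np.zeros((n, d)); z = np.zeros(d)
    for _ in range(iters):
        for i in range(n):
            # Local step: (2 A_i^T A_i + rho I) x_i = 2 A_i^T b_i + rho (z - u_i)
            H = 2 * A_list[i].T @ A_list[i] + rho * np.eye(d)
            rhs = 2 * A_list[i].T @ b_list[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(H, rhs)
        z = (x + u).mean(axis=0)   # consensus (averaging) step
        u += x - z                 # dual (multiplier) update
    return z
```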
Funding: supported by State Grid Corporation Technology Project (No. 522437250003).
Abstract: Hydrogen energy is a crucial support for China's low-carbon energy transition. With the large-scale integration of renewable energy, combining hydrogen with integrated energy systems has become one of the most promising directions of development. This paper proposes an optimized scheduling model for a hydrogen-coupled electro-heat-gas integrated energy system (HCEHG-IES) using generative adversarial imitation learning (GAIL). The model aims to enhance renewable-energy absorption, reduce carbon emissions, and improve grid-regulation flexibility. First, the optimal scheduling problem of the HCEHG-IES under uncertainty is modeled as a Markov decision process (MDP). To overcome the limitations of conventional deep reinforcement learning algorithms (long optimization time, slow convergence, and subjective reward design), this study augments the PPO algorithm with a discriminator network and expert data. The resulting algorithm, termed GAIL, enables the agent to perform imitation learning from expert data. Based on this model, dynamic scheduling decisions are made in continuous state and action spaces, generating optimal energy-allocation and management schemes. Simulation results indicate that, compared with traditional reinforcement-learning algorithms, the proposed algorithm offers better economic performance. Guided by expert data, the agent avoids blind optimization, shortens the offline training time, and improves convergence. In the online phase, the algorithm enables flexible energy utilization, thereby promoting renewable-energy absorption and reducing carbon emissions.
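A minimal sketch of the GAIL ingredient the abstract adds to PPO: a discriminator scores (state, action) pairs, and the agent is rewarded for being indistinguishable from the expert. The network size and the softplus surrogate reward below follow the standard GAIL formulation and are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Outputs a logit separating expert (state, action) pairs from
    agent-generated ones (illustrative architecture)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))

def imitation_reward(disc: Discriminator, s: torch.Tensor, a: torch.Tensor):
    # -log(1 - sigmoid(logit)) == softplus(logit): higher when the agent
    # fools the discriminator (standard GAIL surrogate reward).
    with torch.no_grad():
        return torch.nn.functional.softplus(disc(s, a)).squeeze(-1)
```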
Funding: supported in part by the National Natural Science Foundation of China under grants 61901078, 61771082, 61871062, and U20A20157; in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under grant KJQN201900609; in part by the Natural Science Foundation of Chongqing under grant cstc2020jcyj-zdxmX0024; in part by the University Innovation Research Group of Chongqing under grant CXQT20017; and in part by the China University Industry-University-Research Collaborative Innovation Fund (Future Network Innovation Research and Application Project) under grant 2021FNA04008.
Abstract: To guarantee the heterogeneous delay requirements of diverse vehicular services, it is necessary to design a fully cooperative policy for both Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) links. This paper investigates reducing the delay of edge information sharing over V2V links while satisfying the delay requirements of the V2I links. Specifically, a mean-delay minimization problem and a maximum-individual-delay minimization problem are formulated to improve the global network performance and to ensure the fairness of a single user, respectively. A multi-agent reinforcement learning framework is designed to solve these two problems, where a new reward function is proposed to evaluate the utilities of the two optimization objectives in a unified framework. Thereafter, a proximal policy optimization approach is proposed to enable each V2V user to learn its policy from the shared global network reward. The effectiveness of the proposed approach is finally validated by comparing the obtained results with those of other baseline approaches through extensive simulation experiments.
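The abstract's unified reward for the two objectives is not given explicitly; one plausible form blends the mean and worst-case V2V delays and penalizes V2I budget violations, as in the hedged sketch below (the weight and penalty values are placeholders).

```python
def unified_reward(v2v_delays, v2i_delay, v2i_budget,
                   weight: float = 0.5, penalty: float = 10.0) -> float:
    """Illustrative shared reward: trades off mean V2V delay (global
    performance) against worst-case V2V delay (per-user fairness), and
    penalizes exceeding the V2I delay budget."""
    mean_term = -sum(v2v_delays) / len(v2v_delays)
    worst_term = -max(v2v_delays)
    reward = weight * mean_term + (1 - weight) * worst_term
    if v2i_delay > v2i_budget:
        reward -= penalty   # V2I requirement violated
    return reward
```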
Funding: financial support from the National Natural Science Foundation of China (Grant No. 61601491); the Natural Science Foundation of Hubei Province, China (Grant No. 2018CFC865); and a Military Research Project of China (Grant No. YJ2020B117).
Abstract: To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. First, the hunting environment and a kinematic model without boundary constraints are built, and the criteria for successful target capture are given. Then, the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting Proximal Policy Optimization (DPOMH-PPO) algorithm applicable to USVs is proposed. In addition, an observation model, a reward function, and an action space suited to multi-target hunting tasks are designed. To handle the dynamically changing dimension of the observational features produced by partially observable systems, a feature embedding block is proposed: by combining two feature compression methods, column-wise max pooling (CMP) and column-wise average pooling (CAP), a fixed-size observational feature encoding is established. Finally, the centralized training and decentralized execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and performs actions independently. Simulation experiments verify the effectiveness of the DPOMH-PPO algorithm in test scenarios with different numbers of USVs. Moreover, the advantages of the proposed model are analyzed comprehensively in terms of algorithm performance, transfer across task scenarios, and self-organization capability after damage, verifying the potential deployment and application of DPOMH-PPO in real environments.
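The CMP/CAP feature embedding is concrete enough to sketch: both poolings operate column-wise over a variable number of observed entities, and their outputs are concatenated into a fixed-size encoding. The PyTorch sketch below follows that reading; dimensions and naming are illustrative.

```python
import torch

def embed_observations(feats: torch.Tensor) -> torch.Tensor:
    """feats: (num_entities, feat_dim), where num_entities varies per step.
    Column-wise max pooling (CMP) and column-wise average pooling (CAP)
    compress the variable-size observation into a fixed-size vector
    (our reading of the abstract; details are illustrative)."""
    cmp = feats.max(dim=0).values   # (feat_dim,) strongest response per column
    cap = feats.mean(dim=0)         # (feat_dim,) average response per column
    return torch.cat([cmp, cap])    # (2 * feat_dim,) fixed-size encoding
```

Because both poolings are permutation-invariant and size-agnostic, the policy network downstream sees a constant input dimension regardless of how many targets each USV currently observes.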
Funding: support received from the National Natural Science Foundation of China (Grant Nos. 52372398 & 62003272).
Abstract: Dynamic soaring, inspired by the wind-riding flight of birds such as albatrosses, is a biomimetic technique that leverages wind fields to enhance the endurance of unmanned aerial vehicles (UAVs). Achieving a precise soaring trajectory is crucial for maximizing energy efficiency during flight. Existing nonlinear programming methods depend heavily on the choice of initial values, which are hard to determine. This paper therefore introduces a deep reinforcement learning method based on a differentially flat model for dynamic soaring trajectory planning and optimization. First, the gliding trajectory is parameterized using Fourier basis functions, achieving a flexible trajectory representation with a minimal number of hyperparameters. Then, the trajectory optimization problem is formulated as the dynamic interactive process of a Markov decision process. The hyperparameters of the trajectory are optimized using the Proximal Policy Optimization (PPO2) algorithm from deep reinforcement learning (DRL), reducing the strong reliance on initial value settings. Finally, a comparison with the nonlinear programming method shows that the trajectory generated by the proposed approach is smoother while meeting the same performance requirements. Specifically, the proposed method achieves a 34% reduction in maximum thrust, a 39.4% decrease in maximum thrust difference, and a 33% reduction in maximum airspeed difference.
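A truncated Fourier series whose coefficients serve as the tunable hyperparameters is one standard way to realize the parameterization the abstract describes. The sketch below evaluates one trajectory coordinate from such a series; the coefficient layout is an assumption.

```python
import numpy as np

def fourier_trajectory(coeffs: np.ndarray, period: float, t: np.ndarray) -> np.ndarray:
    """One trajectory coordinate from a truncated Fourier series.
    coeffs = [a0, a1, b1, a2, b2, ...] (length 2K + 1) are the
    hyperparameters the RL agent would optimize (illustrative layout)."""
    a0, a, b = coeffs[0], coeffs[1::2], coeffs[2::2]
    k = np.arange(1, len(a) + 1)
    omega = 2 * np.pi * k / period          # harmonic frequencies
    phases = np.outer(t, omega)             # (len(t), K)
    return a0 + np.sum(a * np.cos(phases) + b * np.sin(phases), axis=1)
```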
Funding: Project supported by the National Key R&D Program of China (No. 2018AAA0101400); the National Natural Science Foundation of China (Nos. 61973074, U1713209, 61520106009, and 61533008); the Science and Technology on Information System Engineering Laboratory (No. 05201902); and the Fundamental Research Funds for the Central Universities, China.
Abstract: We use the advanced proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy that achieves speed control of a "model-free" quadrotor. The model is controlled by four learned neural networks, which directly map the system states to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, the speed-tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising both offline and online learning is developed for practical use. A model with strong generalization ability is learned in the offline phase; the flight policy is then continuously optimized in the online phase. Finally, the performance of the proposed algorithm is compared with that of the traditional PID algorithm.
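The integral compensator can be pictured as augmenting the policy's observation with a clipped integral of the tracking error, so steady-state error becomes visible to the network. The sketch below illustrates that idea; the gain, clipping limit, and interface are assumed, not taken from the paper.

```python
import numpy as np

class IntegralCompensator:
    """Augments the observation with the accumulated speed-tracking error
    so the policy can cancel steady-state error (a sketch of the idea)."""
    def __init__(self, dim: int, gain: float = 0.05, limit: float = 5.0):
        self.integral = np.zeros(dim)
        self.gain, self.limit = gain, limit

    def augment(self, obs: np.ndarray, error: np.ndarray, dt: float) -> np.ndarray:
        # Accumulate the error, clipped for anti-windup.
        self.integral = np.clip(self.integral + error * dt,
                                -self.limit, self.limit)
        return np.concatenate([obs, self.gain * self.integral])
```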
Funding: supported in part by the National Natural Science Foundation of China under Grant Nos. 62003245, 72171172, and 92367101; the National Natural Science Foundation of China Basic Science Research Center Program under Grant No. 62088101; the Aeronautical Science Foundation of China under Grant No. 2023Z066038001; and the Shanghai Municipal Science and Technology Major Project under Grant No. 2021SHZDZX0100.
Abstract: This paper studies distributed policy gradients in collaborative multi-agent reinforcement learning (MARL), where agents communicating over a network aim to find an optimal policy that maximizes the average of all agents' local returns. To address the challenges of high variance and bias in stochastic policy gradients for MARL, this paper proposes a distributed policy gradient method with variance reduction, combined with gradient tracking to correct the bias resulting from the difference between local and global gradients. The authors also utilize importance sampling to address the distribution-shift problem in the sampling process. The authors then show that the proposed algorithm finds an ε-approximate stationary point, where the convergence depends on the number of iterations, the mini-batch size, the epoch size, the problem parameters, and the network topology. The authors further establish the sample and communication complexity of obtaining an ε-approximate stationary point. Finally, numerical experiments validate the effectiveness of the proposed algorithm.
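For reference, the gradient-tracking correction the abstract mentions has a standard two-recursion form over a doubly stochastic mixing matrix W: parameters mix and step along the tracker y, while y mixes and accumulates the change in local gradients so that it follows the network-average gradient. The sketch below shows one synchronous round; the paper's variance-reduction and importance-sampling components are omitted.

```python
import numpy as np

def gradient_tracking_step(W: np.ndarray, theta: np.ndarray, y: np.ndarray,
                           grads_new: np.ndarray, grads_old: np.ndarray,
                           lr: float = 0.01):
    """One round of distributed gradient tracking (generic sketch).
    W: (n, n) doubly stochastic mixing matrix; theta, y, grads: (n, d),
    one row per agent; grads_old are local gradients at the previous iterate."""
    theta_next = W @ theta - lr * y            # consensus step + descent along tracker
    y_next = W @ y + grads_new - grads_old     # tracker follows the average gradient
    return theta_next, y_next
```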
Abstract: As the share of renewable generation keeps rising, the short-duration, large-scale power ramping events it triggers are becoming more frequent, so studying ramping-power allocation strategies across multiple types of energy storage is of great significance for guarding against extreme ramping risk and ensuring stable system operation. This paper proposes a power-allocation optimization strategy for multiple types of energy storage oriented to urgent ramping demands, introducing deep reinforcement learning (DRL) to balance the accuracy and timeliness of power allocation. First, taking adiabatic compressed air energy storage (A-CAES), wind power combined with storage, and thermal power combined with flywheel storage as representatives, the complementary ramping characteristics of the storage types are analyzed, with emphasis on the nonlinear thermodynamic-pneumatic coupling of A-CAES and the transient response of wind-turbine rotor kinetic energy in the wind-storage system; a ramping-power response model for the multiple storage types is built accordingly. Second, the power-allocation optimization problem is transformed into a Markov decision process suitable for DRL, training mechanisms such as dynamic learning-rate decay, policy entropy, and state normalization are introduced, and a proximal-policy-optimization-based ramping-power allocation strategy for multi-type energy storage in power systems is proposed. Finally, case studies are conducted under multiple ramping scenarios. The results show that the proposed allocation strategy fully exploits the regulation advantages of each storage type and improves the flexibility, accuracy, and timeliness of ramping-power allocation.
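Two of the listed training mechanisms are easy to make concrete: online state normalization via Welford's running statistics and a linear learning-rate decay. The sketch below shows generic versions of both; the schedule endpoints are placeholders, and the policy-entropy bonus (the third mechanism) is the usual entropy term added to the PPO loss.

```python
import numpy as np

class RunningNormalizer:
    """Online state normalization with Welford's running mean/variance,
    a common PPO training trick (generic version)."""
    def __init__(self, dim: int, eps: float = 1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps
        self.eps = eps

    def __call__(self, x: np.ndarray) -> np.ndarray:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count
        return (x - self.mean) / np.sqrt(self.var + self.eps)

def decayed_lr(step: int, total: int, lr0: float = 3e-4,
               lr_min: float = 1e-5) -> float:
    """Linear learning-rate decay from lr0 down to lr_min (assumed schedule)."""
    frac = min(step / total, 1.0)
    return lr0 + frac * (lr_min - lr0)
```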