Journal Articles
25 articles found
Dynamic hedging of 50ETF options using Proximal Policy Optimization
1
Authors: Lei Liu, Mengmeng Hao, Jinde Cao. Journal of Automation and Intelligence, 2025, Issue 3, pp. 198-206.
This paper employs the Proximal Policy Optimization (PPO) algorithm to study the risk-hedging problem of Shanghai Stock Exchange (SSE) 50ETF options. First, the action and state spaces were designed based on the characteristics of the hedging task, and a reward function was developed according to the cost function of the options. Second, drawing on the concept of curriculum learning, the agent was guided through a simulated-to-real learning approach for dynamic hedging tasks, reducing the learning difficulty and addressing the issue of insufficient option data; a dynamic hedging strategy for 50ETF options was thus constructed. Finally, numerical experiments demonstrate the superiority of the designed algorithm over traditional hedging strategies in terms of hedging effectiveness.
Keywords: B-S model; option hedging; reinforcement learning; 50ETF; proximal policy optimization (PPO)
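As background for the PPO-based entries in this listing, a minimal NumPy sketch of the clipped surrogate objective that PPO optimizes; the values below are illustrative and not drawn from any of the papers:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss: L = -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

# Illustrative probability ratios pi_new/pi_old and advantages
ratio = np.array([0.9, 1.1, 1.5])      # the third ratio exceeds the clip range
advantage = np.array([1.0, 1.0, 1.0])
loss = ppo_clip_loss(ratio, advantage)
# With positive advantages, the third ratio is clipped to 1.2,
# so the loss is -(0.9 + 1.1 + 1.2) / 3
```

The clipping keeps each policy update close to the behavior policy, which is why PPO is a common default in the applications listed here.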
Gait Learning Reproduction for Quadruped Robots Based on Experience Evolution Proximal Policy Optimization
2
Authors: LI Chunyang, ZHU Xiaoqing, RUAN Xiaogang, LIU Xinyuan, ZHANG Siyuan. Journal of Shanghai Jiaotong University (Science), 2025, Issue 6, pp. 1125-1133.
Bionic gait learning of quadruped robots based on reinforcement learning has become a hot research topic. The proximal policy optimization (PPO) algorithm has a low probability of learning a successful gait from scratch due to problems such as reward sparsity. To address this, we propose an experience evolution proximal policy optimization (EEPPO) algorithm, which integrates PPO with prior knowledge highlighted by an evolutionary strategy. We use successfully trained samples as prior knowledge to guide the learning direction and thereby increase the success probability of the learning algorithm. To verify the effectiveness of EEPPO, we conducted simulation experiments on the quadruped robot gait learning task in PyBullet. Experimental results show that the central pattern generator-based radial basis function (CPG-RBF) network and the policy network are updated simultaneously to achieve the robot's bionic diagonal trot gait, using key information such as the robot's speed, posture, and joint states. Comparison with the traditional soft actor-critic (SAC) algorithm validates the superiority of the proposed EEPPO algorithm, which learns a more stable diagonal trot gait on flat terrain.
Keywords: quadruped robot; proximal policy optimization (PPO); prior knowledge; evolutionary strategy; bionic gait learning
Proximal policy optimization with an integral compensator for quadrotor control (cited 6 times)
3
Authors: Huan HU, Qing-ling WANG. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2020, Issue 5, pp. 777-795.
We use the advanced proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy for speed control of a "model-free" quadrotor. The model is controlled by four learned neural networks, which directly map the system states to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, speed-tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising both offline and online learning is developed for practical use: a model with strong generalization ability is learned in the offline phase, and the flight policy is then continuously optimized in the online phase. Finally, the performance of the proposed algorithm is compared with that of the traditional PID algorithm.
Keywords: reinforcement learning; proximal policy optimization; quadrotor control; neural network
C-SPPO:A deep reinforcement learning framework for large-scale dynamic logistics UAV routing problem
4
Authors: Fei WANG, Honghai ZHANG, Sen DU, Mingzhuang HUA, Gang ZHONG. Chinese Journal of Aeronautics, 2025, Issue 5, pp. 296-316.
The Unmanned Aerial Vehicle (UAV) is a burgeoning electric transportation carrier holding substantial promise for the logistics sector. A reinforcement learning framework, Centralized-S Proximal Policy Optimization (C-SPPO), based on a centralized decision process and accounting for policy entropy (S), is proposed. The framework aims to plan the best scheduling scheme with the objective of minimizing both the timeout of order requests and the flight impact of UAVs that may lead to conflicts. In this framework, matching intents are generated from the observations of UAV agents, and the final conflict-free matching results are output under the guidance of a centralized decision maker. A pre-activation operation is also introduced to further enhance cooperation among UAV agents. Simulation experiments based on real-world data from New York City show that the proposed C-SPPO outperforms the baseline algorithms in Average Delay Time (ADT), Maximum Delay Time (MDT), Order Delay Rate (ODR), Average Flight Distance (AFD), and Flight Impact Ratio (FIR). Furthermore, the framework scales to scenarios of different sizes without requiring additional training.
Keywords: unmanned aerial vehicle; vehicle routing problem; order delivery; reinforcement learning; multi-agent; proximal policy optimization
Deep Reinforcement Learning-based Multi-Objective Scheduling for Distributed Heterogeneous Hybrid Flow Shops with Blocking Constraints
5
Authors: Xueyan Sun, Weiming Shen, Jiaxin Fan, Birgit Vogel-Heuser, Fandi Bi, Chunjiang Zhang. Engineering, 2025, Issue 3, pp. 278-291.
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) that minimizes total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the usual objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve decision quality; multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies offer significant improvements over the basic PPO, and that IPPO outperforms state-of-the-art scheduling methods in both convergence and solution quality.
Keywords: multi-objective Markov decision process; multi-agent deep reinforcement learning; proximal policy optimization; distributed hybrid flow-shop scheduling; blocking constraints
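The vector-valued reward with dynamic weights described in this entry can be illustrated with a generic linear scalarization sketch; the weight values and update scheme below are assumptions for illustration, not the paper's IPPO mechanism:

```python
import numpy as np

def scalarize(reward_vec, weights):
    """Collapse a multi-objective reward vector into a scalar via normalized weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # keep weights on the simplex
    return float(np.dot(weights, reward_vec))

# Reward vector: (negative tardiness, negative energy consumption)
r = np.array([-3.0, -1.0])
early = scalarize(r, [0.8, 0.2])   # early training: emphasize tardiness
late = scalarize(r, [0.5, 0.5])    # later: balance both objectives
# early = -2.6, late = -2.0
```

Shifting the weights over training changes which objective dominates the gradient signal without altering the underlying reward vector.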
A novel trajectories optimizing method for dynamic soaring based on deep reinforcement learning
6
Authors: Wanyong Zou, Ni Li, Fengcheng An, Kaibo Wang, Changyin Dong. Defence Technology, 2025, Issue 4, pp. 99-108.
Dynamic soaring, inspired by the wind-riding flight of birds such as albatrosses, is a biomimetic technique that leverages wind fields to enhance the endurance of unmanned aerial vehicles (UAVs). Achieving a precise soaring trajectory is crucial for maximizing energy efficiency during flight. Existing nonlinear programming methods depend heavily on the choice of initial values, which is hard to determine. This paper therefore introduces a deep reinforcement learning method based on a differentially flat model for dynamic soaring trajectory planning and optimization. First, the gliding trajectory is parameterized using Fourier basis functions, giving a flexible trajectory representation with a minimal number of hyperparameters. The trajectory optimization problem is then formulated as a dynamic, interactive Markov decision process, and the trajectory hyperparameters are optimized using the Proximal Policy Optimization (PPO2) algorithm from deep reinforcement learning (DRL), reducing the strong reliance on initial values. Finally, a comparison with the nonlinear programming method shows that the trajectory generated by the proposed approach is smoother while meeting the same performance requirements: a 34% reduction in maximum thrust, a 39.4% decrease in maximum thrust difference, and a 33% reduction in maximum airspeed difference.
Keywords: dynamic soaring; differential flatness; trajectory optimization; proximal policy optimization
Meta Reinforcement Learning for Fast Spectrum Sharing in Vehicular Networks
7
Authors: Huang Kai, Liang Le, Jin Shi, Geoffrey Ye Li. China Communications, 2025, Issue 9, pp. 320-332.
In this paper, we investigate the problem of fast spectrum sharing in vehicle-to-everything communication. To improve the spectrum efficiency of the whole system, the spectrum of vehicle-to-infrastructure links is reused by vehicle-to-vehicle links. We model this as a deep reinforcement learning problem and tackle it with proximal policy optimization. A considerable number of interactions are often required to train an agent with good performance, so simulation-based training is commonly used in communication networks. Nevertheless, severe performance degradation may occur when the agent is deployed directly in the real world, even if it performs well on the simulator, because of the reality gap between the simulated and real environments. To address this issue, we make preliminary efforts by proposing an algorithm based on meta reinforcement learning, which enables the agent to rapidly adapt to a new task using knowledge extracted from similar tasks, leading to fewer interactions and less training time. Numerical results show that our method achieves near-optimal performance and exhibits rapid convergence.
Keywords: meta reinforcement learning; proximal policy optimization; spectrum sharing; V2X communication
Efficient and fair PPO-based integrated scheduling method for multiple tasks of SATech-01 satellite (cited 1 time)
8
Authors: Qi SHI, Lu LI, Ziruo FANG, Xingzi BI, Huaqiu LIU, Xiaofeng ZHANG, Wen CHEN, Jinpei YU. Chinese Journal of Aeronautics (SCIE, EI, CAS, CSCD), 2024, Issue 2, pp. 417-430.
SATech-01 is an experimental satellite for space science exploration and on-orbit demonstration of advanced technologies. The satellite is equipped with 16 experimental payloads and supports multiple working modes to meet the observation requirements of the various payloads. Because of the limitations of the platform's power supply and data storage systems, devising reasonable mission planning schemes to improve the scientific revenue of the payloads is a critical issue. In this article, we formulate the integrated task scheduling of SATech-01 as a multi-objective optimization problem and propose a novel Fair Integrated Scheduling with Proximal Policy Optimization (FIS-PPO) algorithm to solve it. We use multiple decision heads to generate decisions for each task and design an action mask to ensure the schedule meets the platform constraints. Experimental results show that FIS-PPO pushes the capability of the platform to its limit, improving overall observation efficiency by 31.5% compared to the rule-based plans currently in use. Moreover, fairness is considered in the reward design, and our method achieves much better performance in terms of equal task opportunities. Because of its low computational complexity, the task scheduling algorithm has the potential to be deployed directly on board for real-time scheduling in future space projects.
Keywords: satellite observatories; SATech-01; multi-mode platform; scheduling algorithms; reinforcement learning; proximal policy optimization (PPO)
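The action mask mentioned in this entry is a standard trick for enforcing hard constraints in a policy head: invalid actions have their logits sent to negative infinity before the softmax, so they receive zero probability. A minimal sketch, independent of the paper's actual implementation:

```python
import numpy as np

def masked_softmax(logits, mask):
    """Zero out invalid actions by sending their logits to -inf before softmax."""
    masked = np.where(mask, logits, -np.inf)
    z = masked - masked.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([True, False, True, True])    # action 1 violates a platform constraint
probs = masked_softmax(logits, mask)
# probs[1] == 0.0 and the remaining probabilities sum to 1
```

Because the masked action's probability is exactly zero, the agent can never sample a schedule that violates the constraint, regardless of how training proceeds.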
Loyal wingman task execution for future aerial combat: A hierarchical prior-based reinforcement learning approach (cited 1 time)
9
Authors: Jiandong ZHANG, Dinghan WANG, Qiming YANG, Zhuoyong SHI, Longmeng JI, Guoqing SHI, Yong WU. Chinese Journal of Aeronautics (SCIE, EI, CAS, CSCD), 2024, Issue 5, pp. 462-481.
In modern Beyond-Visual-Range (BVR) aerial combat, unmanned loyal wingmen are pivotal, yet their autonomous capabilities are limited. Our study introduces an advanced control algorithm based on hierarchical reinforcement learning to enhance these capabilities for critical missions such as target search, positioning, and relay guidance. Structured as a dual-layer model, the algorithm's lower layer manages basic aircraft maneuvers for optimal flight, while the upper layer processes battlefield dynamics and issues precise navigational commands, enabling accurate navigation and effective reconnaissance for the lead aircraft. Notably, our Hierarchical Prior-augmented Proximal Policy Optimization (HPE-PPO) algorithm employs prior-based training with prior-free execution, accelerating target-positioning training and ensuring robust target reacquisition; it also improves missile relay guidance. By integrating this system with a human-piloted lead aircraft, the paper proposes a potent solution for cooperative aerial warfare. Rigorous experiments demonstrate enhanced survivability and efficiency of loyal wingmen, marking a significant contribution to Unmanned Aerial Vehicle (UAV) formation control research.
Keywords: beyond-visual-range; loyal wingmen; hierarchical prior-augmented proximal policy optimization; unmanned aerial vehicles; warfare
Two-Stage Client Selection Scheme for Blockchain-Enabled Federated Learning in IoT
10
Authors: Xiaojun Jin, Chao Ma, Song Luo, Pengyi Zeng, Yifei Wei. Computers, Materials & Continua (SCIE, EI), 2024, Issue 11, pp. 2317-2336.
Federated learning enables data owners in the Internet of Things (IoT) to collaborate in training models without sharing private data, creating new business opportunities for building a data market; in practice, however, federated learning applications still face problems. Blockchain is decentralized, distributed, and secure: blockchain-enabled federated learning further improves the security and performance of model training while expanding the application scope of federated learning, and blockchain's natural financial attributes help establish a federated learning data market. However, the data for federated learning tasks may be distributed across a large number of resource-constrained IoT devices with differing computing, communication, and storage resources, and the data quality of each device may also vary. How to effectively select clients holding the data a federated learning task requires is therefore a research hotspot. In this paper, a two-stage client selection scheme for blockchain-enabled federated learning is proposed: it first selects clients that satisfy the federated learning task through attribute-based encryption, protecting the attribute privacy of clients, and blockchain nodes then select a subset of clients for local model aggregation using a proximal policy optimization algorithm. Experiments show that the model performance of the two-stage scheme exceeds that of other client selection algorithms when some clients are offline and data quality is poor.
Keywords: blockchain; federated learning; attribute-based encryption; client selection; proximal policy optimization
Secure transmission design for RIS-aided symbiotic radio networks:A DRL approach
11
Authors: Bin Li, Wenshuai Liu, Wancheng Xie. Digital Communications and Networks (CSCD), 2024, Issue 6, pp. 1566-1575.
In this paper, we investigate a Reconfigurable Intelligent Surface (RIS)-assisted secure Symbiotic Radio (SR) network to address information leakage from the primary transmitter (PTx) to potential eavesdroppers. Specifically, the RIS serves as a secondary transmitter in the SR network to secure the communication between the PTx and the Primary Receiver (PRx), while simultaneously transmitting its own information to the PTx by configuring its phase shifts. Considering the presence of multiple eavesdroppers and uncertain channels in practical scenarios, we jointly optimize the active beamforming of the PTx and the phase shifts of the RIS to maximize the secrecy energy efficiency of the RIS-supported SR network while satisfying the quality-of-service requirement and the secure communication rate. To solve this complicated non-convex stochastic optimization problem, we propose a secure beamforming method based on Proximal Policy Optimization (PPO), an efficient deep reinforcement learning algorithm, to find the optimal beamforming strategy against eavesdroppers. Simulation results show that the proposed PPO-based method converges quickly and achieves a secrecy energy efficiency gain of up to 22% over the considered benchmarks.
Keywords: symbiotic radio; reconfigurable intelligent surface; robust transmission; deep reinforcement learning; proximal policy optimization
Research on Gait Switching Method Based on Speed Requirement
12
Authors: Weijun Tian, Kuiyue Zhou, Jian Song, Xu Li, Zhu Chen, Ziteng Sheng, Ruizhi Wang, Jiang Lei, Qian Cong. Journal of Bionic Engineering (CSCD), 2024, Issue 6, pp. 2817-2829.
Real-time gait switching of a quadruped robot as its speed changes is a difficult problem in robotics research, and applying reinforcement learning to it is a novel solution. In this paper, a quadruped robot simulation platform is built on the Robot Operating System (ROS); openai-gym is used as the RL framework, and the Proximal Policy Optimization (PPO) algorithm is used for gait switching. The training task is to learn different gait parameters for different speed inputs, including gait type, gait cycle, gait offset, and gait interval. The trained gait parameters are then fed to a Model Predictive Control (MPC) controller, which computes the joint forces/torques; these are transmitted to the robot's joint motors to control joint rotation, realizing gait switching at different speeds. The robot can thus more realistically imitate the gait transitions of animals: walking at very low speed, trotting at medium speed, and galloping at high speed. The work integrates many factors affecting gait training and applies several reward constraints, including velocity, time, energy, and balance rewards. Each reward is given its own weight, and the instant reward at each training step is the weighted sum of the individual rewards, ensuring the reliability of the training results. Multiple comparative simulation experiments show that the priority of the balance, velocity, energy, and time rewards decreases in that order, that no reward weight exceeds 0.5, and that the training effect is best with three-layer policy and value networks of 64 neurons per layer and a discount factor of 0.99.
Keywords: gait switching; reinforcement learning; proximal policy optimization; MPC controller
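The weighted instant reward this entry describes can be sketched as below; the specific weight values are illustrative assumptions, chosen only to respect the reported priority ordering (balance > velocity > energy > time) and the 0.5 cap on each weight:

```python
def instant_reward(r_balance, r_velocity, r_energy, r_time,
                   w=(0.4, 0.3, 0.2, 0.1)):
    """Instant reward = weighted sum of per-aspect reward terms.

    Weights are illustrative: descending priority, each at most 0.5.
    """
    terms = (r_balance, r_velocity, r_energy, r_time)
    return sum(t * wi for t, wi in zip(terms, w))

# Illustrative per-step terms: good balance, moderate speed tracking,
# small energy and time penalties
r = instant_reward(1.0, 0.5, -0.2, -0.1)
# 0.4*1.0 + 0.3*0.5 + 0.2*(-0.2) + 0.1*(-0.1) = 0.5
```

Keeping every weight below 0.5 prevents any single term from dominating the training signal, which matches the priority analysis reported in the abstract.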
Optimization Scheduling of Hydrogen-Coupled Electro-Heat-Gas Integrated Energy System Based on Generative Adversarial Imitation Learning
13
Authors: Baiyue Song, Chenxi Zhang, Wei Zhang, Leiyu Wan. Energy Engineering, 2025, Issue 12, pp. 4919-4945.
Hydrogen energy is a crucial support for China's low-carbon energy transition. With the large-scale integration of renewable energy, combining hydrogen with integrated energy systems has become one of the most promising directions of development. This paper proposes an optimized scheduling model for a hydrogen-coupled electro-heat-gas integrated energy system (HCEHG-IES) using generative adversarial imitation learning (GAIL). The model aims to enhance renewable-energy absorption, reduce carbon emissions, and improve grid-regulation flexibility. First, the optimal scheduling problem of the HCEHG-IES under uncertainty is modeled as a Markov decision process (MDP). To overcome the limitations of conventional deep reinforcement learning algorithms, including long optimization times, slow convergence, and subjective reward design, this study augments the PPO algorithm with a discriminator network and expert data; the resulting GAIL-based algorithm enables the agent to perform imitation learning from expert data. Based on this model, dynamic scheduling decisions are made in continuous state and action spaces, generating optimal energy-allocation and management schemes. Simulation results indicate that, compared with traditional reinforcement learning algorithms, the proposed algorithm offers better economic performance: guided by expert data, the agent avoids blind exploration, shortens offline training time, and converges better, while in the online phase it enables flexible energy utilization, promoting renewable-energy absorption and reducing carbon emissions.
Keywords: hydrogen energy; optimization dispatch; generative adversarial imitation learning; proximal policy optimization; imitation learning; renewable energy
UAV path planning based on multi-agent deep reinforcement learning (cited 10 times)
14
Authors: 司鹏搏 (Si Pengbo), 吴兵 (Wu Bing), 杨睿哲 (Yang Ruizhe), 李萌 (Li Meng), 孙艳华 (Sun Yanhua). Journal of Beijing University of Technology (CAS, CSCD, PKU Core), 2023, Issue 4, pp. 449-458.
To solve the path planning problem of multiple unmanned aerial vehicles (UAVs) in complex environments, a multi-agent deep reinforcement learning UAV path planning framework is proposed. The framework first models the path planning problem as a partially observable Markov decision process and extends the proximal policy optimization algorithm to multiple agents; by designing the UAVs' state observation space, action space, and reward function, collision-free path planning for multiple UAVs is achieved. Second, to accommodate the limited onboard computing resources of UAVs, a network pruning-based multi-agent proximal policy optimization (NP-MAPPO) algorithm is further proposed, improving training efficiency. Simulation results verify the effectiveness of the proposed multi-UAV path planning framework under various parameter configurations and the superiority of NP-MAPPO in training time.
Keywords: unmanned aerial vehicle (UAV); complex environment; path planning; Markov decision process; multi-agent proximal policy optimization (MAPPO); network pruning (NP)
Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning (cited 4 times)
15
Authors: Jia-yi Liu, Gang Wang, Qiang Fu, Shao-hua Yue, Si-yuan Wang. Defence Technology (SCIE, EI, CAS, CSCD), 2023, Issue 1, pp. 210-219.
Ground-to-air confrontation task assignment is large in scale and must handle many concurrent assignments and random events. Existing task assignment methods, when applied to ground-to-air confrontation, handle complex tasks inefficiently, and multiagent systems suffer from interaction conflicts. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, the paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Building on the optimal assignment strategy and the deep reinforcement learning (DRL) training framework, it adds a multi-head attention mechanism and a staged reward mechanism to the bilateral band-clipping PPO algorithm to address low training efficiency. Simulation experiments on a digital battlefield show that the OGMN architecture combined with PPO-TAGNA obtains higher rewards faster and achieves a higher win ratio; analysis of agent behavior verifies the efficiency, superiority, and resource-utilization rationality of the method.
Keywords: ground-to-air confrontation; task assignment; general and narrow agents; deep reinforcement learning; proximal policy optimization (PPO)
Multi-agent reinforcement learning for edge information sharing in vehicular networks (cited 3 times)
16
Authors: Ruyan Wang, Xue Jiang, Yujie Zhou, Zhidu Li, Dapeng Wu, Tong Tang, Alexander Fedotov, Vladimir Badenko. Digital Communications and Networks (SCIE, CSCD), 2022, Issue 3, pp. 267-277.
To guarantee the heterogeneous delay requirements of diverse vehicular services, it is necessary to design a fully cooperative policy for both Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) links. This paper investigates reducing the delay of edge information sharing for V2V links while satisfying the delay requirements of the V2I links. Specifically, a mean-delay minimization problem and a maximum-individual-delay minimization problem are formulated to improve global network performance and to ensure single-user fairness, respectively. A multi-agent reinforcement learning framework is designed to solve these two problems, with a new reward function that evaluates the utilities of the two optimization objectives in a unified framework. A proximal policy optimization approach then enables each V2V user to learn its policy from the shared global network reward. The effectiveness of the proposed approach is validated against baseline approaches through extensive simulation experiments.
Keywords: vehicular networks; edge information sharing; delay guarantee; multi-agent reinforcement learning; proximal policy optimization
Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning (cited 2 times)
17
Authors: Jiawei Xia, Yasong Luo, Zhikun Liu, Yalun Zhang, Haoran Shi, Zhong Liu. Defence Technology (SCIE, EI, CAS, CSCD), 2023, Issue 11, pp. 80-94.
To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. First, the hunting environment and a kinematic model without boundary constraints are built, and criteria for successful target capture are given. The cooperative hunting problem of a USV fleet is then modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting proximal policy optimization (DPOMH-PPO) algorithm applicable to USVs is proposed, together with an observation model, a reward function, and an action space suited to multi-target hunting tasks. To handle the dynamically changing dimension of the observational features in partially observable systems, a feature embedding block is proposed: combining column-wise max pooling (CMP) and column-wise average pooling (CAP) establishes the observational feature encoding. Finally, a centralized-training, decentralized-execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and acts independently. Simulation experiments verify the effectiveness of DPOMH-PPO in test scenarios with different numbers of USVs. The advantages of the proposed model are analyzed in terms of algorithm performance, transfer across task scenarios, and self-organization capability after damage, supporting potential deployment of DPOMH-PPO in real environments.
Keywords: unmanned surface vehicles; multi-agent deep reinforcement learning; cooperative hunting; feature embedding; proximal policy optimization
A hybrid policy gradient and rule-based control framework for electric vehicle charging (cited 1 time)
18
作者 Brida V.Mbuwir Lennert Vanmunster +1 位作者 Klaas Thoelen Geert Deconinck 《Energy and AI》 2021年第2期1-15,共15页
Recent years have seen a significant increase in the adoption of electric vehicles,and investments in electric vehicle charging infrastructure and rooftop photo-voltaic installations.The ability to delay electric vehi... Recent years have seen a significant increase in the adoption of electric vehicles,and investments in electric vehicle charging infrastructure and rooftop photo-voltaic installations.The ability to delay electric vehicle charging provides inherent flexibility that can be used to compensate for the intermittency of photo-voltaic generation and optimize against fluctuating electricity prices.Exploiting this flexibility,however,requires smart control algorithms capable of handling uncertainties from photo-voltaic generation,electric vehicle energy demand and user’s behaviour.This paper proposes a control framework combining the advantages of reinforcement learning and rule-based control to coordinate the charging of a fleet of electric vehicles in an office building.The control objective is to maximize self-consumption of locally generated electricity and consequently,minimize the electricity cost of electric vehicle charging.The performance of the proposed framework is evaluated on a real-world data set from EnergyVille,a Belgian research institute.Simulation results show that the proposed control framework achieves a 62.5%electricity cost reduction compared to a business-as-usual or passive charging strategy.In addition,only a 5%performance gap is achieved in comparison to a theoretical near-optimal strategy that assumes perfect knowledge on the required energy and user behaviour of each electric vehicle. 展开更多
Keywords: Electric vehicles; Smart charging; Proximal policy optimization; Reinforcement learning
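The hybrid idea in the abstract above — an RL policy proposing charging actions while a rule-based layer guarantees feasibility — can be sketched as a simple safety override. This is an illustrative sketch, not the paper's actual controller: the function name, the kW-based interface, and the deadline-feasibility rule are assumptions.

```python
def rule_based_backup(rl_power_kw, energy_needed_kwh, hours_to_departure,
                      max_power_kw):
    """Override the RL-proposed charging power when it would miss the deadline.

    Illustrative sketch of a hybrid RL + rule-based layer (assumed rule, not
    taken from the paper): the rule enforces feasibility, the RL policy
    optimizes cost within the remaining slack.
    """
    if hours_to_departure <= 0 or energy_needed_kwh <= 0:
        # Nothing left to charge (or the vehicle has departed).
        return 0.0
    # Minimum average power that still delivers the required energy in time.
    min_power_kw = energy_needed_kwh / hours_to_departure
    # Raise the RL action to the feasibility floor, cap at the charger limit.
    return min(max_power_kw, max(rl_power_kw, min_power_kw))
```

For example, if an EV still needs 20 kWh with 4 hours left, the rule lifts a too-timid RL action of 2 kW up to the 5 kW feasibility floor.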
A lightweight, easy-integration reward shaping study for progress maximization in Reinforcement Learning for autonomous driving
19
Authors: Hongze Fu, Kunqiang Qing. Advances in Engineering Innovation, 2024, Issue 8, pp. 31-43
This paper addresses the challenge of sample efficiency in reinforcement learning (RL) for autonomous driving, a domain characterized by long-term dependencies and complex environments. While RL has shown success in various fields, its application to autonomous driving is hindered by the need for numerous samples to learn effective policies. We propose a novel, lightweight reward-shaping method called room-of-adjust to maximize learning progress. This approach separates rewards into continuous tendency rewards for long-term guidance and discrete milestone rewards for short-term exploration. Our method is designed to be easily integrated with other approaches, such as efficient representation, imitation learning, and transfer learning. We evaluate our approach on a hill-climbing task with uneven surfaces, which simulates the spatial-temporal reasoning required in autonomous driving. Results show that our room-of-adjust reward shaping achieves near-human performance (81.93%), whereas other reward shaping and progress maximization methods struggle. When combined with imitation learning, the performance matches human levels (97.00%). The study also explores the method's effectiveness in formulating control theory, such as 4-wheel independent drive (4WID) systems. With reduced spatial-temporal reasoning, reward shaping can match human performance (89.7%). However, control theory cannot be trained together with complicated spatial-temporal progress maximization.
Keywords: Reinforcement learning; Autonomous driving; Proximal policy optimization (PPO); End-to-end learning; Reward shaping
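The split described above — continuous tendency rewards for long-term guidance plus discrete milestone rewards for exploration — can be sketched as a generic shaping function. This is a minimal sketch of the general idea, assuming unit milestone bonuses and progress-delta tendency rewards; the paper's actual room-of-adjust formulation may differ.

```python
def shaped_reward(progress, prev_progress, milestones, hit):
    """Combine a dense tendency term with sparse milestone bonuses.

    Illustrative sketch (assumed form, not the paper's exact reward):
    `progress` is a scalar task-progress measure in [0, 1], `milestones`
    a list of thresholds, `hit` a set tracking already-awarded bonuses.
    """
    # Continuous tendency reward: dense signal for any forward progress.
    tendency = progress - prev_progress
    # Discrete milestone rewards: one-off bonus when a threshold is first crossed.
    bonus = 0.0
    for i, threshold in enumerate(milestones):
        if progress >= threshold and i not in hit:
            hit.add(i)
            bonus += 1.0
    return tendency + bonus
```

Because each milestone pays out only once, the sparse bonuses encourage exploration without letting the agent farm them repeatedly.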
A Data-driven Method for Fast AC Optimal Power Flow Solutions via Deep Reinforcement Learning (Cited: 13)
20
Authors: Yuhao Zhou, Bei Zhang, Chunlei Xu, Tu Lan, Ruisheng Diao, Di Shi, Zhiwei Wang, Wei-Jen Lee. Journal of Modern Power Systems and Clean Energy (SCIE, EI, CSCD), 2020, Issue 6, pp. 1128-1139
With the increasing penetration of renewable energy, power grid operators are observing both fast and large fluctuations in power and voltage profiles on a daily basis. Fast and accurate control actions derived in real time are vital to ensure system security and economics. To this end, solving alternating current (AC) optimal power flow (OPF) with operational constraints remains an important yet challenging optimization problem for the secure and economic operation of the power grid. This paper adopts a novel method to derive fast OPF solutions using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can greatly assist power grid operators in making rapid and effective decisions. The presented method adopts imitation learning to generate initial weights for the neural network (NN), and a proximal policy optimization algorithm to train and test stable and robust artificial intelligence (AI) agents. Training and testing procedures are conducted on the IEEE 14-bus and the Illinois 200-bus systems. The results show the effectiveness of the method, with significant potential for assisting power grid operators in real-time operations.
Keywords: Alternating current (AC) optimal power flow (OPF); Deep reinforcement learning (DRL); Imitation learning; Proximal policy optimization
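Proximal policy optimization, the training algorithm shared by the entries on this page, centers on a clipped surrogate objective. The sketch below shows the standard textbook form of that objective (not any one paper's training code), given per-sample probability ratios and advantage estimates:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (maximized during training).

    `ratio` is pi_new(a|s) / pi_old(a|s) per sample; `advantage` the
    advantage estimates; `eps` the clip range (0.2 is a common default).
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum removes the incentive to move the policy
    # ratio outside [1 - eps, 1 + eps], keeping each update conservative.
    return float(np.minimum(unclipped, clipped).mean())
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2, so the gradient no longer pushes the policy further in that direction.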