Effective partitioning is crucial for enabling parallel restoration of power systems after blackouts. This paper proposes a novel partitioning method based on deep reinforcement learning. First, the partitioning decision process is formulated as a Markov decision process (MDP) model to maximize the modularity. Corresponding key partitioning constraints on parallel restoration are considered. Second, based on the partitioning objective and constraints, the reward function of the partitioning MDP model is set by adopting a relative deviation normalization scheme to reduce mutual interference between the reward and penalty in the reward function. A soft bonus scaling mechanism is introduced to mitigate overestimation caused by abrupt jumps in the reward. Then, the deep Q network method is applied to solve the partitioning MDP model and generate partitioning schemes. Two experience replay buffers are employed to speed up the training process of the method. Finally, case studies on the IEEE 39-bus test system demonstrate that the proposed method can generate a high-modularity partitioning result that meets all key partitioning constraints, thereby improving the parallelism and reliability of the restoration process. Moreover, simulation results demonstrate that an appropriate discount factor is crucial for ensuring both the convergence speed and the stability of the partitioning training.
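The abstract above does not give the formulas behind the relative deviation normalization or the soft bonus scaling; a minimal Python sketch of one plausible reading follows, where each reward and penalty term is normalized by its observed running range before being combined, and a tanh squashing tames abrupt bonus jumps. The function names, the range-based normalization, and the tanh choice are illustrative assumptions, not the paper's actual definitions.

```python
import math

def normalize_relative_deviation(value, running_min, running_max):
    """Map a raw reward/penalty term to [0, 1] by its observed range,
    so reward and penalty magnitudes cannot drown each other out."""
    if running_max == running_min:
        return 0.0
    return (value - running_min) / (running_max - running_min)

def soft_bonus(raw_bonus, scale=1.0):
    """Squash an abrupt bonus jump with tanh so the Q-network does not
    overestimate values near rare, spiky rewards (illustrative choice)."""
    return scale * math.tanh(raw_bonus / scale)

def shaped_reward(modularity_gain, constraint_penalty, mod_range, pen_range, bonus):
    """Combine normalized reward and penalty with a soft-scaled bonus."""
    r = normalize_relative_deviation(modularity_gain, *mod_range)
    p = normalize_relative_deviation(constraint_penalty, *pen_range)
    return r - p + soft_bonus(bonus)
```

Because both terms live on the same [0, 1] scale, neither the modularity reward nor the constraint penalty can dominate purely through its raw magnitude, which is one way to read "reducing mutual interference."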
Current damage detection methods based on model updating and sensitivity Jacobian matrices show a low convergence ratio and low computational efficiency for online calculations. The aim of this paper is to construct a real-time automated damage detection method by developing a theory-assisted adaptive multiagent twin delayed deep deterministic (TA2-MATD3) policy gradient algorithm. First, the theoretical framework of reinforcement-learning-driven damage detection is established. To address the disadvantages of the traditional multiagent twin delayed deep deterministic (MATD3) method, a theory-assisted mechanism and an adaptive experience replay mechanism are introduced. Moreover, a historical residential house built in 1889 was taken as an example, using its 12-month structural health monitoring data. TA2-MATD3 was compared with existing damage detection methods in terms of convergence ratio, online computing efficiency, and damage detection accuracy. The results show that the computational efficiency of TA2-MATD3 is approximately 117–160 times that of the traditional methods. The convergence ratio of damage detection on the training set is approximately 97%, and that on the test set is in the range of 86.2%–91.9%. In addition, the main apparent damage found in the field survey was identified by TA2-MATD3. The results indicate that the proposed method can significantly improve online computing efficiency and damage detection accuracy. This research provides novel perspectives on using reinforcement learning methods for damage detection in online structural health monitoring.
Unmanned Aerial Vehicles (UAVs) play a vital role in military operations. In a variety of battlefield mission scenarios, UAVs are required to fly safely to designated locations without human intervention. Therefore, a suitable method for solving the UAV Autonomous Motion Planning (AMP) problem can improve the success rate of UAV missions to a certain extent. In recent years, many studies have used Deep Reinforcement Learning (DRL) methods to address the AMP problem and have achieved good results. From the perspective of sampling, this paper designs a double-screening sampling method, combines it with the Deep Deterministic Policy Gradient (DDPG) algorithm, and proposes the Relevant Experience Learning-DDPG (REL-DDPG) algorithm. REL-DDPG uses a Prioritized Experience Replay (PER) mechanism to break the correlation of consecutive experiences in the experience pool, selects the experiences most similar to the current state for learning, by analogy with how humans learn from relevant examples, and thereby strengthens the influence of the learning process on action selection at the current state. All experiments are conducted in a complex unknown simulation environment constructed from the parameters of a real UAV. Training experiments show that REL-DDPG improves both the convergence speed and the converged result compared to the state-of-the-art DDPG algorithm, while testing experiments show the applicability of the algorithm and investigate its performance under different parameter conditions.
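The double-screening sampler itself is not specified in the abstract; the toy buffer below sketches its core idea of retrieving the stored transitions whose states are nearest the current state, alongside ordinary uniform sampling. The buffer layout and the Euclidean similarity measure are assumptions for illustration, not the paper's design.

```python
import random

class RelevantReplayBuffer:
    """Toy replay buffer that, besides uniform sampling, can return the
    k stored transitions whose states are nearest the current state."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []  # each item: (state, action, reward, next_state)

    def add(self, transition):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # drop the oldest experience
        self.buffer.append(transition)

    @staticmethod
    def _distance(s1, s2):
        # Euclidean distance between two state tuples (assumed metric).
        return sum((a - b) ** 2 for a, b in zip(s1, s2)) ** 0.5

    def sample_uniform(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

    def sample_relevant(self, current_state, k):
        # First screen: rank by state similarity. A second screen in a
        # PER-style scheme would then filter by TD-error priority (omitted).
        ranked = sorted(self.buffer,
                        key=lambda t: self._distance(t[0], current_state))
        return ranked[:k]
```

In practice the similarity ranking would be combined with PER priorities, which is where the "double" in double-screening presumably comes from.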
In this paper, we present a novel data-driven design method for the human-robot interaction (HRI) system, where a given task is achieved through cooperation between the human and the robot. The presented HRI controller design is a two-level approach consisting of a task-oriented performance optimization design and a plant-oriented impedance controller design. The task-oriented design minimizes human effort and guarantees perfect task tracking in the outer loop, while the plant-oriented design achieves the desired impedance from the human to the robot manipulator end-effector in the inner loop. Data-driven reinforcement learning techniques are used for performance optimization in the outer loop to assign the optimal impedance parameters. In the inner loop, a velocity-free filter is designed to avoid the requirement of end-effector velocity measurement. On this basis, an adaptive controller is designed to achieve the desired impedance of the robot manipulator in the task space. Simulations and experiments on a robot manipulator are conducted to verify the efficacy of the presented HRI design framework.
This paper presents a new optimal adaptive backstepping control approach for nonlinear systems under deception attacks via reinforcement learning. The nonlinear terms in the studied system make it very difficult to design the optimal controller using traditional methods. To achieve optimal control, an RL algorithm based on a critic–actor architecture is considered for the nonlinear system. Due to the significant security risks of network transmission, the system is vulnerable to deception attacks, which can make all system states unavailable. By using the attacked states to design a coordinate transformation, the harm brought by unknown deception attacks is overcome. The presented control strategy ensures that all signals in the closed-loop system are semi-globally ultimately bounded. Finally, a simulation experiment demonstrates the effectiveness of the strategy.
Due to the fading characteristics of wireless channels and the burstiness of data traffic, how to deal with congestion in Ad-hoc networks with effective algorithms is still an open and challenging problem. In this paper, we focus on enabling congestion control to minimize network transmission delays through flexible power control. To effectively solve the congestion problem, we propose a distributed cross-layer scheduling algorithm empowered by graph-based multi-agent deep reinforcement learning. The transmit power is adaptively adjusted in real time by our algorithm based only on local information (i.e., channel state information and queue length) and local communication (i.e., information exchanged with neighbors). Moreover, the training complexity of the algorithm is low due to regional cooperation based on the graph attention network. In the evaluation, we show that our algorithm can reduce the transmission delay of data flows under severe signal interference and drastically changing channel states, and we demonstrate its adaptability and stability across different topologies. The method is general and can be extended to various types of topologies.
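As a hedged illustration of the local-information principle this abstract describes, the snippet below sketches a single attention-style aggregation step: a node combines its own feature vector with softmax-weighted neighbor features, using nothing beyond information exchanged with neighbors. The scalar score function stands in for the learned graph attention network and is an assumption, not the paper's model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_neighbors(own, neighbors, score_fn):
    """Attention-style aggregation over local information only:
    each neighbor feature vector is weighted by the softmax of a
    compatibility score, then added to the node's own features."""
    if not neighbors:
        return own
    alphas = softmax([score_fn(own, nb) for nb in neighbors])
    agg = [sum(a * nb[i] for a, nb in zip(alphas, neighbors))
           for i in range(len(own))]
    return [o + g for o, g in zip(own, agg)]
```

A power-control policy head would then map the aggregated features (e.g., queue lengths and channel states) to a transmit-power decision, all without any global coordination.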
In this paper, a new algorithm combining the features of bi-directional evolutionary structural optimization (BESO) and reinforcement learning (RL) is proposed for continuum structural topology optimization (STO). In contrast to conventional approaches, which only generate a single quasi-optimal solution, the goal of the combined method is to provide designers with multiple quasi-optimal solutions, in the spirit of generative design. Two key components are adopted. First, besides sensitivity, a value function updated by Monte Carlo reinforcement learning is utilized to measure the importance of each element, which makes the solving process convergent and closer to the optimum. Second, an ε-greedy policy adds a random perturbation to the main search direction so as to extend the search ability. Finally, the quality and diversity of solutions are guaranteed by controlling the value of compliance as well as the Intersection-over-Union (IoU). Results on several 2D and 3D compliance minimization problems, including a geometrically nonlinear case, show that the combined method is capable of generating a group of good and distinct solutions that satisfy various possible requirements in engineering design within acceptable computation cost.
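Two of the ingredients named above, the ε-greedy perturbation and the IoU diversity check, are standard enough to sketch. In the snippet below, the element `scores` stand in for the combined sensitivity/value measure; how the paper actually mixes sensitivity with the learned value function is not stated here, so that combination is an assumption.

```python
import random

def epsilon_greedy_select(scores, epsilon=0.1, rng=random):
    """Pick the index of the highest-scoring element most of the time,
    but with probability epsilon pick a random one to diversify search."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])

def iou(design_a, design_b):
    """Intersection-over-Union of two binary element layouts, used to
    check that generated topologies actually differ from one another."""
    inter = sum(1 for a, b in zip(design_a, design_b) if a and b)
    union = sum(1 for a, b in zip(design_a, design_b) if a or b)
    return inter / union if union else 1.0
```

A low pairwise IoU between accepted designs, together with a compliance threshold, is one concrete way to enforce the "quality and diversity" criterion the abstract mentions.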
Many policy-improving systems for Reinforcement Learning (RL) agents have been proposed that adapt quickly to environmental change by using statistical methods such as mixture models of Bayesian networks, mixture probabilities, and clustering distributions. However, such methods increase the computational complexity, and other methods still need better adaptation to more complex environments such as multi-layer environments. In this study, we used the profit-sharing method for the agent to learn its policy, and added a mixture probability into the RL system to recognize changes in the environment and appropriately improve the agent's policy to adjust to the changing environment. We also introduced clustering, which enables a smaller, suitable selection, in order to reduce the computational complexity while maintaining the system's performance. The experimental results show that the agent successfully learned the policy and efficiently adjusted to changes in a multi-layer environment. Finally, the computational complexity and the decline in effectiveness of policy improvement were controlled by the proposed system.
Measurement-while-drilling (MWD) and guidance technologies have been extensively deployed in the exploitation of oil, natural gas, and other energy resources. Conventional control approaches are plagued by challenges, including limited anti-interference capabilities and insufficient generalization of decision-making experience. To address the intricate problem of directional well trajectory control, an intelligent algorithm design framework grounded in the high-level interaction mechanism between geology and engineering is put forward. This framework aims to facilitate the rapid batch migration and updating of drilling strategies. The proposed directional well trajectory control method comprehensively considers the multi-source heterogeneous attributes of drilling experience data, leverages generative simulation of the geological drilling environment, and promptly constructs a trajectory control model that self-adapts to environmental variations. The construction is carried out on three hierarchical levels: offline pre-drilling learning, online during-drilling interaction, and post-drilling model transfer. Simulation results indicate that the guidance model derived from this method demonstrates remarkable generalization performance and accuracy. It can significantly boost the adaptability of the control algorithm to diverse environments and enhance the penetration rate of the target reservoir during drilling operations.
In unmanned aerial vehicle (UAV) applications, efficient multi-target coverage with reliable connectivity is critical for reconnaissance, search and rescue, and environmental monitoring [1]. However, real-world deployments face two major challenges: restricted airspace (no-fly zones, NFZs) that constrains trajectories, and limited communication ranges that require team connectivity [2]. Existing potential field, geometric, or decentralized connectivity methods address these objectives separately [3–5], but they struggle with scalability and fail to ensure safe and efficient coverage in dynamic NFZ environments.
Accurate estimation of the background error covariance matrix, denoted B, remains a critical challenge in numerical weather prediction (NWP), directly influencing data assimilation (DA) performance and forecast accuracy. Although hybrid ensemble-variational (EnVar) methods combine static and flow-dependent matrices to improve assimilation, their effectiveness is constrained by empirically fixed weights. To address this limitation, we propose DRL-EnVar, an adaptive hybrid EnVar DA method enhanced with deep reinforcement learning. DRL-EnVar integrates deep learning (DL) components, including a novel cyclic convolution module to extract abstract features from the data, and employs reinforcement learning (RL) to dynamically optimize the hybrid weighting strategy. The system adaptively combines multiple ensemble-based flow-dependent matrices with one or more static matrices to construct a time-varying hybrid matrix B that better reflects real-time background errors. Experimental results demonstrate that DRL-EnVar outperforms the traditional ensemble Kalman filter (EnKF) and hybrid covariance DA (HCDA) methods, especially under sparse observations or transitional changes in state variables. It achieves competitive or superior assimilation accuracy at lower computational cost, and can be flexibly integrated into both three-dimensional variational (3DVar) and four-dimensional variational (4DVar) assimilation frameworks. Overall, DRL-EnVar offers a novel and efficient approach to adaptive DA, particularly valuable for improving forecast skill during transitional weather regimes.
This paper investigates adaptive optimal tracking control (AOTC) for underactuated surface vessels (USVs). In contrast to the majority of existing studies, the control strategy in this paper innovatively combines an extended state observer (ESO) with reinforcement learning (RL). The designed ESO has high estimation accuracy and robust disturbance rejection for the unmeasurable information of USVs. To obtain the AOTC, actor–critic (AC) networks based on RL are constructed to solve the Hamilton–Jacobi–Bellman (HJB) equations. Due to the uncertainties, it is challenging to obtain the optimal controller by directly solving the HJB equations. To address this issue, this paper employs neural networks (NNs) to approximate the uncertainties and solves for the optimal controller via AC-RL and the ESO. In addition, the adaptive parameters of the optimal controller are trained in parallel with the AC networks, which ensures that the trained networks further improve tracking performance. The boundedness of the AOTC for USVs is shown by the Lyapunov stability theorem. Finally, simulation results demonstrate the effectiveness of the proposed algorithm.
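The paper's ESO design is not reproduced here, but a minimal discrete-time linear ESO for a scalar plant illustrates the general mechanism the abstract relies on: one observer state (z1) tracks the measured output while a second (z2) estimates the lumped, unmeasured disturbance. The first-order plant, the gains, and the Euler step size below are illustrative choices, not the paper's USV model.

```python
def eso_step(z1, z2, y, u, b0, beta1, beta2, dt):
    """One Euler step of a second-order linear ESO for the scalar plant
    y' = f + b0*u, where z1 estimates y and z2 the lumped disturbance f."""
    e = y - z1
    z1 += dt * (z2 + b0 * u + beta1 * e)
    z2 += dt * (beta2 * e)
    return z1, z2

def simulate(d=0.5, dt=1e-3, steps=5000):
    """Run the ESO against a toy plant x' = u + d with an unknown
    constant disturbance d, returning the final true and estimated states."""
    x, z1, z2 = 0.0, 0.0, 0.0
    u = 0.0
    for _ in range(steps):
        x += dt * (u + d)                         # true plant, d unknown to the ESO
        z1, z2 = eso_step(z1, z2, x, u, 1.0, 20.0, 100.0, dt)
    return x, z1, z2
```

With beta1 = 20 and beta2 = 100 the error dynamics place a double pole at -10, so both the output estimate and the disturbance estimate converge within a few tenths of a second; the converged z2 is what a controller would feed back to cancel the disturbance.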
Exploration strategy design is a challenging problem in reinforcement learning (RL), especially when the environment contains a large state space or sparse rewards. During exploration, the agent tries to discover unexplored (novel) areas or high-reward (quality) areas. Most existing methods perform exploration by utilizing only the novelty of states. The novelty and quality in the neighboring area of the current state have not been well utilized to simultaneously guide the agent's exploration. To address this problem, this paper proposes a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both the novelty and quality of the neighboring area (cluster) of the current state is given to the agent. CRL leverages these bonus rewards to guide the agent toward efficient exploration. Moreover, CRL can be combined with existing exploration strategies to improve their performance, as the bonus rewards employed by those strategies capture only the novelty of states. Experiments on four continuous control tasks and six hard-exploration Atari-2600 games show that our method outperforms other state-of-the-art methods and achieves the best performance.
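The clustering algorithm and bonus formula are not given in the abstract; the toy class below conveys the idea with coarse grid bucketing standing in for a real clustering step, and a simple convex mix of inverse-count novelty and mean-reward quality. Both the bucketing and the 50/50 weighting are assumptions for illustration, not CRL's actual construction.

```python
from collections import defaultdict

class ClusterBonus:
    """Toy cluster-based exploration bonus: states are bucketed into
    clusters (here by coarse rounding), and the bonus mixes novelty
    (inverse visit count) with quality (mean reward seen in the cluster)."""

    def __init__(self, cell=1.0, novelty_weight=0.5):
        self.cell = cell
        self.w = novelty_weight
        self.counts = defaultdict(int)
        self.reward_sums = defaultdict(float)

    def _cluster(self, state):
        # Stand-in for a learned clustering: snap each coordinate to a grid.
        return tuple(round(s / self.cell) for s in state)

    def update(self, state, reward):
        c = self._cluster(state)
        self.counts[c] += 1
        self.reward_sums[c] += reward

    def bonus(self, state):
        c = self._cluster(state)
        n = self.counts[c]
        novelty = 1.0 / (1.0 + n)                  # high for rarely visited clusters
        quality = self.reward_sums[c] / n if n else 0.0
        return self.w * novelty + (1.0 - self.w) * quality
```

An unvisited cluster yields a pure-novelty bonus, while a frequently visited but high-reward cluster keeps a nonzero bonus through its quality term, which is the dual signal the abstract argues novelty-only methods miss.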
With the existing pivot rules, the simplex method for linear programming is not polynomial in the worst case. Therefore, the optimal pivot of the simplex method is crucial. In this paper, we propose an optimal rule, based on Monte Carlo tree search, to find all the shortest pivot paths of the simplex method for linear programming problems. Specifically, we first propose the SimplexPseudoTree to transfer the simplex method into a tree-search mode while avoiding repeated basis variables. Second, we propose four reinforcement learning models, with two actions and two rewards, to make Monte Carlo tree search suitable for the simplex method. Third, we set a new action selection criterion to ameliorate the inaccurate evaluation in the initial exploration. It is proved that when the number of vertices in the feasible region is C_n^m, our method can generate all the shortest pivot paths, whose lengths are polynomial in the number of variables. In addition, we experimentally validate that the proposed scheme can avoid unnecessary search and provide the optimal pivot path. Furthermore, this method can provide the best pivot labels for all kinds of supervised learning methods for solving linear programming problems.
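The paper's new action selection criterion is not spelled out in the abstract; the standard UCB1 rule below shows the kind of visit-count-aware criterion that damps inaccurate early evaluations, by forcing every action to be tried at least once before exploitation can dominate. The exploration constant c is an illustrative choice, and this is not claimed to be the authors' criterion.

```python
import math

def ucb1_select(visit_counts, value_sums, total_visits, c=1.4):
    """UCB1 action selection for Monte Carlo tree search: unvisited
    actions are tried first; otherwise the mean value is balanced against
    an exploration term that shrinks as an action is visited more."""
    for i, n in enumerate(visit_counts):
        if n == 0:
            return i  # force at least one visit per action
    def score(i):
        mean = value_sums[i] / visit_counts[i]
        explore = c * math.sqrt(math.log(total_visits) / visit_counts[i])
        return mean + explore
    return max(range(len(visit_counts)), key=score)
```

Early in the search the exploration term dominates, so a single misleading rollout cannot lock the tree onto one pivot branch, which is precisely the failure mode a refined selection criterion is meant to address.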
Funding (deep-reinforcement-learning power system partitioning paper): funded by the Beijing Engineering Research Center of Electric Rail Transportation.
Funding (TA2-MATD3 damage detection paper): supported by the National Key Research and Development Program of China (2023YFF0906100); the National Natural Science Foundation of China (52408008); the Key Research and Development Program of Jiangsu Province (BE2022833).
Funding (REL-DDPG UAV motion planning paper): co-supported by the National Natural Science Foundation of China (Nos. 62003267, 61573285); the Aeronautical Science Foundation of China (ASFC) (No. 20175553027); the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2020JQ-220).
Funding (data-driven HRI controller design paper): supported in part by the National Natural Science Foundation of China (61903028); the Youth Innovation Promotion Association, Chinese Academy of Sciences (2020137); the Lifelong Learning Machines Program from DARPA/Microsystems Technology Office; the Army Research Laboratory (W911NF-18-2-0260).
Funding (optimal adaptive backstepping control paper): supported in part by the National Key R&D Program of China under Grant 2021YFE0206100; the National Natural Science Foundation of China under Grant 62073321; the National Defense Basic Scientific Research Program (JCKY2019203C029); the Science and Technology Development Fund, Macao SAR, under Grants FDCT-22-009-MISE, 0060/2021/A2 and 0015/2020/AMJ; the National Defense Basic Scientific Research Project (JCKY2020130C025).
Funding (cross-layer congestion control paper): supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155885, Artificial Intelligence Convergence Innovation Human Resources Development (Hanyang University ERICA)); the National Natural Science Foundation of China under Grant No. 61971264; the National Natural Science Foundation of China/Research Grants Council Collaborative Research Scheme under Grant No. 62261160390.
Funding: Supported by the National Key R&D Program of China (No. 2019YFA0708304), the CNPC Innovation Fund (No. 2022DQ02-0609), and the Scientific Research and Technology Development Project of CNPC (No. 2022DJ4507).
Abstract: Measurement-while-drilling (MWD) and guidance technologies have been extensively deployed in the exploitation of oil, natural gas, and other energy resources. Conventional control approaches are plagued by challenges including limited anti-interference capability and insufficient generalization of decision-making experience. To address the intricate problem of directional well trajectory control, an intelligent algorithm design framework grounded in the high-level interaction mechanism between geology and engineering is put forward. This framework aims to facilitate the rapid batch migration and updating of drilling strategies. The proposed directional well trajectory control method comprehensively considers the multi-source heterogeneous attributes of drilling experience data, leverages generative simulation of the geological drilling environment, and promptly constructs a trajectory control model that self-adapts to environmental variations. The construction is carried out on three hierarchical levels: offline pre-drilling learning, online during-drilling interaction, and post-drilling model transfer. Simulation results indicate that the guidance model derived from this method demonstrates remarkable generalization performance and accuracy. It can significantly boost the adaptability of the control algorithm to diverse environments and enhance the penetration rate of the target reservoir during drilling operations.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62303297, 62273223, 62503083, 62336005, 62421004, 62461160313), the Shanghai Sailing Program (Grant No. 23YF1413100), and the Shanghai Municipal Commission of Education (Grant No. 24SG38).
Abstract: In unmanned aerial vehicle (UAV) applications, efficient multi-target coverage with reliable connectivity is critical for reconnaissance, search and rescue, and environmental monitoring [1]. However, real-world deployments face two major challenges: restricted airspace (no-fly zones, NFZs) that constrains trajectories, and limited communication ranges that require team connectivity [2]. Existing potential-field, geometric, or decentralized connectivity methods address these objectives separately [3-5], but they struggle with scalability and fail to ensure safe and efficient coverage in dynamic NFZ environments.
Funding: Project supported by the National Key R&D Program of China (No. 2022YFB3207304), the National Natural Science Foundation of China (No. 42205161), and the Natural Science Foundation of Hunan Province, China (No. 2023JJ30630).
Abstract: Accurate estimation of the background error covariance matrix, denoted B, remains a critical challenge in numerical weather prediction (NWP), directly influencing data assimilation (DA) performance and forecast accuracy. Although hybrid ensemble-variational (EnVar) methods combine static and flow-dependent matrices to improve assimilation, their effectiveness is constrained by empirically fixed weights. To address this limitation, we propose DRL-EnVar, an adaptive hybrid EnVar DA method enhanced with deep reinforcement learning. DRL-EnVar integrates deep learning (DL) components, including a novel cyclic convolution module that extracts abstract features from the data, and employs reinforcement learning (RL) to dynamically optimize the hybrid weighting strategy. The system adaptively combines multiple ensemble-based flow-dependent matrices with one or more static matrices to construct a time-varying hybrid matrix B that better reflects real-time background errors. Experimental results demonstrate that DRL-EnVar outperforms the traditional ensemble Kalman filter (EnKF) and hybrid covariance DA (HCDA) methods, especially under sparse observations or transitional changes in state variables. It achieves competitive or superior assimilation accuracy at lower computational cost and can be flexibly integrated into both three-dimensional (3DVar) and four-dimensional (4DVar) variational assimilation frameworks. Overall, DRL-EnVar offers a novel and efficient approach to adaptive DA, particularly valuable for improving forecast skill during transitional weather regimes.
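The core hybridization step, blending one or more static matrices with ensemble-based flow-dependent matrices under adaptive weights, can be sketched as a convex combination. In DRL-EnVar the weights would come from the RL policy at each assimilation cycle; here they are simply passed in, and the normalization choice is an assumption for illustration.

```python
import numpy as np

def hybrid_b(b_static, b_ensembles, weights):
    """Combine a static covariance matrix with several flow-dependent
    ensemble covariance matrices into one hybrid matrix B.

    `weights` holds one weight for the static matrix followed by one per
    ensemble matrix; they are normalized to a convex combination so the
    result stays a valid covariance (a simplified sketch of the hybrid
    step, not the paper's exact formulation).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                         # enforce a convex combination
    mats = [np.asarray(b_static)] + [np.asarray(m) for m in b_ensembles]
    return sum(wi * m for wi, m in zip(w, mats))
```

Fixed-weight hybrid EnVar corresponds to calling this with constant `weights`; the paper's contribution is making those weights time-varying and state-dependent.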
Funding: Supported by the National Natural Science Foundation of China under Grants 62203338, 62173259, and U1913602, the Zhejiang Provincial Natural Science Foundation of China under Grant LZ24F0390006, and the Postdoctoral Science Foundation of China under Grant 2022M722485.
Abstract: This paper investigates adaptive optimal tracking control (AOTC) for underactuated surface vessels (USVs). In contrast to the majority of existing studies, the control strategy in this paper innovatively combines an extended state observer (ESO) with reinforcement learning (RL). The designed ESO provides high estimation accuracy and robust disturbance rejection for the unmeasurable states of the USV. To obtain the AOTC, actor-critic (AC) networks based on RL are constructed to solve the Hamilton-Jacobi-Bellman (HJB) equations. Due to the uncertainties, it is challenging to obtain the optimal controller by directly solving the HJB equations. To address this issue, this paper employs neural networks (NNs) to approximate the uncertainties and solves for the optimal controller via AC-RL and the ESO. In addition, the adaptive parameters of the optimal controller are trained in parallel with the AC networks, which ensures that the trained networks can further improve tracking performance. The boundedness of the AOTC for USVs is shown by the Lyapunov stability theorem. Finally, simulation results demonstrate the effectiveness of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62192783) and the Fundamental Research Funds for the Central Universities (No. 020214380108).
Abstract: Exploration strategy design is a challenging problem in reinforcement learning (RL), especially when the environment has a large state space or sparse rewards. During exploration, the agent tries to discover unexplored (novel) areas or high-reward (quality) areas. Most existing methods perform exploration by utilizing only the novelty of states; the novelty and quality in the neighborhood of the current state have not been well exploited to jointly guide the agent's exploration. To address this problem, this paper proposes a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. CRL uses clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both the novelty and the quality of the current state's neighboring area (cluster) is given to the agent. CRL leverages these bonus rewards to guide the agent toward efficient exploration. Moreover, CRL can be combined with existing exploration strategies to improve their performance, as the bonus rewards employed by those strategies capture only the novelty of states. Experiments on four continuous control tasks and six hard-exploration Atari 2600 games show that our method outperforms other state-of-the-art methods, achieving the best performance.
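The cluster-level bonus described above can be sketched as follows: assign the current state to its nearest cluster, then combine an inverse-visitation novelty term with the cluster's mean return as a quality term. The count-based novelty form and both coefficients are assumptions for illustration, not CRL's exact formulation.

```python
import numpy as np

def cluster_bonus(state, centers, counts, mean_returns, alpha=1.0, beta=0.1):
    """Bonus reward for the current state's cluster, combining novelty
    (inverse visitation count) and quality (mean return) of that cluster.
    A minimal sketch of the idea in CRL.
    """
    # assign the state to its nearest cluster center
    k = int(np.argmin(np.linalg.norm(np.asarray(centers) - np.asarray(state), axis=1)))
    novelty = alpha / np.sqrt(counts[k] + 1)   # rarely visited clusters score high
    quality = beta * mean_returns[k]           # high-return clusters score high
    return novelty + quality
```

Adding this bonus to the environment reward is what lets the agent prefer regions that are either under-visited or known to pay well, rather than chasing novelty alone.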
Funding: Supported by the National Key R&D Program of China (Grant No. 2021YFA1000403), the National Natural Science Foundation of China (Grant No. 11991022), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA27000000), and the Fundamental Research Funds for the Central Universities.
Abstract: Under the existing pivot rules, the simplex method for linear programming is not polynomial in the worst case; the choice of pivot is therefore crucial. In this paper, we propose an optimal rule, based on Monte Carlo tree search, to find all the shortest pivot paths of the simplex method for linear programming problems. Specifically, we first propose the SimplexPseudoTree to transform the simplex method into a tree-search mode while avoiding repeated basis variables. Second, we propose four reinforcement learning models with two actions and two rewards to make Monte Carlo tree search suitable for the simplex method. Third, we set a new action selection criterion to ameliorate the inaccurate evaluations in the initial exploration. It is proved that when the number of vertices in the feasible region is C_n^m, our method can generate all the shortest pivot paths, which is polynomial in the number of variables. In addition, we experimentally validate that the proposed scheme avoids unnecessary search and provides the optimal pivot path. Furthermore, this method can provide the best pivot labels for all kinds of supervised learning methods for solving linear programming problems.
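Monte Carlo tree search selects which branch (here, which entering variable) to simulate next by balancing estimated value against visit counts. The abstract proposes its own selection criterion for the initial exploration; the sketch below shows the standard UCB1 rule as a reference point, with the mapping of candidates to pivots being an illustrative assumption.

```python
import math

def ucb_select(children, c=1.4):
    """UCB1 action selection as used in generic Monte Carlo tree search.

    `children` maps a candidate pivot (entering variable) to a pair
    (visit count, total simulated value); unvisited candidates are
    expanded first.  This is the textbook rule, not the paper's new
    selection criterion.
    """
    total = sum(n for n, _ in children.values())
    best, best_score = None, -math.inf
    for pivot, (n, w) in children.items():
        if n == 0:
            return pivot                      # expand unvisited nodes first
        # exploitation (mean value) + exploration (visit-count) terms
        score = w / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = pivot, score
    return best
```

The paper's criterion replaces this score to reduce the effect of inaccurate early value estimates, but the surrounding search loop (select, expand, simulate, back up) is unchanged.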