Abstract: The missile interception problem can be regarded as a two-person zero-sum differential game, whose solution depends on the Hamilton-Jacobi-Isaacs (HJI) equation. It has been proven impossible to obtain a closed-form solution due to the nonlinearity of the HJI equation, and many iterative algorithms have been proposed to solve it. The simultaneous policy updating algorithm (SPUA) is an effective algorithm for solving the HJI equation, but it is an on-policy integral reinforcement learning (IRL) method: its online implementation requires the disturbance signals to be adjustable, which is unrealistic. In this paper, an off-policy IRL algorithm based on SPUA is proposed that makes no use of any knowledge of the system dynamics. A neural-network-based online adaptive critic implementation scheme of the off-policy IRL algorithm is then presented. Based on the online off-policy IRL method, a computational intelligence interception guidance (CIIG) law is developed for intercepting highly maneuvering targets. As a model-free method, it achieves interception by measuring system data online. The effectiveness of the CIIG law is verified in two missile-target engagement scenarios.
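For orientation, a generic form of the HJI equation for such a two-player zero-sum game is recalled below; the dynamics f, g, k, the state penalty Q(x), and the attenuation level gamma are placeholders for reference and are not the specific engagement model used in the paper.

```latex
% Generic HJI equation for dynamics  \dot{x} = f(x) + g(x)u + k(x)w  and value
% V(x) = \int ( Q(x) + u^T R u - \gamma^2 w^T w ) dt  (reference form only).
\[
0 = Q(x) + \nabla V^{T} f(x)
  - \tfrac{1}{4}\,\nabla V^{T} g(x)\,R^{-1} g^{T}(x)\,\nabla V
  + \tfrac{1}{4\gamma^{2}}\,\nabla V^{T} k(x)\,k^{T}(x)\,\nabla V .
\]
```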
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62122043, 62192753, 62433020, and T2293770, and the Natural Science Foundation of Shandong Province for Distinguished Young Scholars under Grant No. ZR2022JQ31.
Abstract: This paper considers value iteration algorithms for stochastic zero-sum linear quadratic games with unknown dynamics. On-policy and off-policy learning algorithms are developed to solve the stochastic zero-sum games without requiring knowledge of the system dynamics. By analyzing the value function iterations, the convergence of the model-based algorithm is shown, and the equivalence of several types of value iteration algorithms is established. The effectiveness of the model-free algorithms is demonstrated by a numerical example.
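As a hedged illustration of what a model-based value iteration of this kind looks like, the sketch below iterates the game Riccati recursion for a deterministic discrete-time zero-sum LQ game; the stochastic setting and the model-free on-/off-policy variants treated in the paper are not reproduced here, and the matrices A, B, D, Q, R and the level gamma are assumed inputs.

```python
# Minimal sketch: value iteration for a deterministic discrete-time zero-sum
# LQ game  x_{k+1} = A x_k + B u_k + D w_k  with stage cost
# x'Qx + u'Ru - gamma^2 w'w.  Convergence assumes the game is well posed
# (gamma large enough for a saddle point to exist).
import numpy as np

def zero_sum_lq_value_iteration(A, B, D, Q, R, gamma, iters=500, tol=1e-9):
    n = A.shape[0]
    P = np.zeros((n, n))                       # V_0(x) = 0
    for _ in range(iters):
        # Joint stationarity conditions of the minimizing (u) and maximizing (w) players.
        G = np.block([
            [R + B.T @ P @ B, B.T @ P @ D],
            [D.T @ P @ B, -gamma**2 * np.eye(D.shape[1]) + D.T @ P @ D],
        ])
        H = np.vstack([B.T @ P @ A, D.T @ P @ A])
        P_next = Q + A.T @ P @ A - H.T @ np.linalg.solve(G, H)
        if np.linalg.norm(P_next - P) < tol:
            return P_next
        P = P_next
    return P                                    # approximate game value x'Px
```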
Funding: Supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).
Abstract: To alleviate the extrapolation error and instability inherent in Q-functions learned directly by off-policy Q-learning (QL-style) on static datasets, this article uses on-policy state-action-reward-state-action (SARSA-style) updates to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style update. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style learning, facilitating a stable estimate of the Q-function. Even in the presence of sampling errors from limited data, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be guaranteed theoretically. In addition, the sub-optimality of the learned optimal policy stems only from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC can rapidly learn robust and effective task-solving policies owing to its stable estimate of the Q-value, outperforming state-of-the-art offline RL methods by at least 15%.
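The abstract does not give the exact losses, so the following is only a speculative sketch of how a SARSA-regularized critic target of this flavor could be assembled; the mixing weight alpha, the penalty weight beta, and the squared-action penalty are illustrative assumptions, not OPRAC's actual formulation.

```python
# Speculative sketch of an on-policy-regularized critic target: the QL-style
# backup (policy action) is mixed with a SARSA-style backup (dataset action)
# and penalized by the mismatch between the two actions.
import torch
import torch.nn.functional as F

def critic_loss(q_net, target_q_net, policy, batch, gamma, alpha, beta):
    s, a, r, s2, a2_data, done = batch            # a2_data: next action stored in the dataset
    with torch.no_grad():
        a2_pi = policy(s2)                        # off-policy (QL-style) bootstrap action
        q_ql = target_q_net(s2, a2_pi)            # optimistic, off-policy backup
        q_sarsa = target_q_net(s2, a2_data)       # conservative, on-policy backup
        # Penalize disagreement between policy actions and dataset actions.
        penalty = ((a2_pi - a2_data) ** 2).sum(-1, keepdim=True)
        backup = (1 - alpha) * q_ql + alpha * q_sarsa - beta * penalty
        y = r + gamma * (1 - done) * backup
    return F.mse_loss(q_net(s, a), y)
```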
Abstract: Occupant-centric control (OCC) is an indoor climate control approach in which occupant feedback is used in the sequence of operation of building energy systems. While OCC has been used in a wide range of building applications, one OCC category that has received considerable research interest is learning occupants' thermal preferences from their thermostat interactions and adapting temperature setpoints accordingly. Many recent studies have used reinforcement learning (RL) as an agent for OCC to optimize energy use and occupant comfort. These studies relied on predicted mean vote (PMV) models or constant comfort ranges to represent comfort, and only a few of them used thermostat interactions. This paper addresses this gap by introducing a new off-policy RL algorithm that imitates occupant behaviour by utilizing unsolicited occupant thermostat overrides. The algorithm is tested with a number of synthetically generated occupant behaviour models implemented via the Python API of EnergyPlus. The simulation results indicate that the RL algorithm could rapidly learn preferences for all tested occupant behaviour scenarios with minimal exploration events. While substantial energy savings were observed for most occupant scenarios, the impact on energy savings varied depending on occupants' preferences and the stochasticity of their thermostat use behaviour.
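As a toy illustration of the idea (not the paper's algorithm), the sketch below treats unsolicited thermostat overrides as negative feedback in a tabular off-policy update over candidate setpoints; the state encoding, weights, and setpoint grid are invented for the example.

```python
# Toy illustration: a tabular off-policy update in which an unsolicited
# thermostat override acts as a large negative reward for the current
# setpoint, and energy use as a running cost.  All weights are assumptions.
from collections import defaultdict

SETPOINTS = [20.0, 21.0, 22.0, 23.0, 24.0]    # candidate setpoints (deg C)

def update(q, state, setpoint, energy_kwh, override_occurred, next_state,
           alpha=0.1, gamma=0.9, w_energy=0.1, w_override=5.0):
    reward = -w_energy * energy_kwh - (w_override if override_occurred else 0.0)
    best_next = max(q[(next_state, sp)] for sp in SETPOINTS)
    q[(state, setpoint)] += alpha * (reward + gamma * best_next - q[(state, setpoint)])

q = defaultdict(float)
# Example call (hypothetical state labels):
# update(q, "occupied_morning", 22.0, energy_kwh=0.6,
#        override_occurred=True, next_state="occupied_morning")
```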
Funding: Project supported by the National Key R&D Program of China (No. 2018YFB1702300) and the National Natural Science Foundation of China (Nos. 61722312 and 61533017).
Abstract: This paper presents a novel optimal synchronization control method for multi-agent systems with input saturation. Multi-agent game theory is introduced to transform the optimal synchronization control problem into a multi-agent nonzero-sum game. The Nash equilibrium can then be achieved by solving the coupled Hamilton–Jacobi–Bellman (HJB) equations with nonquadratic input energy terms. A novel off-policy reinforcement learning method is presented to obtain the Nash equilibrium solution without the system models, and critic neural networks (NNs) and actor NNs are introduced to implement the presented method. Theoretical analysis shows that the iterative control laws converge to the Nash equilibrium. Simulation results show the good performance of the presented method.
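A commonly used way to encode input saturation in such HJB equations is a nonquadratic input energy term; the standard form is recalled below for reference, assuming a saturation bound lambda, and it is not claimed to match the paper's exact error dynamics or weights.

```latex
% Standard nonquadratic input-energy term for a saturation bound |u_i| <= \lambda
% (reference form from the constrained-HJB literature, shown for agent i):
\[
W(u_i) \;=\; 2\int_{0}^{u_i} \lambda\,\big(\tanh^{-1}(v/\lambda)\big)^{T} R_i \,\mathrm{d}v ,
\qquad
u_i^{*} \;=\; -\lambda \tanh\!\Big(\tfrac{1}{2\lambda}\,R_i^{-1} g_i^{T}(x)\,\nabla V_i(x)\Big),
\]
where $V_i$ solves the $i$-th coupled HJB equation
\[
0 \;=\; \nabla V_i^{T}\Big(f(x)+\textstyle\sum_{j} g_j(x)\,u_j^{*}\Big) + x^{T}Q_i x + W(u_i^{*}).
\]
```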