Journal Articles
8 articles found
1. Off-policy Q-learning based linear quadratic tracking control algorithm for time-delay systems
Authors: 刘文, 蔚保国, 郝菁, 王卿. 《无线电工程》, 2026, No. 1, pp. 166-176 (11 pages)
For linear discrete-time systems whose mathematical model parameters are unknown, and considering the control-input time delays present in industrial process data, a data-driven algorithm is proposed to solve the linear quadratic tracking (LQT) control problem for time-delay systems. Starting from a formulation of the time-delay control problem, a model-based reinforcement learning framework is constructed. On this basis, to avoid using model parameters and state data, a Smith predictor is introduced and an on-policy Q-learning LQT control algorithm for time-delay systems is proposed. Because probing noise biases the learning result of the on-policy Q-learning algorithm, an off-policy formulation is then adopted: the Bellman equation used in Q-learning is modified and a data-driven off-policy Q-learning algorithm is proposed, which is unaffected by probing noise and yields unbiased solutions. Theoretical analysis and simulation experiments show that tracking control of the time-delay system is achieved effectively without relying on the system's model parameters or state data.
Keywords: time-delay systems; reinforcement learning; off-policy; data-driven; output feedback
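To illustrate the kind of data-driven off-policy Q-learning procedure described in this abstract, the sketch below runs least-squares policy iteration on a toy LQT problem. It is a minimal illustration under assumed system matrices, reference model, discount factor and noise levels; the control-input time delay and the Smith-predictor construction that are the paper's focus are deliberately omitted.

```python
import numpy as np

# Minimal off-policy Q-learning sketch for linear quadratic tracking (LQT).
# Illustrative only: all matrices, gamma and noise levels are assumptions,
# and the time-delay / Smith-predictor handling of the paper is not modeled.
np.random.seed(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
F = np.array([[1.0]])                    # reference generator: r_{k+1} = F r_k
Qy, Ru, gamma = 1.0, 0.1, 0.95
n, m = 2, 1
nz = n + 1                               # augmented state z = [x; r]

def step(z, u):
    x_next = A @ z[:n] + (B * u).ravel()
    return np.concatenate([x_next, F @ z[n:]])

def stage_cost(z, u):
    e = (C @ z[:n] - z[n:]).item()       # tracking error y - r
    return Qy * e**2 + Ru * u**2

def phi(z, u):
    w = np.concatenate([z, [u]])
    return np.outer(w, w)[np.triu_indices(nz + m)]   # quadratic basis

# 1) Collect data once with an exploratory behaviour policy (off-policy setting).
K = np.zeros((m, nz))                    # target policy gain, u = -K z
data, z = [], np.array([1.0, 0.0, 1.0])
for _ in range(300):
    u = (-K @ z).item() + 0.5 * np.random.randn()
    z_next = step(z, u)
    data.append((z, u, stage_cost(z, u), z_next))
    z = z_next

# 2) Off-policy policy iteration: evaluate the current target policy on the
#    same fixed data set, then improve greedily from the estimated Q-matrix H.
for _ in range(15):
    Phi = np.array([phi(zk, uk) - gamma * phi(zk1, (-K @ zk1).item())
                    for zk, uk, _, zk1 in data])
    c = np.array([ck for _, _, ck, _ in data])
    theta, *_ = np.linalg.lstsq(Phi, c, rcond=None)
    H = np.zeros((nz + m, nz + m))
    H[np.triu_indices(nz + m)] = theta
    H = (H + H.T) / 2                    # recover the symmetric Q-matrix
    K = np.linalg.solve(H[nz:, nz:], H[nz:, :nz])   # u = -H_uu^{-1} H_uz z

print("learned tracking gain K:", K)
```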
2. An off-policy model-free output-data-feedback H∞ control method (cited 6 times)
Authors: 李臻, 范家璐, 姜艺, 柴天佑. 《自动化学报》 (EI, CAS, CSCD, PKU Core), 2021, No. 9, pp. 2182-2193 (12 pages)
For the regulation problem of model-unknown linear discrete-time systems subject to disturbances, an off-policy H∞ control method based on input-output data feedback is proposed. Starting from a state-feedback online learning algorithm, and because state data are difficult to measure during system operation, an augmented data vector is introduced to convert the state-feedback policy-iteration online learning algorithm into an input-output data-feedback online learning algorithm. Further, by introducing auxiliary terms, the input-output data-feedback policy-iteration algorithm is converted into a model-free input-output data-feedback off-policy learning algorithm. The algorithm learns the optimal output-feedback policy from historical input-output data, thereby avoiding the frequent interaction with the real environment that on-policy algorithms require. In addition, compared with on-policy algorithms, the off-policy learning algorithm is immune to the effect of learning noise, so the learning result converges to the theoretical optimum. Finally, simulation experiments verify the convergence of the learning algorithm.
Keywords: H∞ control; reinforcement learning; off-policy; data-driven
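For context, output-feedback H∞ designs of this kind are usually grounded in the discrete-time two-player zero-sum game formulation sketched below. This is a standard textbook statement with assumed notation; the paper's augmented input-output data vector and auxiliary terms are not reproduced here.

$$
V(x_k)=\min_{u_k}\max_{w_k}\Big\{x_k^{\top}Qx_k+u_k^{\top}Ru_k-\gamma^{2}w_k^{\top}w_k+V(x_{k+1})\Big\},\qquad x_{k+1}=Ax_k+Bu_k+Ew_k,
$$

where the control $u_k$ minimizes, the disturbance $w_k$ maximizes, and $\gamma$ is the prescribed attenuation level.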
3. Computational intelligence interception guidance law using online off-policy integral reinforcement learning (cited 1 time)
Authors: WANG Qi, LIAO Zhizhong. Journal of Systems Engineering and Electronics (SCIE, CSCD), 2024, No. 4, pp. 1042-1052 (11 pages)
The missile interception problem can be regarded as a two-person zero-sum differential game, which depends on the solution of the Hamilton-Jacobi-Isaacs (HJI) equation. It has been proved impossible to obtain a closed-form solution due to the nonlinearity of the HJI equation, and many iterative algorithms have been proposed to solve it. The simultaneous policy updating algorithm (SPUA) is an effective algorithm for solving the HJI equation, but it is an on-policy integral reinforcement learning (IRL) method: for online implementation of SPUA, the disturbance signal must be adjustable, which is unrealistic. In this paper, an off-policy IRL algorithm based on SPUA is proposed that makes no use of any knowledge of the system dynamics. A neural-network-based online adaptive critic implementation scheme of the off-policy IRL algorithm is then presented. Based on the online off-policy IRL method, a computational intelligence interception guidance (CIIG) law is developed for intercepting highly maneuvering targets. As a model-free method, interception is achieved by measuring system data online. The effectiveness of the CIIG law is verified through two missile-target engagement scenarios.
Keywords: two-person zero-sum differential games; Hamilton-Jacobi-Isaacs (HJI) equation; off-policy integral reinforcement learning (IRL); online learning; computational intelligence interception guidance (CIIG) law
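For reference, the HJI equation mentioned above takes the following standard form for dynamics $\dot{x}=f(x)+g(x)u+k(x)w$ with cost integrand $Q(x)+u^{\top}Ru-\gamma^{2}w^{\top}w$. Generic notation is assumed; this is the textbook two-player zero-sum statement, not the paper's specific interception formulation.

$$
0=\nabla V^{\top}\big(f+gu^{*}+kw^{*}\big)+Q(x)+u^{*\top}Ru^{*}-\gamma^{2}w^{*\top}w^{*},\qquad
u^{*}=-\tfrac{1}{2}R^{-1}g^{\top}\nabla V,\qquad
w^{*}=\tfrac{1}{2\gamma^{2}}k^{\top}\nabla V.
$$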
4. On-Policy and Off-Policy Value Iteration Algorithms for Stochastic Zero-Sum Dynamic Games
Authors: GUO Liangyuan, WANG Bing-Chang, ZHANG Ji-Feng. Journal of Systems Science & Complexity, 2025, No. 1, pp. 421-435 (15 pages)
This paper considers value iteration algorithms for stochastic zero-sum linear quadratic games with unknown dynamics. On-policy and off-policy learning algorithms are developed to solve the stochastic zero-sum games without requiring the system dynamics. By analyzing the value function iterations, the convergence of the model-based algorithm is shown, and the equivalence of several types of value iteration algorithms is established. The effectiveness of the model-free algorithms is demonstrated by a numerical example.
Keywords: approximate dynamic programming; on-policy; off-policy; stochastic zero-sum games; value iteration
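As a point of reference for the value-iteration algorithms discussed above, the sketch below runs the model-based game Riccati recursion for a deterministic zero-sum LQ game. All matrices and the attenuation level are assumptions, and the paper's stochastic noise terms and its data-driven on-/off-policy variants are not reproduced.

```python
import numpy as np

# Illustrative model-based value iteration for a zero-sum LQ dynamic game.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])      # minimizing player's (control) input
D = np.array([[0.0], [0.05]])     # maximizing player's (disturbance) input
Q, R, gam = np.eye(2), np.array([[1.0]]), 5.0

def game_step(P):
    """One sweep of the zero-sum game Riccati recursion."""
    M = np.block([[R + B.T @ P @ B, B.T @ P @ D],
                  [D.T @ P @ B, D.T @ P @ D - gam**2 * np.eye(1)]])
    N = np.vstack([B.T @ P @ A, D.T @ P @ A])
    return Q + A.T @ P @ A - N.T @ np.linalg.solve(M, N), M, N

P = np.zeros((2, 2))
for _ in range(1000):
    P_next, M, N = game_step(P)
    if np.linalg.norm(P_next - P) < 1e-9:
        break
    P = P_next

KL = -np.linalg.solve(M, N)       # stacked saddle-point gains: [u; w] = KL @ x
print("value matrix P:\n", P, "\nsaddle-point gains:\n", KL)
```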
5. Research on optimal control algorithms for singularly perturbed systems based on inverse reinforcement learning
Authors: 沈敏胤, 刘飞. 《计算机测量与控制》, 2025, No. 12, pp. 96-104 (9 pages)
For the optimal control of singularly perturbed systems with two-time-scale dynamics, an inverse reinforcement learning algorithm that solves the problem directly on the full-order model is presented. Compared with the traditional composite control approach, which separates the original singularly perturbed system into fast and slow time scales, it reduces the complexity of solving the problem. A model-based policy-iteration inverse reinforcement learning algorithm is first designed, which reconstructs the unknown cost function from the system dynamics and the optimal control gain. On this basis, a model-free off-policy inverse reinforcement learning algorithm is adopted that relies only on the optimal behavior data exhibited by the system: without prior knowledge of the system dynamics or the optimal control gain, it accurately reconstructs the cost function so that the system can track and learn the optimal behavior, and it achieves unbiased estimation even in the presence of probing noise. Simulation examples verify the effectiveness of the method.
Keywords: singularly perturbed systems; inverse reinforcement learning; optimal control; off-policy; data-driven control
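To make the model-based step above concrete in the plain (non-singularly-perturbed) continuous-time LQR setting, inverse reinforcement learning can be posed as: given an observed optimal gain $K^{*}$ and a chosen input weight $R$, find $P\succ 0$ and $Q\succeq 0$ satisfying the conditions below. This is a standard inverse-optimal-control statement with assumed notation; the paper's two-time-scale structure and its off-policy variant are not reproduced.

$$
B^{\top}P=RK^{*},\qquad A^{\top}P+PA-K^{*\top}RK^{*}+Q=0 .
$$

Any such pair $(P,Q)$ makes $u=-K^{*}x$ optimal for the cost $\int_0^\infty\big(x^{\top}Qx+u^{\top}Ru\big)\,\mathrm{d}t$.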
6. Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation
Authors: Shuo Cao, Xuesong Wang, Yuhu Cheng. IEEE/CAA Journal of Automatica Sinica (CSCD), 2024, No. 12, pp. 2497-2511 (15 pages)
To alleviate the extrapolation error and instability inherent in a Q-function learned directly by off-policy Q-learning (QL-style) on static datasets, this article uses the on-policy state-action-reward-state-action (SARSA-style) update to develop an offline reinforcement learning (RL) method termed robust offline actor-critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style update. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of the on-policy SARSA-style update, facilitating a stable estimated Q-function. Even in the presence of sampling errors caused by limited data, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed; the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC rapidly learns robust and effective task-solving policies owing to its stable Q-value estimates, outperforming state-of-the-art offline RL methods by at least 15%.
Keywords: offline reinforcement learning; off-policy QL-style; on-policy SARSA-style; policy evaluation (PE); Q-value estimation
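The abstract contrasts off-policy QL-style and on-policy SARSA-style bootstrap targets; the toy snippet below only illustrates that contrast on a tabular batch by blending the two targets with an assumed weight. It is a didactic sketch, not the OPRAC objective, which additionally uses an action-matching penalty and function approximation.

```python
import numpy as np

# Contrast of the two policy-evaluation targets named in the abstract on a tiny
# tabular batch. The weight `lam`, learning rate, and toy data are assumptions.
gamma, lam = 0.99, 0.5
Q = np.zeros((4, 2))                               # 4 states, 2 actions
batch = [(0, 1, 1.0, 2, 0), (2, 0, 0.0, 3, 1)]     # (s, a, r, s', a'_dataset)

for s, a, r, s_next, a_next in batch:
    ql_target    = r + gamma * Q[s_next].max()     # off-policy QL-style bootstrap
    sarsa_target = r + gamma * Q[s_next, a_next]   # on-policy SARSA-style bootstrap
    # Temper the optimistic QL-style target with the conservative SARSA-style one.
    Q[s, a] += 0.1 * ((1 - lam) * ql_target + lam * sarsa_target - Q[s, a])
```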
7. A reinforcement learning approach for thermostat setpoint preference learning (cited 2 times)
Authors: Hussein Elehwany, Mohamed Ouf, Burak Gunay, Nunzio Cotrufo, Jean-Simon Venne. Building Simulation (SCIE, EI, CSCD), 2024, No. 1, pp. 131-146 (16 pages)
Occupant-centric control (OCC) is an indoor climate control approach in which occupant feedback is used in the sequence of operations of building energy systems. While OCC has been used in a wide range of building applications, one OCC category that has received considerable research interest is learning occupants' thermal preferences from their thermostat interactions and adapting temperature setpoints accordingly. Many recent studies have used reinforcement learning (RL) as an agent for OCC to optimize energy use and occupant comfort, but they relied on predicted mean vote (PMV) models or constant comfort ranges to represent comfort, and only a few used thermostat interactions. This paper addresses this gap by introducing a new off-policy RL algorithm that imitates occupant behaviour by using unsolicited occupant thermostat overrides. The algorithm is tested with a number of synthetically generated occupant behaviour models implemented via the Python API of EnergyPlus. The simulation results indicate that the RL algorithm can rapidly learn preferences for all tested occupant behaviour scenarios with minimal exploration events. While substantial energy savings were observed in most occupant scenarios, the impact on energy savings varied depending on occupants' preferences and the stochasticity of their thermostat use behaviour.
Keywords: reinforcement learning; preference learning; occupant-centric controls; smart thermostats; off-policy learning
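As a rough illustration of learning a setpoint preference from unsolicited overrides, the toy sketch below treats setpoint selection as a single-state, bandit-style value update in which an override yields a negative reward. The occupant model, reward shaping, and EnergyPlus coupling used in the paper are not reproduced, and the hidden preference value is an assumption.

```python
import numpy as np

# Toy sketch: learn a preferred thermostat setpoint from stochastic overrides.
np.random.seed(1)
setpoints = np.arange(20, 26)        # candidate setpoints in deg C (assumed range)
Q = np.zeros(len(setpoints))
true_pref = 23                       # hidden occupant preference (assumed)

for episode in range(300):
    # Epsilon-greedy choice of the next setpoint.
    a = np.random.randint(len(setpoints)) if np.random.rand() < 0.1 else Q.argmax()
    sp = setpoints[a]
    # The occupant overrides more often the further the setpoint is from the preference.
    override = np.random.rand() < min(1.0, 0.4 * abs(sp - true_pref))
    reward = -1.0 if override else 0.1
    Q[a] += 0.1 * (reward - Q[a])    # single-state value update (no bootstrapping)

print("learned preferred setpoint:", setpoints[Q.argmax()])
```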
8. Optimal synchronization control for multi-agent systems with input saturation: a nonzero-sum game (cited 1 time)
Authors: Hongyang LI, Qinglai WEI. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2022, No. 7, pp. 1010-1019 (10 pages)
This paper presents a novel optimal synchronization control method for multi-agent systems with input saturation. Multi-agent game theory is introduced to transform the optimal synchronization control problem into a multi-agent nonzero-sum game; the Nash equilibrium can then be achieved by solving the coupled Hamilton-Jacobi-Bellman (HJB) equations with nonquadratic input-energy terms. A novel off-policy reinforcement learning method is presented to obtain the Nash equilibrium solution without the system models, and critic and actor neural networks (NNs) are introduced to implement the presented method. Theoretical analysis shows that the iterative control laws converge to the Nash equilibrium, and simulation results show the good performance of the presented method.
Keywords: optimal synchronization control; multi-agent systems; nonzero-sum game; adaptive dynamic programming; input saturation; off-policy reinforcement learning; policy iteration
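For context, the nonquadratic input-energy term referred to above is commonly constructed as the following integral, the standard choice in saturated-input adaptive dynamic programming with saturation bound $\lambda$; generic notation is assumed here, and it is not necessarily the exact term used in the paper.

$$
W(u)=2\int_{0}^{u}\lambda\tanh^{-1}\!\big(v/\lambda\big)^{\top}R\,\mathrm{d}v,
$$

which keeps the minimizing control inside the bound, since the resulting policies take the form $u=-\lambda\tanh\!\big(\tfrac{1}{2\lambda}R^{-1}g^{\top}\nabla V\big)$.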