Funding: Supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).
Abstract: To alleviate the extrapolation error and instability inherent in the Q-function directly learned by off-policy Q-learning (QL-style) on static datasets, this article exploits on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style. This naturally equips off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style learning, thus facilitating a stable estimate of the Q-function. Even with sampling errors caused by limited data, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC can rapidly learn robust and effective task-solving policies owing to the stable estimate of the Q-value, outperforming state-of-the-art offline RL methods by at least 15%.
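The abstract only outlines the mechanism, so the following is a minimal, hypothetical sketch of what an on-policy regularized policy-evaluation target could look like: an off-policy QL-style bootstrap value is blended with a SARSA-style bootstrap value taken from the dataset's own next action, and a penalty keeps the policy's bootstrap action close to that dataset action. All names, the blending weight `eta`, and the penalty weight `lam` are assumptions for illustration, not the exact OPRAC loss from the paper.

```python
# Illustrative sketch (assumed names/weights), not the paper's exact OPRAC objective.
import torch
import torch.nn.functional as F


def regularized_critic_targets(q_target, policy, batch, gamma=0.99, eta=0.5, lam=1.0):
    """Blend QL-style and SARSA-style bootstrap values and add an action-matching penalty.

    q_target: target critic, q_target(s, a) -> Q-value tensor
    policy:   actor, policy(s) -> action tensor
    batch:    dict with 's', 'a', 'r', 's_next', 'a_next', 'done' tensors
    eta:      weight on the SARSA-style (on-policy) target      [assumed hyperparameter]
    lam:      weight on the action-matching penalty             [assumed hyperparameter]
    """
    with torch.no_grad():
        pi_next = policy(batch["s_next"])                     # QL-style bootstrap action
        q_ql = q_target(batch["s_next"], pi_next)             # off-policy bootstrap value
        q_sarsa = q_target(batch["s_next"], batch["a_next"])  # SARSA-style bootstrap value

        # The SARSA-style part injects the conservatism of staying on the
        # dataset's own actions into the otherwise optimistic QL-style target.
        q_boot = (1.0 - eta) * q_ql + eta * q_sarsa
        target = batch["r"] + gamma * (1.0 - batch["done"]) * q_boot

    # Penalty matching the policy's bootstrap action to the dataset's next action.
    penalty = lam * F.mse_loss(policy(batch["s_next"]), batch["a_next"])
    return target, penalty
```

A critic update would then regress `q(s, a)` toward `target` while adding `penalty` to the actor or critic loss; the precise form and weighting used by OPRAC are specified in the paper itself.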