
Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation


Abstract

To alleviate the extrapolation error and instability inherent in Q-functions learned directly by off-policy Q-learning (QL-style) on static datasets, this article uses on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of off-policy QL-style. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style, thus facilitating the acquisition of a stable estimated Q-function. Even with limited data sampling errors, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC can rapidly learn robust and effective task-solving policies owing to the stable estimate of Q-value, outperforming state-of-the-art offline RL methods by at least 15%.
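The abstract does not give the OPRAC loss itself, so the following is only an illustrative sketch of the general idea it describes: a QL-style temporal-difference term regularized by an on-policy (SARSA-style) Q-function and by a penalty for mismatch between the policy's action and the dataset's next action. The function names, the squared-error form of each term, and the weights `alpha` and `beta` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def ql_target(r, q_next_pi, done, gamma=0.99):
    # QL-style target: bootstrap from the learned policy's action pi(s').
    return r + gamma * (1.0 - done) * q_next_pi

def sarsa_target(r, q_on_next_data, done, gamma=0.99):
    # SARSA-style target: bootstrap from the dataset's own next action a'.
    return r + gamma * (1.0 - done) * q_on_next_data

def regularized_critic_loss(q_sa, q_on_sa, r, done, q_next_pi,
                            a_pi_next, a_data_next,
                            gamma=0.99, alpha=1.0, beta=1.0):
    # Hypothetical loss (assumed form, not taken from the paper):
    #   td        pulls Q(s, a) toward the off-policy QL-style target,
    #   anchor    pulls Q(s, a) toward the on-policy SARSA-style Q-function,
    #   mismatch  penalizes distance between pi(s') and the dataset action a'.
    td = np.mean((q_sa - ql_target(r, q_next_pi, done, gamma)) ** 2)
    anchor = np.mean((q_sa - q_on_sa) ** 2)
    mismatch = np.mean(np.sum((a_pi_next - a_data_next) ** 2, axis=-1))
    return td + alpha * anchor + beta * mismatch

# Toy batch of two transitions; the second one is terminal.
q_sa = np.array([1.0, 2.0])          # Q(s, a) from the off-policy critic
q_on_sa = np.array([1.0, 1.0])       # Q(s, a) from the on-policy critic
r = np.array([0.0, 1.0])
done = np.array([0.0, 1.0])
q_next_pi = np.array([2.0, 5.0])     # Q(s', pi(s'))
a_pi_next = np.array([[0.0, 0.0], [1.0, 0.0]])
a_data_next = np.array([[0.0, 0.0], [0.0, 0.0]])

loss = regularized_critic_loss(q_sa, q_on_sa, r, done, q_next_pi,
                               a_pi_next, a_data_next, gamma=0.5)
print(loss)
```

In this sketch the on-policy critic would be fit separately against `sarsa_target`, which only ever queries actions present in the dataset; that is the sense in which a SARSA-style estimate avoids extrapolating to unseen actions.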

Shuo Cao; Xuesong Wang; Yuhu Cheng

Engineering Research Center of Intelligent Control for Underground Space, Ministry of Education, and the School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China

Offline reinforcement learning; off-policy QL-style; on-policy SARSA-style; policy evaluation (PE); Q-value estimation

IEEE/CAA Journal of Automatica Sinica (自动化学报(英文版)), 2024, No. 12

pp. 2497-2511 (15 pages)

This work was supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).

10.1109/JAS.2024.124494
