智能系统学报 2025, Vol. 20, Issue (5): 1093-1102. DOI: 10.11992/tis.202410016
基于自适应分位数的离线强化学习算法
Offline reinforcement learning with adaptive quantile
Abstract
Offline reinforcement learning aims to reduce the high cost of environmental interaction by learning effective policies solely from precollected offline datasets. However, the absence of interactive feedback can cause a distribution shift between the learned policy and the offline dataset, leading to increased extrapolation errors. Most existing methods address this problem with policy constraints or imitation learning, but these often result in overly conservative policies. To address these problems, an adaptive quantile-based method is proposed. Building upon double Q-estimation, the relationship between the two Q-estimates is further analyzed, and their difference is used to assess overestimation of out-of-distribution actions. The quantile is then adaptively adjusted to correct the overestimation bias. Additionally, a quantile advantage is introduced, which serves as a weight for the policy constraint term, balancing exploration and imitation to reduce policy conservativeness. Finally, the proposed approach is validated on the D4RL dataset, where it achieves excellent performance across multiple tasks, showing its potential for broad application in various scenarios.
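As a rough illustration of the mechanism the abstract describes, the sketch below combines twin Q-estimates at a quantile that adapts to their gap, and weights a behavior-cloning term by a quantile advantage. All names, hyperparameters, and the specific adaptation rule (adaptive_quantile_q, base_tau, k, the tanh mapping, the sigmoid weight) are illustrative assumptions for this sketch, not the paper's actual formulation; the intent shown is only that a larger disagreement between the twin critics yields a more pessimistic value estimate for likely out-of-distribution actions.

```python
import torch


def adaptive_quantile_q(q1: torch.Tensor, q2: torch.Tensor,
                        base_tau: float = 0.75, k: float = 1.0) -> torch.Tensor:
    """Sketch: combine twin Q-estimates at an adaptive quantile.

    The gap |q1 - q2| serves as a proxy for out-of-distribution
    overestimation: a larger gap pushes tau toward 1, weighting the
    pessimistic min(q1, q2) more heavily. base_tau and k are
    illustrative hyperparameters, not values from the paper.
    """
    gap = (q1 - q2).abs()
    # Normalize the gap within the batch so tau stays in [base_tau, 1).
    norm_gap = gap / (gap.mean().detach() + 1e-6)
    tau = base_tau + (1.0 - base_tau) * torch.tanh(k * norm_gap)
    q_min = torch.minimum(q1, q2)
    q_max = torch.maximum(q1, q2)
    # tau-quantile interpolation between the two estimates.
    return tau * q_min + (1.0 - tau) * q_max


def policy_loss(q_pi: torch.Tensor, quantile_adv: torch.Tensor,
                pi_action: torch.Tensor, data_action: torch.Tensor,
                beta: float = 1.0) -> torch.Tensor:
    """Sketch: quantile advantage as a weight on the policy constraint.

    A higher quantile advantage increases the weight on imitating the
    dataset action (the behavior-cloning term); a lower one lets the
    policy follow the critic more freely. The sigmoid mapping is an
    assumption for illustration.
    """
    w = torch.sigmoid(beta * quantile_adv)
    bc = ((pi_action - data_action) ** 2).sum(-1)
    return (-q_pi + w * bc).mean()
```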
Keywords
offline reinforcement learning/distribution shift/extrapolation error/policy constraint/imitation learning/double Q-estimation/overestimation/quantile
Classification
Information technology and security science
Citation
周娴玮, 王宇翔, 罗仕鑫, 余松森. 基于自适应分位数的离线强化学习算法[J]. 智能系统学报, 2025, 20(5): 1093-1102.
Funding
Major Special Project of Applied Science and Technology Research and Development of Guangdong Province (2016B020244003)
Guangdong Province Enterprise Science and Technology Commissioner Project (GDKTP2020014000)
Guangdong Basic and Applied Basic Research Foundation (2020B1515120089, 2020A1515110783)