Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Type: Article

Publication Date: 2019-07-17

Citations: 59

DOI: https://doi.org/10.1609/aaai.v33i01.33013647

Abstract

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.
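The abstract describes the method only at a high level. As an illustration, below is a minimal tabular sketch in Python of what a discounted COP-TD-style correction could look like: a ratio c(s) approximating d_pi(s)/d_mu(s) is learned online with a discounted target and then used to reweight an ordinary TD(0) value update. The specific target form (1 − γ̂) + γ̂·ρ·c(s), the step sizes, and the function name discounted_cop_td_step are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def discounted_cop_td_step(c, V, s, a, r, s_next, pi, mu,
                           gamma=0.99,      # return discount for the value TD update
                           gamma_hat=0.95,  # hypothetical COP-TD discount factor
                           alpha_c=0.01,    # step size for the ratio estimate
                           alpha_v=0.05):   # step size for the value estimate
    """One online update of the ratio estimate c and the value estimate V
    from a single transition (s, a, r, s_next) generated by the behavior policy mu."""
    rho = pi[s, a] / mu[s, a]  # per-step importance ratio

    # Discounted COP-TD-style target for the ratio at the next state:
    # blend the propagated ratio toward 1 (assumed form of the discounting).
    c_target = (1.0 - gamma_hat) + gamma_hat * rho * c[s]
    c[s_next] += alpha_c * (c_target - c[s_next])

    # Reweight the ordinary TD(0) update by the estimated covariate-shift ratio.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * c[s] * td_error
    return c, V

if __name__ == "__main__":
    # Toy 3-state, 2-action example; all numbers are illustrative.
    c, V = np.ones(3), np.zeros(3)
    pi = np.full((3, 2), 0.5)                    # target policy (uniform)
    mu = np.tile(np.array([0.9, 0.1]), (3, 1))   # behavior policy (skewed)
    c, V = discounted_cop_td_step(c, V, s=0, a=1, r=1.0, s_next=2, pi=pi, mu=mu)
    print("ratio estimates:", c, "values:", V)
```

The soft normalization penalty mentioned in the abstract would, under the same reading, replace an explicit projection of c onto the probability simplex with an additional loss term keeping the ratio estimates near mean one under the behavior distribution; that variant is omitted from this sketch.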

Locations

  • Proceedings of the AAAI Conference on Artificial Intelligence
  • arXiv (Cornell University)

Similar Works

  • Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift (2019). Carles Gelada, Marc G. Bellemare.
  • Adaptive Trade-Offs in Off-Policy Learning (2019). Mark Rowland, Will Dabney, Rémi Munos.
  • Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning (2023). Akash Velu, Skanda Vaidyanath, Dilip Arumugam.
  • Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning (2023). Melrose Roderick, Gaurav Manek, Felix Berkenkamp, J. Zico Kolter.
  • CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning (2024). Zeyuan Liu, Kai Yang, Xiu Li.
  • Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach (2022). Baturay Sağlam, Dogan C. Cicek, Furkan B. Mutlu, Süleyman S. Kozat.
  • Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning (2023). J. Markowitz, Jesse L. Silverberg, Gary S. Collins.
  • Return-based Scaling: Yet Another Normalisation Trick for Deep RL (2021). Tom Schaul, Georg Ostrovski, Iurii Kemaev, Diana Borsa.
  • Divergence-Augmented Policy Optimization (2025). Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang.
  • Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (2017). Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine.
  • Value-aware Importance Weighting for Off-policy Reinforcement Learning (2023). Kristopher De Asis, Eric Graves, Richard S. Sutton.
  • Conservative Q-Learning for Offline Reinforcement Learning (2020). Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine.
  • Generalized Proximal Policy Optimization with Sample Reuse (2021). James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras.
  • Offline Reinforcement Learning with On-Policy Q-Function Regularization (2023). Laixi Shi, Robert Dadashi, Yuejie Chi, Pablo Samuel Castro, Matthieu Geist.
  • Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks (2023). Ryan Sullivan, Akarsh Kumar, Sheng‐Yi Huang, John P. Dickerson, Joseph Suárez.
  • Combining policy gradient and Q-learning (2016). Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih.