Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Type: Article

Publication Date: 2019-07-17

Citations: 59

DOI: https://doi.org/10.1609/aaai.v33i01.33013647

Abstract

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.
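The abstract describes the method only at a high level. As an illustration, below is a minimal tabular sketch in Python of what a discounted COP-TD-style correction could look like: a ratio c(s) approximating d_pi(s)/d_mu(s) is learned online with a discounted target and then used to reweight an ordinary TD(0) value update. The specific target form (1 − γ̂) + γ̂·ρ·c(s), the step sizes, and the function name discounted_cop_td_step are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def discounted_cop_td_step(c, V, s, a, r, s_next, pi, mu,
                           gamma=0.99,      # return discount for the value TD update
                           gamma_hat=0.95,  # hypothetical COP-TD discount factor
                           alpha_c=0.01,    # step size for the ratio estimate
                           alpha_v=0.05):   # step size for the value estimate
    """One online update of the ratio estimate c and the value estimate V
    from a single transition (s, a, r, s_next) generated by the behavior policy mu."""
    rho = pi[s, a] / mu[s, a]  # per-step importance ratio

    # Discounted COP-TD-style target for the ratio at the next state:
    # blend the propagated ratio toward 1 (assumed form of the discounting).
    c_target = (1.0 - gamma_hat) + gamma_hat * rho * c[s]
    c[s_next] += alpha_c * (c_target - c[s_next])

    # Reweight the ordinary TD(0) update by the estimated covariate-shift ratio.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * c[s] * td_error
    return c, V

if __name__ == "__main__":
    # Toy 3-state, 2-action example; all numbers are illustrative.
    c, V = np.ones(3), np.zeros(3)
    pi = np.full((3, 2), 0.5)                    # target policy (uniform)
    mu = np.tile(np.array([0.9, 0.1]), (3, 1))   # behavior policy (skewed)
    c, V = discounted_cop_td_step(c, V, s=0, a=1, r=1.0, s_next=2, pi=pi, mu=mu)
    print("ratio estimates:", c, "values:", V)
```

The soft normalization penalty mentioned in the abstract would, under the same reading, replace an explicit projection of c onto the probability simplex with an additional loss term keeping the ratio estimates near mean one under the behavior distribution; that variant is omitted from this sketch.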

Locations

  • Proceedings of the AAAI Conference on Artificial Intelligence
  • arXiv (Cornell University)

Similar Works

  • Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift (2019). Carles Gelada, Marc G. Bellemare.
  • Adaptive Trade-Offs in Off-Policy Learning (2019). Mark Rowland, Will Dabney, Rémi Munos.
  • Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning (2023). Akash Velu, Skanda Vaidyanath, Dilip Arumugam.
  • Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning (2023). Melrose Roderick, Gaurav Manek, Felix Berkenkamp, J. Zico Kolter.
  • CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning (2024). Zeyuan Liu, Kai Yang, Xiu Li.
  • Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach (2022). Baturay Sağlam, Dogan C. Cicek, Furkan B. Mutlu, Süleyman S. Kozat.
  • Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning (2023). J. Markowitz, Jesse L. Silverberg, Gary S. Collins.
  • Return-based Scaling: Yet Another Normalisation Trick for Deep RL (2021). Tom Schaul, Georg Ostrovski, Iurii Kemaev, Diana Borsa.
  • Divergence-Augmented Policy Optimization (2025). Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang.
  • Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (2017). Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine.
  • Value-aware Importance Weighting for Off-policy Reinforcement Learning (2023). Kristopher De Asis, Eric Graves, Richard S. Sutton.
  • Conservative Q-Learning for Offline Reinforcement Learning (2020). Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine.
  • Generalized Proximal Policy Optimization with Sample Reuse (2021). James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras.
  • Offline Reinforcement Learning with On-Policy Q-Function Regularization (2023). Laixi Shi, Robert Dadashi, Yuejie Chi, Pablo Samuel Castro, Matthieu Geist.
  • Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks (2023). Ryan Sullivan, Akarsh Kumar, Sheng‐Yi Huang, John P. Dickerson, Joseph Suárez.
  • Combining policy gradient and Q-learning (2016). Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih.