Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing, it cannot easily be …