Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language
Models
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation …