Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation …
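To make the contrast concrete, below is a minimal illustrative sketch of the two regimes the abstract describes: a synchronous on-policy loop that alternates generation, reward labelling, and learning in lockstep, versus an asynchronous split in which generation runs concurrently with training, so the learner may consume samples from a slightly stale policy. The function names (`generate`, `reward_model`, `policy_update`) are hypothetical placeholders, not the paper's actual implementation or API.

```python
# Hedged sketch: synchronous on-policy RLHF vs. asynchronous off-policy RLHF.
# All functions below are illustrative stand-ins, not the paper's code.
import queue
import threading


def generate(policy_version: int, prompt: str) -> str:
    # Placeholder: sample a completion from the current LLM policy.
    return f"completion from policy v{policy_version} for: {prompt}"


def reward_model(completion: str) -> float:
    # Placeholder: score a completion with a learned reward model.
    return float(len(completion) % 7)


def policy_update(version: int, batch: list) -> int:
    # Placeholder: one gradient step on (completion, reward) pairs.
    return version + 1


def synchronous_rlhf(prompts, steps=3):
    """On-policy loop: the learner waits for fresh generations every step."""
    version = 0
    for _ in range(steps):
        batch = [(c := generate(version, p), reward_model(c)) for p in prompts]
        version = policy_update(version, batch)
    return version


def asynchronous_rlhf(prompts, steps=3):
    """Generation and learning overlap; samples may be one policy version stale."""
    sample_queue: queue.Queue = queue.Queue()
    version = 0

    def generator():
        # Keeps producing labelled samples from whatever policy version it sees.
        for _ in range(steps):
            for p in prompts:
                c = generate(version, p)
                sample_queue.put((c, reward_model(c)))

    t = threading.Thread(target=generator)
    t.start()
    for _ in range(steps):
        batch = [sample_queue.get() for _ in prompts]  # possibly off-policy
        version = policy_update(version, batch)        # trains while generating
    t.join()
    return version


if __name__ == "__main__":
    prompts = ["explain RLHF", "summarize the abstract"]
    print("sync final version:", synchronous_rlhf(prompts))
    print("async final version:", asynchronous_rlhf(prompts))
```

In the asynchronous variant, the learner never blocks on generation, which is the source of the speedup the abstract points to; the cost is that training happens on samples produced by earlier policy iterations, i.e. online but off-policy learning.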