OPTune: Efficient Online Preference Tuning

Reinforcement learning with human feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preferences. Compared to the widely studied offline version of RLHF, e.g., direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new …
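As background (not restated from this abstract), the offline DPO objective referenced above is commonly written as follows, where $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ a frozen reference model, $\beta$ a temperature, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
\]

In the offline setting this loss is computed over a fixed preference dataset $\mathcal{D}$; in the online variants discussed here, the responses are sampled from the current policy as training proceeds, which is the source of the on-the-fly generation cost the abstract refers to.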