A Common Pitfall of Margin-based Language Model Alignment: Gradient
Entanglement
Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based …
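As a concrete illustration of such a margin-based objective (a sketch in notation of our own choosing, not necessarily the paper's), the DPO loss scores a preference pair only through the gap between the implicit rewards of the preferred response $y_w$ and the dispreferred response $y_l$:

\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(m_\theta(x, y_w, y_l)\big)\Big],
\qquad
m_\theta(x, y_w, y_l) \;=\; \beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right),
\]

where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a fixed reference model, $\beta$ a temperature, and $\sigma$ the sigmoid. Because the loss depends on the two responses only through the margin $m_\theta$, it constrains their difference but not the individual log-probabilities, which is the property the paper's pitfall analysis targets.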