Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a
supervised-friendly fashion
Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) with a reward model trained from preference data, to better align them with human judgment. The recently introduced direct alignment methods, which are often simpler, more stable, and computationally lighter, achieve this more directly. However, these approaches cannot …