Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Type: Preprint

Publication Date: 2024-04-16

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2404.10719

Abstract

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
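For context, below is a minimal sketch (not taken from the paper) of the two families of objectives the abstract contrasts: the reward-free DPO loss, which trains directly on preference pairs against a frozen reference model, and PPO's clipped surrogate, which reward-based RLHF applies to advantages derived from a separately learned reward model. Function names, tensor shapes, and hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Reward-free objective: optimize directly on preference pairs,
    with no reward model and no on-policy rollouts.

    Each argument is a 1-D tensor of summed token log-probabilities for a
    batch of (chosen, rejected) responses under the policy / frozen reference.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Negative log-sigmoid of the scaled implicit reward margin.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def ppo_clipped_loss(logps, old_logps, advantages, clip_eps=0.2):
    """Reward-based objective: PPO's clipped surrogate on advantages
    computed from a learned reward model (reward modeling, the value
    function, and GAE are omitted in this sketch).
    """
    ratio = torch.exp(logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Both sketches omit the KL-to-reference regularization that RLHF pipelines typically add, and assume per-response log-probabilities have already been summed over tokens.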

Locations

  • arXiv (Cornell University)

Similar Works

  • Fine-Tuning Language Models with Advantage-Induced Policy Alignment (2023). Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
  • Bootstrapping Language Models with DPO Implicit Rewards (2024). Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
  • A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization (2024). Wenyi Xiao, Zhenning Wang, Leilei Gan, Shuai Zhao, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu
  • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models (2023). Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi‐Quan Luo
  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models (2025). Hu Jian
  • ALaRM: Align Language Models via Hierarchical Rewards Modeling (2024). Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, Zhongyu Wei
  • Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment (2023). Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, Jiantao Jiao
  • Secrets of RLHF in Large Language Models Part I: PPO (2023). Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen
  • SAIL: Self-Improving Efficient Online Alignment of Large Language Models (2024). Mucong Ding, Souradip Chakraborty, V. C. Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Singh Bedi, Furong Huang
  • Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data (2024). Han Xia, Songyang Gao, Qiming Ge, Zhiheng Xi, Qi Zhang, Xuanjing Huang
  • Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model (2024). Qi Gou, Cam-Tu Nguyen
  • Accelerated Preference Optimization for Large Language Model Alignment (2024). Jie He, Huizhuo Yuan, Quanquan Gu
  • $\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs (2024). Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Rong Jin, Xiangnan He
  • Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization (2024). Amir Saeidi, Shivanshu Verma, Aswin RRV, Chitta Baral
  • Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration (2024). Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhan, Anh Tuan Luu
  • Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment (2025). Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, George K. Atia
  • Improving Language Models with Advantage-based Offline Policy Gradients (2023). Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl
  • Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy (2024). Yu Zhu, Chuxiong Sun, Wenfei Yang, Wenqiang Wei, Bo Tang, Tianzhu Zhang, Zhiyu Li, Shifeng Zhang, Feiyu Xiong, Jie Hu
  • WPO: Enhancing RLHF with Weighted Preference Optimization (2024). Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu
  • Offline Regularised Reinforcement Learning for Large Language Models Alignment (2024). Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Ávila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth

Works That Cite This (0)

Works Cited by This (0)