Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Type: Preprint

Publication Date: 2024-02-15

Citations: 1

DOI: https://doi.org/10.48550/arxiv.2402.10342

Abstract

Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical success while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Moreover, most recent theoretical work focuses on value-based algorithms despite the empirical successes of policy-based methods. In this work, we study an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, the reward function is not assumed to be known; instead, the algorithm infers it from trajectory-based comparison feedback. We provide performance bounds for PO-RLHF with low query complexity, offering insight into why a small amount of human feedback may suffice for good performance with RLHF. A key novelty is our trajectory-level elliptical potential analysis, used to bound the error in the inferred reward parameters when the learner observes comparison queries rather than reward values. We provide and analyze algorithms in two settings: PG-RLHF for linear function approximation and NN-PG-RLHF for neural function approximation.
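The core step the abstract describes, inferring a reward function from trajectory-based comparison feedback, can be sketched with a standard preference model. The snippet below is a minimal illustration, not the paper's algorithm: it assumes a linear reward over trajectory-level features and a Bradley-Terry comparison model, and fits the parameters by gradient ascent on the logistic log-likelihood. All function and variable names here are hypothetical.

```python
import numpy as np

def fit_reward_from_comparisons(pairs, prefs, lr=0.5, iters=300):
    """Estimate a linear reward parameter theta from pairwise trajectory
    comparisons under a Bradley-Terry model:
        P(traj_a preferred to traj_b) = sigmoid(theta . (phi_a - phi_b)),
    where phi is a trajectory-level feature (e.g. summed step features).
    `pairs` is a list of (phi_a, phi_b); `prefs` holds 1.0 if a was
    preferred, else 0.0."""
    dim = pairs[0][0].shape[0]
    theta = np.zeros(dim)
    for _ in range(iters):
        grad = np.zeros(dim)
        for (phi_a, phi_b), y in zip(pairs, prefs):
            diff = phi_a - phi_b
            p = 1.0 / (1.0 + np.exp(-np.clip(theta @ diff, -30.0, 30.0)))
            grad += (y - p) * diff  # gradient of the logistic log-likelihood
        theta += lr * grad / len(pairs)
    return theta

# Synthetic check: comparisons generated by a known parameter vector.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5, 0.3, 0.0])
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(200)]
prefs = [float((a - b) @ theta_true > 0) for a, b in pairs]
theta_hat = fit_reward_from_comparisons(pairs, prefs)
```

With noiseless synthetic preferences, the recovered direction of `theta_hat` aligns closely with `theta_true`; PO-RLHF's contribution is showing how few such comparison queries are needed when the queried trajectories are collected by an exploratory policy cover.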

Locations

  • arXiv (Cornell University)

Similar Works

  • PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning (2020) - Alekh Agarwal, Mikael Henaff, Sham M. Kakade, Wen Sun
  • Preference-Guided Reinforcement Learning for Efficient Exploration (2024) - Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen, Lin Zhao
  • Reward-Free Exploration for Reinforcement Learning (2020) - Chi Jin, Akshay Krishnamurthy, Max Simchowitz, Tiancheng Yu
  • Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation (2022) - Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang
  • Making RL with Preference-based Feedback Efficient via Randomization (2023) - Runzhe Wu, Wen Sun
  • Trajectory-Oriented Policy Optimization with Sparse Rewards (2024) - Guojian Wang, Faguo Wu, Xiao Zhang
  • Efficient Online Reinforcement Learning with Offline Data (2023) - Philip Ball, Laura Smith, Ilya Kostrikov, Sergey Levine
  • Information Directed Reward Learning for Reinforcement Learning (2021) - David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
  • Human-Inspired Framework to Accelerate Reinforcement Learning (2023) - Ali Beikmohammadi, Sindri Magnússon
  • Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces (2019) - Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow
  • On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game (2021) - Shuang Qiu, Jieping Ye, Zhaoran Wang, Zhuoran Yang
  • Query-Policy Misalignment in Preference-Based Reinforcement Learning (2023) - Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing‐Shan Jia, Ya-Qin Zhang
  • Reflective Policy Optimization (2024) - Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing
  • Offline Prioritized Experience Replay (2023) - Yue Yang, Bingyi Kang, Xiao Ma, Gao Huang, Shiji Song, Shuicheng Yan
  • Bi-Level Offline Policy Optimization with Limited Exploration (2023) - Wenzhuo Zhou

Works That Cite This (0)


Works Cited by This (0)
