Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on
Efficient Data Utilization
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most existing theoretical studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider …