Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the empirical successes of policy-based algorithms. In this work, we consider …