Countering Reward Over-optimization in LLM with Demonstration-Guided
Reinforcement Learning
While Reinforcement Learning (RL) has proven essential for fine-tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Moreover, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: …