Sycophancy to Subterfuge: Investigating Reward-Tampering in Large
Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious …