Robust Prompt Optimization for Defending Language Models Against
Jailbreaking Attacks
Despite advances in AI alignment, language models (LMs) remain vulnerable to adversarial attacks, or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. While some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, …