Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Despite advances in AI alignment, language models (LMs) remain vulnerable to adversarial attacks, or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. While some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, …