Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores the brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and …
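To make the two probing operations concrete, here is a minimal sketch of what "pruning" and "low-rank modification" of a single weight matrix can look like. This is an illustration of the general technique, not the paper's exact procedure: the function names, the toy importance score, and the choice of scoring rows on a calibration set are all assumptions made for the example.

```python
import torch

def low_rank_remove(W: torch.Tensor, rank: int) -> torch.Tensor:
    """Remove the top-`rank` singular components from a weight matrix.

    Measuring how model behavior changes after this removal is one way
    to test whether a capability (e.g., a safety guardrail) is carried
    by a small low-rank subspace of the weights.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Reconstruct W from everything *except* its top-`rank` components.
    return U[:, rank:] @ torch.diag(S[rank:]) @ Vh[rank:, :]

def prune_neurons(W: torch.Tensor, scores: torch.Tensor, frac: float) -> torch.Tensor:
    """Zero out the fraction `frac` of output neurons (rows) with the
    highest importance scores.

    `scores` is a per-row importance estimate; in practice it would be
    accumulated on safety vs. utility calibration data (assumption here,
    where a simple weight-magnitude score stands in).
    """
    k = int(frac * W.shape[0])
    top = torch.topk(scores, k).indices
    W = W.clone()
    W[top] = 0.0          # ablate the selected neurons
    return W

# Toy usage on a random matrix standing in for one linear layer.
W = torch.randn(256, 512)
W_low = low_rank_remove(W, rank=8)
W_pruned = prune_neurons(W, scores=W.abs().sum(dim=1), frac=0.03)
print(W_low.shape, int((W_pruned == 0).all(dim=1).sum()), "rows zeroed")
```

In a study like this one, such ablations are applied layer by layer while tracking safety and utility metrics separately, so that regions whose removal degrades safety but not general capability can be isolated.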