Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank
Modifications
Large language models (LLMs) exhibit inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even to non-malicious fine-tuning. This study probes that brittleness by leveraging pruning and low-rank modifications. We develop methods to identify regions of the model that are critical for safety guardrails, and …
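The abstract does not spell out how safety-critical regions are identified, so the following is only an illustrative sketch of one plausible approach: score each weight with a first-order (SNIP-style) importance measure on a safety dataset and on a general-utility dataset, then take the set difference, keeping weights that rank highly for safety but not for utility. The function names, the `top_frac` parameter, and the use of synthetic NumPy arrays in place of real model weights and gradients are all assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def snip_importance(weights, grads):
    # SNIP-style first-order importance score: |w * dL/dw|
    # (an assumed proxy; the paper may use a different criterion)
    return np.abs(weights * grads)

def safety_critical_mask(weights, safety_grads, utility_grads, top_frac=0.05):
    """Flag weights that rank in the top `top_frac` by safety importance
    but NOT in the top `top_frac` by utility importance (set difference),
    i.e. weights plausibly dedicated to the safety guardrails."""
    s = snip_importance(weights, safety_grads).ravel()
    u = snip_importance(weights, utility_grads).ravel()
    k = max(1, int(top_frac * weights.size))
    top_safety = set(np.argsort(s)[-k:])
    top_utility = set(np.argsort(u)[-k:])
    mask = np.zeros(weights.size, dtype=bool)
    mask[list(top_safety - top_utility)] = True
    return mask.reshape(weights.shape)

# Synthetic stand-ins for one layer's weights and per-dataset gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
g_safety = rng.normal(size=(8, 8))
g_utility = rng.normal(size=(8, 8))

mask = safety_critical_mask(W, g_safety, g_utility, top_frac=0.1)
print("safety-critical weights flagged:", int(mask.sum()))
```

Pruning (zeroing) the weights selected by such a mask, and checking whether refusal behavior degrades while general capability is preserved, is one way a study like this could test how concentrated and brittle the safety mechanism is.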