BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Type: Preprint

Publication Date: 2024-10-28

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2410.20971

Abstract

Despite their superb multimodal capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks, which are inference-time attacks that induce the model to output harmful responses through carefully crafted prompts. It is thus essential to defend VLMs against potential jailbreaks for their trustworthy deployment in real-world applications. In this work, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: they either 1) fail to fully exploit cross-modal information, or 2) degrade the model's performance on benign inputs. To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends the black-box target VLM against jailbreak attacks without compromising its performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning to enhance cross-modal robustness. We empirically show on three VLMs (LLaVA, MiniGPT-4, and Gemini) and two safety benchmarks (MM-SafetyBench and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.
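To make the three-component design described in the abstract concrete, the sketch below shows one possible way such a defense wrapper around a black-box VLM could be organized. It is a minimal illustration only, assuming generic component interfaces: the names BlueSuffixDefense, purify_image, purify_text, generate_suffix, and query_vlm are hypothetical placeholders and do not reflect the authors' actual implementation.

```python
# Minimal sketch of a BlueSuffix-style three-stage defense pipeline, as described in the
# abstract. All component implementations are hypothetical placeholders, not the paper's code.

from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueSuffixDefense:
    purify_image: Callable[[bytes], bytes]    # 1) visual purifier against jailbreak images
    purify_text: Callable[[str], str]         # 2) textual purifier against jailbreak texts
    generate_suffix: Callable[[str], str]     # 3) RL-fine-tuned blue-team suffix generator
    query_vlm: Callable[[bytes, str], str]    # black-box target VLM (e.g., an API call)

    def respond(self, image: bytes, prompt: str) -> str:
        """Purify both modalities, append a defensive suffix, then query the target VLM."""
        clean_image = self.purify_image(image)
        clean_prompt = self.purify_text(prompt)
        suffix = self.generate_suffix(clean_prompt)
        return self.query_vlm(clean_image, clean_prompt + " " + suffix)
```

Because the wrapper only pre-processes the inputs and appends a generated suffix before calling the target model, it requires no access to the VLM's weights, which is consistent with the black-box setting the abstract emphasizes.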

Locations

  • arXiv (Cornell University)

Similar Works

  • Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment (2024). Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Kumar Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, Amrit Singh Bedi
  • IDEATOR: Jailbreaking VLMs Using VLMs (2024). Ruofan Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang
  • White-box Multimodal Jailbreaks Against Large Vision-Language Models (2024). Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang
  • Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts (2024). Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang
  • ImgTrojan: Jailbreaking Vision-Language Models with ONE Image (2024). Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong
  • Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt (2024). Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao
  • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (2023). Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
  • Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks (2024). Han Wang, Gang Wang, Huan Zhang
  • UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models (2024). Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Ernest Ma, Gaurav Verma, Srijan Kumar
  • Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? (2024). Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu
  • Defending Jailbreak Attack in VLMs via Cross-modality Information Detector (2024). Yue Xu, Xiuyuan Qi, Qin Zhan, Wenjie Wang
  • Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models (2024). Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Xuan Huang, Yujun Cai
  • Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks (2024). Md Zarif Hossain, Ahmed Imteaj
  • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation (2024). Yunhao Gou, Chaoyu Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
  • Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models (2023). Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh
  • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks (2024). Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao
  • Best-of-N Jailbreaking (2024). John D. Hughes, Stephen J. Price, Andy G. Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, E. W. Jones, Ethan Perez, Mrinank Sharma
  • Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models (2024). Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen
  • TrojVLM: Backdoor Attack Against Vision Language Models (2024). Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, Chao Chen
  • Revisiting Backdoor Attacks against Large Vision-Language Models (2024). Siyuan Liang, Jiawei Liang, Tianyu Pang, Chao Du, Aishan Liu, Ee-Chien Chang, Xiaochun Cao

Works That Cite This (0)

Works Cited by This (0)