Paper Reading AI Learner

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

2025-01-09 02:47:01
Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei

Abstract

Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been deployed in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLM jailbreak methods often bypass the model's safety mechanism through complex optimization procedures or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability on shuffled harmful instructions. That is, from the perspective of comprehension ability, MLLMs can understand shuffled harmful text-image instructions well. However, from the perspective of safety ability, they can be easily bypassed by shuffled harmful instructions, leading to harmful responses. We then propose a novel text-image jailbreak attack named SI-Attack. Specifically, to fully exploit the Shuffle Inconsistency and overcome the randomness of shuffling, we apply a query-based black-box optimization method that selects the most harmful shuffled inputs based on feedback from a toxicity judge model. A series of experiments shows that SI-Attack improves attack performance on three benchmarks. In particular, SI-Attack substantially improves the attack success rate against commercial MLLMs such as GPT-4o and Claude-3.5-Sonnet.
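The selection loop described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: it assumes word-level shuffling of the text prompt only (SI-Attack also operates on image inputs, and the shuffle granularity may differ), and `judge` stands in for the toxicity judge model that would score candidate responses in a real query-based black-box attack.

```python
import random


def shuffle_text(text, rng):
    """Shuffle the words of a prompt. Word-level shuffling is one
    possible granularity; the paper's method may shuffle at other
    granularities (e.g. characters or image patches)."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def select_shuffled_input(prompt, judge, n_queries=8, seed=0):
    """Query-based black-box selection: generate several shuffled
    variants and keep the one the judge scores as most harmful.
    `judge` is a hypothetical callable mapping text to a harmfulness
    score; in the real attack this feedback comes from a toxicity
    judge model evaluating the target MLLM's responses."""
    rng = random.Random(seed)
    best_variant, best_score = prompt, judge(prompt)
    for _ in range(n_queries):
        candidate = shuffle_text(prompt, rng)
        score = judge(candidate)
        if score > best_score:
            best_variant, best_score = candidate, score
    return best_variant, best_score
```

The key design point is that the optimization is purely query-based: no gradients or model internals are needed, which is what makes the approach applicable to closed-source commercial MLLMs.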

URL

https://arxiv.org/abs/2501.04931

PDF

https://arxiv.org/pdf/2501.04931.pdf

