Abstract
We introduce a new type of indirect injection vulnerability in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and based on the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.
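The abstract describes the attack only at a high level (images that act as soft prompts and steer the model's outputs). The sketch below is a hypothetical illustration, not the authors' implementation: ToyVLM is a stand-in for a real visual language model, and craft_meta_instruction_image shows the general, assumed recipe of optimizing a bounded pixel perturbation so that the model's output on the perturbed image matches text expressing the adversary's meta-objective.

# Hypothetical sketch: optimize an image perturbation so a visual language
# model's answer follows an adversary-chosen meta-objective (e.g. positive spin).
# ToyVLM and all hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLM(nn.Module):
    """Placeholder visual language model: image -> logits over a toy vocabulary."""
    def __init__(self, vocab_size=1000, seq_len=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256), nn.ReLU())
        self.decoder = nn.Linear(256, vocab_size * seq_len)
        self.vocab_size, self.seq_len = vocab_size, seq_len

    def forward(self, image):
        h = self.encoder(image)
        return self.decoder(h).view(-1, self.seq_len, self.vocab_size)

def craft_meta_instruction_image(model, image, target_ids, eps=8 / 255, steps=200, lr=1e-2):
    """PGD-style search for a small perturbation delta such that the model's
    output on (image + delta) matches token ids expressing the meta-objective."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model((image + delta).clamp(0, 1))
        # language-modeling loss toward the adversary's desired ("spun") answer
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        # keep the perturbation small so the image still looks unmodified
        delta.data.clamp_(-eps, eps)
    return (image + delta).clamp(0, 1).detach()

if __name__ == "__main__":
    model = ToyVLM()
    image = torch.rand(1, 3, 224, 224)              # benign input image
    target_ids = torch.randint(0, 1000, (1, 16))    # tokens of the desired output
    adv_image = craft_meta_instruction_image(model, image, target_ids)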
URL
https://arxiv.org/abs/2407.08970