Abstract
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than in prior attack settings because the attacker has limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack targets white-box captioners when they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack targets a set of CLIP models jointly, and the resulting perturbation can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses. Project page: this https URL. Code and data: this https URL.
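The abstract describes a gradient-based perturbation of a single image, bounded in L-infinity norm by $16/256$ and optimized jointly against several CLIP models. The following is a minimal sketch of that general idea (projected signed-gradient ascent against an ensemble), not the authors' implementation. It assumes toy linear maps as stand-ins for the CLIP image encoders, random vectors as stand-ins for the embeddings of the adversarial text, and a plain dot product as the similarity score; the names `embed`, `joint_score`, `joint_gradient`, `linf_attack`, and `demo`, as well as the step size and iteration count, are all hypothetical.

```python
import random

EPSILON = 16 / 256  # L-infinity budget quoted in the abstract


def embed(weights, image):
    """Toy linear 'image encoder' (an assumption, standing in for a CLIP encoder)."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def joint_score(encoders, targets, image):
    """Sum over encoders of the dot product between image and target embeddings."""
    return sum(
        sum(e * t for e, t in zip(embed(weights, image), target))
        for weights, target in zip(encoders, targets)
    )


def joint_gradient(encoders, targets, n_pixels):
    """Gradient of joint_score w.r.t. the image.

    For linear encoders this is constant; a real attack would recompute it
    with automatic differentiation at every step.
    """
    grad = [0.0] * n_pixels
    for weights, target in zip(encoders, targets):
        for row, t in zip(weights, target):
            for j, w in enumerate(row):
                grad[j] += w * t
    return grad


def linf_attack(image, encoders, targets, epsilon=EPSILON, step=1 / 256, n_steps=40):
    """Projected signed-gradient ascent inside an L-infinity ball around `image`."""
    adv = list(image)
    for _ in range(n_steps):
        grad = joint_gradient(encoders, targets, len(image))
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = adv[j] + step * sign
            # Project back into the epsilon-ball, then into the valid pixel range.
            moved = min(max(moved, image[j] - epsilon), image[j] + epsilon)
            adv[j] = min(max(moved, 0.0), 1.0)
    return adv


def demo(seed=0, n_pixels=12, dim=4, n_encoders=3):
    """Build random toy encoders and targets, run the attack, return scores."""
    rng = random.Random(seed)
    image = [rng.uniform(0.2, 0.8) for _ in range(n_pixels)]
    encoders = [
        [[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(dim)]
        for _ in range(n_encoders)
    ]
    targets = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_encoders)]
    adv = linf_attack(image, encoders, targets)
    before = joint_score(encoders, targets, image)
    after = joint_score(encoders, targets, adv)
    return image, adv, before, after
```

Calling `demo()` returns the clean image, the perturbed image, and the joint score before and after the attack. The projection step is what keeps every pixel within $16/256$ of its original value, so the perturbation stays small even though the score against all encoders increases.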
URL
https://arxiv.org/abs/2406.12814