Adversarial Attacks on Multimodal Agents

2024-06-18 17:32:48
Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

Abstract

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack targets white-box captioners when they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack targets a set of CLIP models jointly, and can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we discuss the implications for defenses.
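The abstract mentions a gradient-based perturbation of a single image under an L-infinity bound of $16/256$. One common way to realize such a constraint is projected signed-gradient descent, sketched below on a toy problem. This is only an illustration under stated assumptions, not the paper's implementation: the linear map `W` (standing in for an image encoder), the vector `t` (standing in for the embedding of an adversarial text string), the squared-distance loss, the step size, and the iteration count are all hypothetical, and the abstract does not say which optimizer the authors use.

```python
import numpy as np

EPS = 16 / 256  # L-infinity budget quoted in the abstract


def loss_and_grad(x, W, t):
    """Squared distance between the toy embedding W @ x and the target t."""
    r = W @ x - t
    return float(r @ r), 2.0 * (W.T @ r)


def pgd_linf(x0, W, t, eps=EPS, step=EPS / 10, iters=50):
    """Projected signed-gradient descent inside an L-infinity ball around x0."""
    x = x0.copy()
    for _ in range(iters):
        _, g = loss_and_grad(x, W, t)
        x = x - step * np.sign(g)            # signed gradient step toward the target
        x = np.clip(x, x0 - eps, x0 + eps)   # project back onto the L-infinity ball
        x = np.clip(x, 0.0, 1.0)             # keep pixel values in the valid range
    return x


# Hypothetical toy data: a 64-pixel "image", a random linear "encoder",
# and a random "text embedding" used as the optimization target.
rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=64)
W = rng.normal(size=(8, 64))
t = rng.normal(size=8)

x_adv = pgd_linf(x0, W, t)
print("max |perturbation|:", np.max(np.abs(x_adv - x0)))
print("loss before:", loss_and_grad(x0, W, t)[0])
print("loss after: ", loss_and_grad(x_adv, W, t)[0])
```

The two clipping steps are what enforce the constraint: the first keeps every pixel within `EPS` of its original value, and the second keeps the result a valid image. In a real attack the toy loss would be replaced by a differentiable objective computed through the captioner or CLIP models described above.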

URL

https://arxiv.org/abs/2406.12814

PDF

https://arxiv.org/pdf/2406.12814.pdf

