Abstract
Targeted transfer-based adversarial attacks pose a significant threat to large visual-language models (VLMs). However, state-of-the-art (SOTA) transfer-based attacks are costly because they require a large number of iterations. Furthermore, the adversarial examples they generate contain conspicuous adversarial noise and are of limited effectiveness against defense methods such as DiffPure. To address these issues, and inspired by score matching, we introduce AdvDiffVLM, which uses diffusion models to generate natural, unrestricted adversarial examples. Specifically, AdvDiffVLM employs Adaptive Ensemble Gradient Estimation to modify the score during the diffusion model's reverse generation process, so that the resulting adversarial examples contain natural adversarial semantics and therefore transfer better. To further improve the quality of the adversarial examples, we employ a GradCAM-guided Mask method that disperses adversarial semantics throughout the image rather than concentrating them in a specific area. Experimental results show that our method is 10x to 30x faster than existing transfer-based attack methods while producing higher-quality adversarial examples. The generated adversarial examples also transfer strongly and are more robust to adversarial defense methods. Notably, AdvDiffVLM can successfully attack commercial VLMs, including GPT-4V, in a black-box manner.
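The score modification described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the weighting rule, the loss, or how the mask is built, so the softmax weighting in `ensemble_weights`, the `scale` parameter, and the precomputed `mask` array (standing in for a GradCAM-derived mask) are all assumptions made for illustration. The sketch only shows the general shape of the idea, namely adding a masked, weighted combination of surrogate-model gradients to a diffusion score.

```python
import numpy as np


def ensemble_weights(losses, temperature=1.0):
    """Hypothetical adaptive rule: softmax over per-surrogate losses,
    so surrogates with a larger remaining loss receive more weight."""
    z = np.asarray(losses, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def guided_score(score, grads, losses, mask, scale=1.0):
    """Shift the score by a masked, weighted ensemble gradient.

    score  : score estimate at the current reverse step
    grads  : list of loss gradients w.r.t. the image, one per surrogate
    losses : list of scalar adversarial losses, one per surrogate
    mask   : array in [0, 1] with the image's shape (placeholder for a
             GradCAM-derived mask; not computed here)
    scale  : guidance strength (assumed hyperparameter)
    """
    w = ensemble_weights(losses)
    # Weighted sum over the surrogate axis: (k,) x (k, ...) -> (...)
    ens_grad = np.tensordot(w, np.stack(grads), axes=1)
    # The loss is minimized, so the gradient is subtracted.
    return score - scale * mask * ens_grad


# Toy usage with a 3-element "image" and two surrogates.
score = np.zeros(3)
grads = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
mask = np.array([1.0, 1.0, 0.0])  # the last element is left unguided
new_score = guided_score(score, grads, losses=[1.0, 1.0], mask=mask, scale=2.0)
```

In this toy setting a mask value of 0 leaves the original score untouched at that position, which is one simple way a mask could limit where adversarial guidance is applied.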
URL
https://arxiv.org/abs/2404.10335