Abstract
Existing data poisoning attacks on retrieval-augmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then place semantic baits for the retriever or malicious instructions for the generator in the Focus Region, adapting to new targets at near-zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets × 2 retrievers × 3 generators), Eyes-on-Me raises the average attack success rate from 21.9% to 57.8% (+35.9 points, 2.6× over prior work). A single optimized attractor transfers to unseen black-box retrievers and generators without retraining. Our findings establish a scalable paradigm for RAG data poisoning and show that modular, reusable components pose a practical threat to modern AI systems. They also reveal a strong link between attention concentration and model outputs, informing interpretability research.
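The abstract ties the attack's effect to how strongly a small subset of attention heads concentrates on the Focus Region. The sketch below shows one plausible way to quantify that concentration and relate it to outcomes. The abstract does not give the paper's metric, so all of the following are assumptions: the per-head score (mean attention mass that chosen query positions place on the Focus Region span), the head subset, the choice of query positions, and the hypothetical helper names `focus_mass`, `head_concentration`, and `pearson`. Attention weights are plain nested lists indexed as `attn[head][query][key]`, with each row assumed to sum to 1.

```python
from typing import Sequence

# Hypothetical layout (assumption): attn[h][q][k] is the attention weight that
# query position q places on key position k in head h; each row sums to 1.
Row = Sequence[float]
Head = Sequence[Row]


def focus_mass(head: Head, query_positions: Sequence[int],
               focus_start: int, focus_end: int) -> float:
    """Mean attention mass that the query positions put on keys in the
    half-open Focus Region span [focus_start, focus_end)."""
    if not query_positions:
        raise ValueError("query_positions must be non-empty")
    total = 0.0
    for q in query_positions:
        total += sum(head[q][focus_start:focus_end])
    return total / len(query_positions)


def head_concentration(attn: Sequence[Head], heads: Sequence[int],
                       query_positions: Sequence[int],
                       focus_start: int, focus_end: int) -> float:
    """Average focus_mass over an assumed subset of heads."""
    if not heads:
        raise ValueError("heads must be non-empty")
    scores = [focus_mass(attn[h], query_positions, focus_start, focus_end)
              for h in heads]
    return sum(scores) / len(scores)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between per-document concentration scores
    and 0/1 attack-success labels (hypothetical analysis)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("zero variance")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy causal attention: 2 heads over 4 tokens; token 2 is the Focus Region.
    head0 = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
             [0.2, 0.3, 0.5, 0.0], [0.1, 0.1, 0.6, 0.2]]
    head1 = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
             [0.4, 0.4, 0.2, 0.0], [0.25, 0.25, 0.25, 0.25]]
    score = head_concentration([head0, head1], heads=[0, 1],
                               query_positions=[3], focus_start=2, focus_end=3)
    print(round(score, 3))  # prints 0.425
```

How the attention weights are extracted depends on the model framework and is left out. Correlating `head_concentration` with a 0/1 success label across many documents is one simple way to rank heads; it is an illustrative choice, not a claim about the authors' procedure.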
URL
https://arxiv.org/abs/2510.00586