Abstract
Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This raises the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive bias toward doing so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failure points at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
Abstract (translated)
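Although vision transformers (ViTs) have reached state-of-the-art performance in many settings, they fail in surprising ways on tasks that involve visual relations. This raises a question: how do ViTs attempt tasks that require computing visual relations between objects? Earlier attempts to interpret ViTs have mostly concentrated on characterizing the relevant low-level visual features. By contrast, we use methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use for abstract visual reasoning. As a case study, we take a basic but surprisingly difficult relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often show two qualitatively different stages of processing, even though they have no obvious inductive bias toward doing so: 1) a perceptual stage, in which local object features are extracted and stored in a disentangled representation, and 2) a relational stage, in which the object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability long thought to be out of reach for artificial neural networks. Finally, we show that failure points at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. Understanding ViTs in terms of discrete processing stages makes it possible to diagnose and correct the shortcomings of existing and future models more precisely.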
URL
https://arxiv.org/abs/2406.15955