Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects

2024-06-22 22:43:10
Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick

Abstract

Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This raises the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failure points at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
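To make the experimental setup concrete, the sketch below shows one way to construct a same/different task and fine-tune a pretrained ViT on it. This is a minimal illustration, not the authors' protocol: the glyph-pair generator, the vit_base_patch16_224 checkpoint loaded through the timm library, and all hyperparameters are assumptions made for the example.

import random

import torch
import torch.nn as nn
from PIL import Image, ImageDraw
from torchvision import transforms
import timm  # assumed dependency; any pretrained-ViT wrapper would work

def make_pair(size=224, glyphs="ABCDEFG", same=None):
    # Render two characters side by side; the label is 1 iff they match.
    same = random.random() < 0.5 if same is None else same
    a = random.choice(glyphs)
    b = a if same else random.choice([g for g in glyphs if g != a])
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)  # default font is small; a truetype font could be used
    draw.text((size // 4, size // 2), a, fill="black")
    draw.text((3 * size // 4, size // 2), b, fill="black")
    return img, int(same)

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Pretrained ViT with a fresh 2-way head for the same/different judgment.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

model.train()
for step in range(100):  # toy loop; a real run needs far more data and steps
    imgs, labels = zip(*(make_pair() for _ in range(8)))
    x = torch.stack([to_tensor(im) for im in imgs])
    y = torch.tensor(labels)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Probing a model trained this way at each intermediate layer is the kind of analysis the abstract describes for locating the perceptual and relational processing stages.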

URL

https://arxiv.org/abs/2406.15955

PDF

https://arxiv.org/pdf/2406.15955.pdf

