Abstract
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). The ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. The LLM, in turn, interprets these tokens to perform high-level reasoning and generate responses, functioning as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most to understanding and reasoning, and how effectively these signals propagate from the ViT to the LLM. While most existing work has focused on identifying attention sinks (low-semantic tokens that receive disproportionately high attention) within the LLM, we shift the focus to the vision encoder and identify a class of high-norm visual tokens from the ViT, which we refer to as ViT attention sinks -- a phenomenon that has received little study yet matters greatly for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, enabling the LLM to understand and reason more effectively. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better exploit this information, examining how, and to what extent, the LLM interprets it. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks for enhancing visual reasoning.
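To make the core idea concrete, here is a minimal sketch of how one might flag candidate ViT attention sinks by token norm, following the abstract's characterization of them as high-norm visual tokens. This is an illustrative assumption, not the paper's implementation: the function name `find_vit_sink_tokens`, the top-k selection rule, and the token shapes are all hypothetical.

```python
import torch

def find_vit_sink_tokens(vit_tokens: torch.Tensor, k: int = 8):
    """
    Flag candidate "ViT attention sink" tokens by their L2 norm.
    (Illustrative sketch; the paper's exact criterion may differ.)

    vit_tokens: (num_tokens, hidden_dim) patch embeddings from the
                vision encoder's final layer.
    k:          number of highest-norm tokens to treat as sinks.

    Returns the indices and embeddings of the top-k highest-norm tokens.
    """
    norms = vit_tokens.norm(dim=-1)        # per-token L2 norm
    sink_idx = norms.topk(k).indices       # positions of the highest-norm tokens
    return sink_idx, vit_tokens[sink_idx]

# Toy usage: 576 patch tokens (a 24x24 grid) of dimension 1024,
# roughly the shape produced by CLIP-style ViTs in many LVLMs.
tokens = torch.randn(576, 1024)
tokens[torch.tensor([3, 41, 200])] *= 10.0  # simulate a few high-norm outliers
idx, sinks = find_vit_sink_tokens(tokens, k=3)
print(idx.sort().values)                    # recovers the simulated outlier positions
```

Under the paper's framing, the tokens selected this way would then be passed to the LLM explicitly rather than discarded or treated like ordinary patch tokens.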
URL
https://arxiv.org/abs/2510.08510