Paper Reading AI Learner

To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

2025-10-09 17:44:42
Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal

Abstract

Large Vision-Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). The ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. The LLM, in contrast, interprets these tokens to perform high-level reasoning and generate responses, functioning as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most to understanding and reasoning, and how effectively these signals are propagated from the ViT to the LLM. While most existing work has focused on attention sinks within the LLM (low-semantic tokens that receive disproportionately high attention), we shift the focus to the vision encoder and identify a class of high-norm visual tokens produced by the ViT, referred to as ViT attention sinks, a phenomenon that has rarely been studied yet is highly consequential for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage this information, examining how, and to what extent, it is interpreted by the LLM. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
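
The central observation in the abstract is that a small set of ViT output tokens carries unusually large norms and behaves as "attention sinks." As a rough, hypothetical illustration of how such tokens might be detected (this is not the paper's procedure; the CLIP checkpoint, the image path, and the mean-plus-k-sigma outlier rule are all assumptions for the sketch), the snippet below computes per-patch token norms from a vision encoder and flags the outliers:

```python
# Minimal sketch: flag high-norm ViT output tokens as candidate "attention sinks".
# Uses a CLIP vision encoder from Hugging Face transformers; the outlier rule
# (mean + k * std) is an illustrative assumption, not the paper's criterion.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    hidden = vit(**inputs).last_hidden_state  # (1, 1 + num_patches, dim)

patch_tokens = hidden[0, 1:]        # drop the [CLS] token, keep patch tokens
norms = patch_tokens.norm(dim=-1)   # per-token L2 norm
k = 3.0                             # assumed outlier threshold
sink_mask = norms > norms.mean() + k * norms.std()

print(f"{int(sink_mask.sum())} candidate ViT attention-sink tokens "
      f"out of {norms.numel()} patch tokens")
```

Under this reading, the sink tokens are simply the patch positions whose embedding norms are statistical outliers; the paper's training-free and training-based methods then concern how such tokens are exposed to, and exploited by, the LLM.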

URL

https://arxiv.org/abs/2510.08510

PDF

https://arxiv.org/pdf/2510.08510.pdf

