Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

2024-04-21 09:23:36
Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, Arash Eshghi

Abstract

An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a 'visual prompt', which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
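To make the idea of a diagnostic classifier concrete, the following is a minimal sketch in plain Python, not the authors' setup. It assumes toy two-value "scenes" and two hypothetical frozen encoders (`keeps_position` and `drops_position`), and it fits a small logistic-regression probe on top of each one. Only the probe weights are trained, which mirrors the frozen-resampler condition described in the abstract; the jointly trained condition is not modeled here.

```python
import math
import random


def train_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe on fixed feature vectors.

    Only the probe weights (w, b) are updated; the features are treated
    as the output of a frozen encoder and never change.
    """
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of examples whose predicted side matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)


# Toy "scenes": a horizontal object position and an unrelated appearance value.
rng = random.Random(0)
scenes = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
labels = [1 if pos > 0 else 0 for pos, _ in scenes]  # 1 = object on the right

# Two hypothetical frozen encoders (assumptions, not the paper's models).
keeps_position = [[pos, app] for pos, app in scenes]
drops_position = [[app] for _, app in scenes]


def probe_accuracy(features, split=300):
    """Train the probe on the first `split` examples, test on the rest."""
    w, b = train_probe(features[:split], labels[:split])
    return accuracy(w, b, features[split:], labels[split:])


acc_keep = probe_accuracy(keeps_position)
acc_drop = probe_accuracy(drops_position)
print(f"probe on position-preserving features: {acc_keep:.2f}")
print(f"probe on position-discarding features: {acc_drop:.2f}")
```

In this toy setting, a probe that stays near chance on a frozen representation suggests the spatial attribute was not retained by that representation, which is the kind of evidence the abstract reports for frozen resampler outputs.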

URL

https://arxiv.org/abs/2404.13594

PDF

https://arxiv.org/pdf/2404.13594.pdf

