Abstract
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that produces a `visual prompt', which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance on many coarse-grained tasks such as image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during classifier training. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resampler can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
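To make the probing setup concrete, below is a minimal sketch of the two conditions the abstract contrasts: a linear diagnostic classifier trained on the output of a frozen resampler versus one trained jointly with it. All module names, shapes, and hyperparameters (the `DummyResampler` stand-in, 32 latents, a 3x3 grid of spatial labels) are illustrative assumptions, not the paper's actual architecture or data.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained resampler (e.g. a Perceiver-style module);
# the real model would compress visual-encoder patch features into a
# fixed set of latent tokens that serve as the visual prompt.
class DummyResampler(nn.Module):
    def __init__(self, feat_dim=1024, num_latents=32, prompt_dim=768):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, prompt_dim)

    def forward(self, patch_feats):  # (B, num_patches, feat_dim)
        q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)  # cross-attend to patches
        return self.proj(out)  # visual prompt: (B, num_latents, prompt_dim)

# Diagnostic classifier: a simple linear head that predicts a coarse
# spatial label (here, which cell of a hypothetical 3x3 grid contains
# the target object) from the mean-pooled visual prompt.
class SpatialProbe(nn.Module):
    def __init__(self, prompt_dim=768, num_regions=9):
        super().__init__()
        self.head = nn.Linear(prompt_dim, num_regions)

    def forward(self, visual_prompt):
        return self.head(visual_prompt.mean(dim=1))

resampler, probe = DummyResampler(), SpatialProbe()

# Frozen condition: only the probe is trained on the resampler output.
for p in resampler.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Joint condition: resampler and probe are optimised together instead:
# optimizer = torch.optim.Adam(
#     list(resampler.parameters()) + list(probe.parameters()), lr=1e-3)

patch_feats = torch.randn(4, 196, 1024)   # e.g. 14x14 ViT patch features
labels = torch.randint(0, 9, (4,))        # dummy spatial labels
loss = nn.functional.cross_entropy(probe(resampler(patch_feats)), labels)
loss.backward()
optimizer.step()
```

On this reading, the paper's finding is that probe accuracy stays low under the frozen condition but rises substantially under the joint one, indicating the compressed prompt can carry spatial information when the training signal demands it.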
URL
https://arxiv.org/abs/2404.13594