Abstract
Recent advances in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems such as GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge here is the diversity of VLM architectures, which are built on different LLMs and employ different token types, varying in vocabulary size, token splits, and token index ordering. To avoid restricting distillation to a single VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, ultimately outperforming large-scale open- and closed-source VLMs.
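To picture the Recalibrator's role, here is a minimal sketch, not the paper's implementation: a learned projection maps student hidden states into the teacher's feature space so that a distillation loss can be computed there even when the two VLMs use different LLM backbones and hidden sizes. All names, dimensions, and the single-linear-map design below are illustrative assumptions; GenRecal's actual module may be considerably more elaborate.

```python
import random

# Hypothetical hidden sizes for a small student and a larger teacher VLM.
STUDENT_DIM, TEACHER_DIM = 4, 8

def make_matrix(rows, cols, seed=0):
    """Initialize a (rows x cols) weight matrix with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Recalibrator modeled (for illustration only) as one linear map
# from the student's feature space into the teacher's.
W = make_matrix(TEACHER_DIM, STUDENT_DIM)

def recalibrate(student_hidden):
    """Project a student hidden state into the teacher's feature space."""
    return [sum(w * x for w, x in zip(row, student_hidden)) for row in W]

def distill_loss(recalibrated, teacher_hidden):
    """Mean-squared distillation loss in the shared (teacher) space."""
    return sum((a - b) ** 2 for a, b in zip(recalibrated, teacher_hidden)) / len(teacher_hidden)

# One token's representations from each model (toy values).
student_h = [0.5, -0.2, 0.1, 0.9]
teacher_h = [0.3, 0.0, -0.1, 0.2, 0.4, -0.3, 0.1, 0.0]
loss = distill_loss(recalibrate(student_h), teacher_h)
```

The key point the sketch captures is that the loss is never computed across mismatched spaces: the student's features are first recalibrated into the teacher's dimensionality, after which any standard distillation objective applies.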
URL
https://arxiv.org/abs/2506.15681