Abstract
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited by their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder alone, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. We find, however, that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as the cause of poor generalization between synthetic and real data. By using an image feature-based L2 distillation loss instead, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
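The feature-based L2 distillation loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following are assumptions made for illustration only: the loss is taken as the squared L2 distance between paired student and teacher image features, averaged over the batch; features are plain Python lists; and the function names `l2_distillation_loss` and `zero_shot_predict` are hypothetical. The second function shows why distilling only the image encoder can suffice for zero-shot classification: student image features are compared directly against class text embeddings produced by the unchanged teacher text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def l2_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between paired student and teacher features.

    Assumption: both arguments are equally sized batches (lists of vectors)
    computed on the same images. No text or labels are involved.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher batches must have the same size")
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student_feats)


def zero_shot_predict(image_feat, class_text_feats):
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the image feature."""
    img = l2_normalize(image_feat)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in class_text_feats
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because this loss only pulls student features toward the teacher's features for the same image, a student trained this way stays in the teacher's embedding space, which is what allows the teacher's class text embeddings to be reused at inference time.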
URL
https://arxiv.org/abs/2404.16637