Paper Reading AI Learner

Teaching CLIP to Count to Ten

2023-02-23 14:43:53
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

Abstract

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent, well-documented limitation: they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant, which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench", a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
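The core idea of the counting-contrastive loss can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper that swaps the count word in a caption, the function and parameter names, and the temperature value are all assumptions; the loss shown is a standard binary contrastive (softmax cross-entropy) objective that pushes the image embedding toward the correct-count caption and away from the counterfactual one.

```python
import numpy as np


def make_counterfactual(caption, correct_count_word, wrong_count_word):
    """Hypothetical helper: build a hard-negative caption by replacing the
    correct count word with an incorrect one (first occurrence only)."""
    return caption.replace(correct_count_word, wrong_count_word, 1)


def counting_contrastive_loss(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    """Sketch of a counting-contrastive loss: a two-way softmax cross-entropy
    between the correct caption and its counterfactual variant, scored by
    cosine similarity against the image embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    s_pos = cos(img_emb, pos_txt_emb) / temperature
    s_neg = cos(img_emb, neg_txt_emb) / temperature
    # Numerically stable -log softmax of the positive over {positive, negative}
    m = max(s_pos, s_neg)
    return -(s_pos - m) + np.log(np.exp(s_pos - m) + np.exp(s_neg - m))
```

For the paper's example, `make_counterfactual("Three dogs playing in the yard", "Three", "Six")` yields the hard negative `"Six dogs playing in the yard"`; in the paper this loss would be added to CLIP's original contrastive objective during finetuning rather than replacing it.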

URL

https://arxiv.org/abs/2302.12066

PDF

https://arxiv.org/pdf/2302.12066.pdf
