Paper Reading AI Learner

Teaching CLIP to Count to Ten

2023-02-23 14:43:53
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

Abstract

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent, well-documented limitation: they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant, which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench", a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
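The core idea of the counting-contrastive loss can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper that swaps the count word in a caption, the function and parameter names, and the temperature value are all assumptions; the loss shown is a standard binary contrastive (softmax cross-entropy) objective that pushes the image embedding toward the correct-count caption and away from the counterfactual one.

```python
import numpy as np


def make_counterfactual(caption, correct_count_word, wrong_count_word):
    """Hypothetical helper: build a hard-negative caption by replacing the
    correct count word with an incorrect one (first occurrence only)."""
    return caption.replace(correct_count_word, wrong_count_word, 1)


def counting_contrastive_loss(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    """Sketch of a counting-contrastive loss: a two-way softmax cross-entropy
    between the correct caption and its counterfactual variant, scored by
    cosine similarity against the image embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    s_pos = cos(img_emb, pos_txt_emb) / temperature
    s_neg = cos(img_emb, neg_txt_emb) / temperature
    # Numerically stable -log softmax of the positive over {positive, negative}
    m = max(s_pos, s_neg)
    return -(s_pos - m) + np.log(np.exp(s_pos - m) + np.exp(s_neg - m))
```

For the paper's example, `make_counterfactual("Three dogs playing in the yard", "Three", "Six")` yields the hard negative `"Six dogs playing in the yard"`; in the paper this loss would be added to CLIP's original contrastive objective during finetuning rather than replacing it.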

URL

https://arxiv.org/abs/2302.12066

PDF

https://arxiv.org/pdf/2302.12066.pdf
