Abstract
Visual Grounding (VG) refers to locating the region described by an expression in a specific image, and is a critical topic in vision-language fields. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of these pseudo-labels are noisy and lack diversity in their language taxonomy. Inspired by the advances in vision-language (V-L) pretraining, we consider utilizing VLP models to realize unsupervised transfer learning on the downstream grounding task. Thus, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to solve the VG problem. Building on an efficient model structure, we propose single-source and multi-source curriculum adapting methods for unsupervised VG that progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thereby achieving implicit knowledge exploitation and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method, Pseudo-Q, by a large margin in both single-source and multi-source scenarios, i.e., 6.78%~10.67% and 11.39%~24.87% on the RefCOCO/+/g datasets, and even outperforms existing weakly supervised methods. The code and models will be released at \url{this https URL}.
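To make the self-paced curriculum idea concrete, below is a minimal, hypothetical sketch of such a loop: the current grounding model scores each pseudo-labeled sample's reliability, only the most reliable subset is used for adaptation, and the reliability threshold is relaxed over successive rounds. All names here (`model.score`, `model.fit`, `thresholds`) are illustrative assumptions for exposition, not the paper's actual API or algorithm.

```python
# Hypothetical sketch of self-paced curriculum adaptation over pseudo-labels.
# Names and the scoring rule are illustrative assumptions, not the paper's implementation.
from typing import List, Tuple

Sample = Tuple[str, object]  # (pseudo-language label, image region) -- placeholder types


def self_paced_adapt(
    model,                       # any grounding model exposing score() and fit()
    pseudo_samples: List[Sample],
    thresholds: List[float],     # e.g. [0.9, 0.7, 0.5]: relax reliability per round
) -> None:
    """Progressively adapt on increasingly large, less-reliable pseudo-label subsets."""
    for tau in thresholds:
        # Score each pseudo-labeled pair with the current model; higher = more reliable.
        scored = [(model.score(expr, region), (expr, region))
                  for expr, region in pseudo_samples]
        # Keep only samples the current model already finds sufficiently consistent.
        reliable = [sample for score, sample in scored if score >= tau]
        if not reliable:
            continue
        # Adapt (fine-tune) on the selected subset, then repeat with a looser threshold.
        model.fit(reliable)
```

Starting with a strict threshold and relaxing it round by round is the standard self-paced pattern: easy, high-confidence pseudo-labels shape the model first, which in turn improves the scoring of noisier samples admitted later.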
URL
https://arxiv.org/abs/2305.08685