Abstract
Visual Grounding (VG) refers to locating the region described by an expression in a specific image, and is a critical topic in vision-language fields. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of these pseudo-labels are noisy and lack diversity in their language taxonomy. Inspired by the advances in vision-language (V-L) pretraining, we consider utilizing VLP models to realize unsupervised transfer learning on the downstream grounding task. Thus, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to solve the VG problem. Building on an efficient model structure, we propose single-source and multi-source curriculum adapting methods for unsupervised VG that progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thereby achieving implicit knowledge exploitation and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method, Pseudo-Q, by a large margin in both single-source and multi-source scenarios, i.e., 6.78%~10.67% and 11.39%~24.87% on the RefCOCO/+/g datasets, and even outperforms existing weakly supervised methods. The code and models will be released at \url{this https URL}.
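To make the self-paced curriculum idea concrete, below is a minimal, hypothetical sketch of such a loop: the current grounding model scores each pseudo-labeled sample's reliability, only the most reliable subset is used for adaptation, and the reliability threshold is relaxed over successive rounds. All names here (`model.score`, `model.fit`, `thresholds`) are illustrative assumptions for exposition, not the paper's actual API or algorithm.

```python
# Hypothetical sketch of self-paced curriculum adaptation over pseudo-labels.
# Names and the scoring rule are illustrative assumptions, not the paper's implementation.
from typing import List, Tuple

Sample = Tuple[str, object]  # (pseudo-language label, image region) -- placeholder types


def self_paced_adapt(
    model,                       # any grounding model exposing score() and fit()
    pseudo_samples: List[Sample],
    thresholds: List[float],     # e.g. [0.9, 0.7, 0.5]: relax reliability per round
) -> None:
    """Progressively adapt on increasingly large, less-reliable pseudo-label subsets."""
    for tau in thresholds:
        # Score each pseudo-labeled pair with the current model; higher = more reliable.
        scored = [(model.score(expr, region), (expr, region))
                  for expr, region in pseudo_samples]
        # Keep only samples the current model already finds sufficiently consistent.
        reliable = [sample for score, sample in scored if score >= tau]
        if not reliable:
            continue
        # Adapt (fine-tune) on the selected subset, then repeat with a looser threshold.
        model.fit(reliable)
```

Starting with a strict threshold and relaxing it round by round is the standard self-paced pattern: easy, high-confidence pseudo-labels shape the model first, which in turn improves the scoring of noisier samples admitted later.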
URL
https://arxiv.org/abs/2305.08685