Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Abstract

Referring image segmentation (RIS) aims to precisely segment the referent in an image given a corresponding natural language expression, but it relies on costly mask annotations. Weakly supervised RIS therefore learns pixel-level semantics from image-text pairs alone, which makes segmenting fine-grained masks challenging. A natural way to improve segmentation precision is to equip weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even cause performance regression, owing to unavoidable noise and an excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), combined with a proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability, but also generates negative point prompts that inherently and effectively address the noise and excessive-focus issues. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT progress gradually from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that PPT significantly and consistently outperforms prior weakly supervised techniques in mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
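
To make the idea of positive and negative point prompts concrete, the sketch below turns a text-image similarity heatmap into a small set of labeled points. It is only an illustration under stated assumptions, not the paper's learned point generator: the heatmap is a hand-written stand-in for a CLIP-style similarity map, the function name `heatmap_to_point_prompts` is hypothetical, and the top-k/bottom-k selection rule is a simple heuristic chosen for clarity. The output uses (x, y) pixel coordinates with label 1 for foreground and 0 for background, which is the convention SAM's point prompts use.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def heatmap_to_point_prompts(
    heatmap: List[List[float]], num_pos: int = 1, num_neg: int = 1
) -> Tuple[List[Point], List[int]]:
    """Pick the highest-scoring pixels as positive prompts and the
    lowest-scoring pixels as negative prompts (hypothetical heuristic).

    Returns (points, labels), where label 1 = foreground, 0 = background.
    """
    # Flatten the 2D map into (score, x, y) triples so pixels can be ranked.
    scored = [
        (score, x, y)
        for y, row in enumerate(heatmap)
        for x, score in enumerate(row)
    ]
    if num_pos < 0 or num_neg < 0:
        raise ValueError("point counts must be non-negative")
    if num_pos + num_neg > len(scored):
        raise ValueError("more points requested than pixels available")

    # Rank pixels from most to least similar to the text.
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    positives = [(x, y) for _, x, y in ranked[:num_pos]]
    negatives = [(x, y) for _, x, y in ranked[len(ranked) - num_neg:]]

    points = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return points, labels


# Toy 3x3 similarity map (assumed values): the referent is near the center.
toy_heatmap = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.0, 0.5, 0.2],
]
points, labels = heatmap_to_point_prompts(toy_heatmap, num_pos=1, num_neg=1)
print(points, labels)
```

In a real pipeline the resulting points and labels would be converted to arrays and passed to a promptable segmenter. The negative points are what let the prompt tell the segmenter which regions to exclude, which is the role the abstract assigns to them in countering noise and part-level over-focus.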

Abstract (translated)

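Referring image segmentation (RIS) aims to precisely segment the referent in an image through a corresponding natural language expression, yet it relies on costly mask annotations. Weakly supervised RIS therefore learns pixel-level semantics from image-text pairs, which makes segmenting fine-grained masks challenging. A natural way to improve segmentation precision is to strengthen weakly supervised RIS with the image segmentation foundation model SAM. However, we observe that merely integrating SAM does not bring much benefit and can even degrade performance because of an excessive focus on object parts. In this paper, we propose an innovative framework, Point Prompting (PPT), combined with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only exploits CLIP's text-image alignment capability and SAM's powerful mask generation ability, but also generates negative point prompts to resolve the noise and the excessive focus on object parts. In addition, we introduce a curriculum learning strategy built on object-centric images to help PPT progress gradually from simple semantic alignment to more complex RIS. Experiments show that PPT improves mIoU over prior weakly supervised techniques by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.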

URL

https://arxiv.org/abs/2404.11998

PDF

https://arxiv.org/pdf/2404.11998.pdf
