Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

2024-04-18 08:46:12
Qiyuan Dai, Sibei Yang

Abstract

Referring image segmentation (RIS) aims to precisely segment referents in images from corresponding natural language expressions, but it relies on cost-intensive mask annotations. Weakly supervised RIS therefore learns pixel-level semantics from image-text pairs, which makes segmenting fine-grained masks challenging. A natural approach to improving segmentation precision is to equip weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even cause performance regression, because of unavoidable noise and a tendency to focus excessively on object parts. In this paper, we present a framework, Point PrompTing (PPT), combined with a proposed multi-source curriculum learning strategy, to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noise and excessive-focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT progress gradually from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.
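
To make the idea of positive and negative point prompts concrete, here is a minimal sketch. It is not the PPT point generator, which the abstract describes as a learned module; it is a simple heuristic stand-in. It assumes a 2-D text-image similarity map is already available (for example, derived from CLIP features) and picks the highest-scoring locations as positive points and the lowest-scoring ones as negative points. The function name `heatmap_to_point_prompts` and its parameters are hypothetical.

```python
import numpy as np


def heatmap_to_point_prompts(sim_map, num_pos=1, num_neg=2):
    """Pick point prompts from a 2-D text-image similarity map.

    Heuristic stand-in (an assumption, not the paper's learned generator):
    the top-scoring pixels become positive points, the lowest-scoring
    pixels become negative points.

    Returns (point_coords, point_labels):
      point_coords: float32 array of shape (N, 2), each row is (x, y)
      point_labels: int64 array of shape (N,), 1 = positive, 0 = negative
    """
    sim_map = np.asarray(sim_map, dtype=np.float32)
    if sim_map.ndim != 2:
        raise ValueError("sim_map must be a 2-D array of shape (H, W)")
    h, w = sim_map.shape
    if num_pos < 0 or num_neg < 0 or num_pos + num_neg > h * w:
        raise ValueError("invalid number of requested points")

    # Flat pixel indices sorted by ascending similarity.
    order = np.argsort(sim_map.ravel(), kind="stable")
    neg_idx = order[:num_neg]          # least similar to the expression
    pos_idx = order[::-1][:num_pos]    # most similar to the expression

    idx = np.concatenate([pos_idx, neg_idx])
    ys, xs = np.unravel_index(idx, (h, w))

    # Prompts are stored as (x, y), i.e. column first, then row.
    point_coords = np.stack([xs, ys], axis=1).astype(np.float32)
    point_labels = np.concatenate(
        [np.ones(num_pos, dtype=np.int64), np.zeros(num_neg, dtype=np.int64)]
    )
    return point_coords, point_labels


if __name__ == "__main__":
    # Toy similarity map: one strong peak and two clearly dissimilar pixels.
    sim = np.zeros((4, 5), dtype=np.float32)
    sim[1, 3] = 0.9
    sim[2, 0] = -0.5
    sim[3, 4] = -0.2
    coords, labels = heatmap_to_point_prompts(sim, num_pos=1, num_neg=2)
    print(coords)
    print(labels)
```

The `(N, 2)` coordinate array with 1/0 labels follows the layout that the `segment_anything` package's `SamPredictor.predict` accepts through its `point_coords` and `point_labels` arguments. That interface expects coordinates in the pixel space of the input image, so a low-resolution similarity map would need its points rescaled first.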

URL

https://arxiv.org/abs/2404.11998

PDF

https://arxiv.org/pdf/2404.11998.pdf

