Paper Reading AI Learner

GS: Generative Segmentation via Label Diffusion

2025-08-27 16:28:15
Yuhao Chen, Shubin Chen, Liang Lin, Guangrun Wang

Abstract

Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models, thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state of the art for language-driven segmentation.
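The core idea above — sampling a segmentation mask from noise, conditioned on image and text — can be illustrated with a minimal DDPM-style reverse-sampling loop. This is a toy sketch, not the paper's actual implementation: the `denoiser` interface, the linear noise schedule, and the final thresholding are all illustrative assumptions.

```python
import numpy as np

def sample_label_diffusion(denoiser, image_feat, text_feat, shape, T=50, seed=0):
    """Toy DDPM-style sampling loop over a label map.

    `denoiser(x_t, t, image_feat, text_feat)` is assumed to be a trained
    epsilon-prediction network conditioned on image and text features;
    all names here are illustrative, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    # Linear noise schedule (an assumption; the paper's schedule may differ).
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, image_feat, text_feat)
        # Standard DDPM reverse-step posterior mean.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    # Threshold the continuous sample into a binary foreground mask.
    return (x > 0).astype(np.uint8)

# Usage with a stand-in "denoiser" (a real model would be a trained network).
mask = sample_label_diffusion(
    denoiser=lambda x, t, im, tx: x,
    image_feat=None, text_feat=None, shape=(8, 8),
)
```

The point of the sketch is the reversed direction of generation: the loop denoises a *label map*, not an image, with the image and text entering only as conditioning inputs to the denoiser.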

URL

https://arxiv.org/abs/2508.20020

PDF

https://arxiv.org/pdf/2508.20020.pdf
