Abstract
Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment the regions of an image that correspond to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models, thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state of the art for language-driven segmentation.
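
The central reversal described above, making the mask rather than the image the diffusion target, can be illustrated with a short sketch. The code below is a minimal assumption-laden illustration, not the paper's implementation: a DDPM-style noise-prediction objective, the tiny conditioning scheme, and all names, shapes, and hyperparameters (LabelDenoiser, cond_dim, the 1000-step schedule) are hypothetical choices for exposition only.

    # Sketch of label diffusion (illustrative only; not the GS architecture).
    # The denoiser predicts the noise added to a segmentation mask, conditioned
    # on image and text features, so the mask itself is the generative target.
    import torch
    import torch.nn as nn

    class LabelDenoiser(nn.Module):
        """Hypothetical denoiser over mask space, conditioned on image + text."""
        def __init__(self, mask_ch=1, cond_dim=256, timesteps=1000):
            super().__init__()
            self.time_embed = nn.Embedding(timesteps, cond_dim)
            self.net = nn.Sequential(
                nn.Conv2d(mask_ch + cond_dim, 64, 3, padding=1), nn.SiLU(),
                nn.Conv2d(64, mask_ch, 3, padding=1),
            )

        def forward(self, noisy_mask, t, img_feat, txt_feat):
            # Fuse image features, pooled text features, and the timestep
            # embedding, then broadcast over the mask's spatial grid.
            cond = img_feat + txt_feat + self.time_embed(t)            # (B, C)
            cond = cond[:, :, None, None].expand(-1, -1, *noisy_mask.shape[2:])
            return self.net(torch.cat([noisy_mask, cond], dim=1))     # pred. noise

    def training_step(model, mask, img_feat, txt_feat, alphas_cumprod):
        """One DDPM-style step: noise the ground-truth mask, predict the noise."""
        b = mask.size(0)
        t = torch.randint(0, alphas_cumprod.size(0), (b,), device=mask.device)
        noise = torch.randn_like(mask)
        a = alphas_cumprod[t].view(b, 1, 1, 1)
        noisy_mask = a.sqrt() * mask + (1 - a).sqrt() * noise          # forward process
        pred = model(noisy_mask, t, img_feat, txt_feat)
        return nn.functional.mse_loss(pred, noise)

At inference, one would run the learned reverse process from pure noise, conditioning on the image and caption at every step, and threshold or argmax the final sample to obtain the predicted mask.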
URL
https://arxiv.org/abs/2508.20020