Abstract
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. As a result, they cannot estimate multiple proposals, support human interaction, or adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, which is a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose ProGiDiff, a novel framework that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, that steers a pre-trained diffusion model to output segmentation masks. The framework extends naturally to a multi-class setting simply by prompting for the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods, and the approach could greatly benefit from an expert-in-the-loop setting that leverages multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred to MR image segmentation through low-rank, few-shot adaptation.
URL
https://arxiv.org/abs/2601.16060