Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

Abstract

Localizing the exact pathological regions in a given medical scan is an important imaging problem, and solving it accurately requires a large number of ground-truth bounding box annotations. However, alternative, potentially weaker, forms of supervision exist, such as the accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains mechanisms (cross-attention) that implicitly align visual and textual features, leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any further training on target data, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and to refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art (SOTA) approaches that explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms these approaches on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance.
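As a rough illustration of the IoU metric named in the abstract, the sketch below scores a localization heatmap against a ground-truth bounding box. It is not the authors' evaluation code: the function names, the toy heatmap, the end-exclusive box convention, and the fixed 0.5 binarization threshold are all assumptions made for this example, since the abstract does not say how heatmaps are binarized.

```python
def box_to_mask(height, width, box):
    """Rasterize a box (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def threshold_heatmap(heatmap, threshold):
    """Binarize a heatmap: 1 where the value is >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            intersection += p & g
            union += p | g
    # Two empty masks have no union; report 0.0 by convention (an assumption).
    return intersection / union if union else 0.0


# Toy 4x4 heatmap standing in for a text-conditioned localization map.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.6, 0.0],
]
gt_box = (1, 1, 3, 3)  # hypothetical annotation covering the central 2x2 block

pred_mask = threshold_heatmap(heatmap, 0.5)  # 0.5 is an assumed threshold
gt_mask = box_to_mask(4, 4, gt_box)
score = iou(pred_mask, gt_mask)
print(f"IoU = {score:.2f}")
```

Mean IoU would then be the average of this score over all image-phrase pairs in the benchmark. AUC-ROC, by contrast, is threshold-free and is computed from the raw heatmap values against the pixel-level ground-truth mask.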

URL

https://arxiv.org/abs/2404.12920

PDF

https://arxiv.org/pdf/2404.12920.pdf
