Paper Reading AI Learner

Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

2024-04-19 14:43:48
Konstantinos Vilouras, Pedro Sanchez, Alison Q. O'Neil, Sotirios A. Tsaftaris

Abstract

Localizing the exact pathological regions in a given medical scan is an important imaging problem that requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains mechanisms (cross-attention) that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any further training on target data, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance.
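
The abstract reports mean IoU between predicted regions and bounding box annotations. As a rough illustration only, the sketch below shows one way such a score could be computed from a text-conditioned heatmap. The heatmap values, the fixed threshold of 0.5, and all function names are assumptions made for this example; the abstract does not specify the authors' feature selection or post-processing, so this is not their method.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def box_to_mask(box: Tuple[int, int, int, int], height: int, width: int) -> Mask:
    """Rasterize a box (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def threshold(heatmap: Sequence[Sequence[float]], tau: float) -> Mask:
    """Binarize a heatmap: pixels with value >= tau become 1 (assumed rule)."""
    return [[1 if v >= tau else 0 for v in row] for row in heatmap]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return inter / union if union else 0.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, target) mask pairs."""
    scores = [iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical 4x4 heatmap standing in for an upsampled cross-attention map.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.6],
    [0.0, 0.1, 0.1, 0.0],
]
pred = threshold(heatmap, 0.5)           # assumed threshold
target = box_to_mask((1, 1, 3, 3), 4, 4)  # hypothetical annotation box
score = iou(pred, target)
```

Here the prediction covers the annotated 2x2 box plus one extra pixel, so the intersection is 4 pixels and the union is 5. In practice the threshold would be a design choice, and a threshold-free metric such as AUC-ROC avoids fixing it.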

URL

https://arxiv.org/abs/2404.12920

PDF

https://arxiv.org/pdf/2404.12920.pdf

