Abstract
Jointly learning medical visual representations from image-report pairs has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from lengthy reports that feature complex discourse relations and pathology semantics. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents PLACE, a novel framework that promotes Pathological-Level Alignment and enriches fine-grained details via Correlation Exploration, without requiring additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach that maximizes the consistency of pathology observations drawn from both images and reports. To facilitate this, we introduce a Visual Pathology Observation Extractor that derives visual pathology observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, which enhances the generalizability and robustness of our method. Furthermore, we design a proxy task that requires the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that the proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
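The abstract describes two training signals: a pathological-level cross-modal alignment (PCMA) between visual and textual pathology observations, and a proxy task that asks the model to identify correlations among image patches. The sketch below is an illustrative reconstruction of how such objectives are commonly implemented, not the authors' code: the per-observation symmetric InfoNCE formulation, the cosine-similarity correlation target, and all shapes (K observation slots, N patches, embedding dimension D) are assumptions.

import torch
import torch.nn.functional as F

def pathological_level_alignment(vis_obs, txt_obs, temperature=0.07):
    """Symmetric InfoNCE between visual and textual pathology observations.

    vis_obs, txt_obs: (B, K, D) tensors holding K pathology-observation
    embeddings per image / report (K and D are hypothetical choices).
    """
    B, K, D = vis_obs.shape
    v = F.normalize(vis_obs, dim=-1).reshape(B * K, D)
    t = F.normalize(txt_obs, dim=-1).reshape(B * K, D)
    logits = v @ t.T / temperature               # (B*K, B*K) similarity matrix
    targets = torch.arange(B * K, device=v.device)
    # Contrast each visual observation against all textual observations,
    # and vice versa, treating matched pairs as positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def patch_correlation_proxy(patch_feats, predicted_corr):
    """Proxy task: regress the pairwise correlation structure of patches.

    patch_feats:    (B, N, D) patch embeddings providing the target
                    (which encoder supplies these is an assumption).
    predicted_corr: (B, N, N) patch-patch correlations predicted by the model.
    """
    p = F.normalize(patch_feats, dim=-1)
    target_corr = p @ p.transpose(1, 2)          # (B, N, N) cosine similarities
    return F.mse_loss(predicted_corr, target_corr)

# Usage sketch with random tensors standing in for real features.
if __name__ == "__main__":
    B, K, N, D = 4, 8, 196, 256
    loss_align = pathological_level_alignment(torch.randn(B, K, D),
                                              torch.randn(B, K, D))
    loss_corr = patch_correlation_proxy(torch.randn(B, N, D),
                                        torch.randn(B, N, N))
    print(loss_align.item(), loss_corr.item())

In the actual framework, the Visual Pathology Observation Extractor would produce vis_obs from localized image tokens and a text encoder would produce txt_obs from the report; random tensors stand in for both here.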
URL
https://arxiv.org/abs/2506.10573