Paper Reading AI Learner

Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

2025-06-12 11:01:57
Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He

Abstract

Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework, PLACE, that promotes Pathological-Level Alignment and enriches fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our method. Furthermore, we design a proxy task that requires the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
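The abstract does not spell out the PCMA objective, but aligning matched visual and textual pathology-observation embeddings is commonly realized as a symmetric InfoNCE-style contrastive loss. The sketch below illustrates that general idea with plain NumPy; the function name, embedding shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pathology_alignment_loss(visual_obs, text_obs, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of visual_obs is assumed to be the
    visual pathology observation matching row i of text_obs (hypothetical API)."""
    v = l2_normalize(visual_obs)
    t = l2_normalize(text_obs)
    logits = v @ t.T / temperature          # (K, K) cosine-similarity logits
    labels = np.arange(len(v))              # matched pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average over both directions: image-to-report and report-to-image
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


# toy example: 3 pathology observations with 4-d embeddings
rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4))
loss_matched = pathology_alignment_loss(v, v)       # perfectly aligned pairs
loss_misaligned = pathology_alignment_loss(v, -v)   # anti-aligned pairs
```

As expected for a contrastive objective, the loss is near zero when each visual observation is most similar to its paired textual observation and grows as the pairs diverge.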


URL

https://arxiv.org/abs/2506.10573

PDF

https://arxiv.org/pdf/2506.10573.pdf

