Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, and enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
https://arxiv.org/abs/2505.16982
Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. Strikingly, the evaluation reveals that all models perform poorly, with most accuracies below 50%, and accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
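For readers reproducing the reported trend that accuracy fluctuates with the number of images per question, a minimal Python sketch of stratified scoring is shown below; the record fields (`n_images`, `answer`, `prediction`) are hypothetical, not MedFrameQA's actual schema.

```python
from collections import defaultdict

def accuracy_by_image_count(records):
    """Group VQA records by images-per-question and report accuracy per group.

    Each record is assumed to carry `n_images`, the gold `answer`,
    and the model `prediction` (hypothetical field names).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        k = r["n_images"]
        totals[k] += 1
        hits[k] += int(r["prediction"].strip().lower() == r["answer"].strip().lower())
    return {k: hits[k] / totals[k] for k in sorted(totals)}

# Toy usage: two-image questions vs. four-image questions.
demo = [
    {"n_images": 2, "answer": "pneumothorax", "prediction": "Pneumothorax"},
    {"n_images": 2, "answer": "effusion", "prediction": "atelectasis"},
    {"n_images": 4, "answer": "fracture", "prediction": "fracture"},
]
print(accuracy_by_image_count(demo))  # {2: 0.5, 4: 1.0}
```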
https://arxiv.org/abs/2505.16964
Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility, due to the lack of comprehensive, meaningful tasks and sufficiently diverse evaluations needed to characterize their benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes and early prediction of acute and chronic conditions, together with desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.
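The abstract's evaluation axes (calibration and subpopulation performance) can be illustrated with a small sketch using NumPy and scikit-learn; the binning choice, subgroup keys, and the synthetic data are assumptions, not the study's protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Standard binned ECE for binary outcomes."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def subgroup_auroc(y_true, y_prob, groups):
    """AUROC per subpopulation (e.g., age band), to surface performance gaps."""
    y_true, y_prob, groups = map(np.asarray, (y_true, y_prob, groups))
    out = {}
    for g in np.unique(groups):
        m = groups == g
        if len(np.unique(y_true[m])) == 2:   # AUROC needs both classes present
            out[g] = roc_auc_score(y_true[m], y_prob[m])
    return out

# Toy usage with synthetic predictions for a readmission-style task (illustrative only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
p = np.clip(y * 0.6 + rng.normal(0.2, 0.2, 500), 0, 1)
g = rng.choice(["<40", "40-65", ">65"], 500)
print(expected_calibration_error(y, p), subgroup_auroc(y, p, g))
```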
https://arxiv.org/abs/2505.16941
Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. Recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate the empirical effectiveness of CondKGCP through comprehensive evaluations.
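A minimal sketch of split conformal prediction with per-group (predicate-conditional) calibration may help make the coverage notions concrete; the grouping of similar predicates and the nonconformity score (one minus the model's answer score) are simplified stand-ins for CondKGCP's actual components.

```python
import numpy as np

def group_thresholds(cal_scores_by_group, alpha=0.1):
    """Per-group conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score (here, 1 - model score of the true answer)."""
    thresholds = {}
    for group, scores in cal_scores_by_group.items():
        s = np.sort(np.asarray(scores, dtype=float))
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        thresholds[group] = s[k - 1] if k <= n else np.inf  # too few samples -> include everything
    return thresholds

def prediction_set(candidate_scores, group, thresholds):
    """Keep every candidate whose nonconformity (1 - score) is within the group's threshold."""
    t = thresholds[group]
    return [c for c, s in candidate_scores.items() if 1.0 - s <= t]

# Toy usage: two merged predicate groups, then one query against the first group.
cal = {"treats-like": [0.20, 0.30, 0.50, 0.40, 0.35], "located-in-like": [0.10, 0.15, 0.20]}
thr = group_thresholds(cal, alpha=0.2)
print(prediction_set({"aspirin": 0.9, "ibuprofen": 0.55, "saline": 0.1}, "treats-like", thr))
# -> ['aspirin', 'ibuprofen']
```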
https://arxiv.org/abs/2505.16877
Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.
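A minimal PyTorch sketch of the VAE ingredients described (encoder, reparameterization, decoder, ELBO loss) is given below; the fully connected layers, flattened 64x64 inputs, and latent size are arbitrary assumptions rather than the study's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Fully connected VAE: encoder -> (mu, logvar) -> reparameterize -> decoder."""
    def __init__(self, in_dim=3 * 64 * 64, hidden=512, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(recon, x, mu, logvar, beta=1.0):
    """Reconstruction term + beta-weighted KL divergence to the unit Gaussian prior."""
    rec = F.binary_cross_entropy(recon, x, reduction="sum") / x.size(0)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return rec + beta * kld

# One illustrative optimization step on random "images" scaled to [0, 1].
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 3 * 64 * 64)
recon, mu, logvar = model(x)
loss = elbo_loss(recon, x, mu, logvar)
loss.backward(); opt.step()
```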
https://arxiv.org/abs/2505.16773
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are released at this https URL.
https://arxiv.org/abs/2505.16661
Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a Large Language Model, under the assumption that different anomalies in medical images may share common radiological signs within each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) aligned signs are further selected via an automatic sign selection strategy at inference to mitigate the under-fitting and uncertain-sample issues caused by limited medical data. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.
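To illustrate the sign-driven idea, the sketch below scores an image embedding against per-category radiological-sign text embeddings and keeps only the most similar signs per class; the embeddings are random placeholders and the top-k rule is a simplification of the paper's automatic sign selection.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_by_signs(image_emb, sign_embs_per_class, top_k=2):
    """Score each anomaly category by the mean similarity of its top-k sign embeddings
    (a simplified stand-in for automatic sign selection), then pick the argmax."""
    scores = {}
    for cls, sign_embs in sign_embs_per_class.items():
        sims = sorted((cosine(image_emb, e) for e in sign_embs), reverse=True)
        scores[cls] = float(np.mean(sims[:top_k]))
    return max(scores, key=scores.get), scores

# Toy usage with random vectors standing in for VLM text/image embeddings.
rng = np.random.default_rng(1)
img = rng.normal(size=64)
signs = {"pneumonia": [rng.normal(size=64) for _ in range(4)],
         "effusion": [img + 0.1 * rng.normal(size=64) for _ in range(4)]}  # deliberately close to img
print(classify_by_signs(img, signs))
```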
https://arxiv.org/abs/2505.16659
Empowered by vast internal knowledge reservoirs, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, little effort has been made to elicit a synergistic effect from the expertise and backgrounds of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs as well as to alleviate their divergence across questions. We also measure an LLM's confidence when it confronts adversarial opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.
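A minimal sketch of one possible collaboration round, confidence-weighted voting over multiple-choice answers with dissenters recorded, is shown below; the `agents` callables and confidence fields are placeholders, not the paper's actual protocol.

```python
from collections import Counter

def collaborate(question, options, agents):
    """Each agent returns (choice, confidence in [0, 1]); the group answer is the
    confidence-weighted vote. Agents are plain callables standing in for LLM calls."""
    votes = [(name, *fn(question, options)) for name, fn in agents.items()]
    weight = Counter()
    for _, choice, conf in votes:
        weight[choice] += conf
    group_choice = weight.most_common(1)[0][0]
    dissenters = [name for name, choice, _ in votes if choice != group_choice]
    return group_choice, votes, dissenters

# Toy usage with three hard-coded "models".
agents = {
    "llm_a": lambda q, o: ("B", 0.9),
    "llm_b": lambda q, o: ("B", 0.6),
    "llm_c": lambda q, o: ("C", 0.7),
}
print(collaborate("Which drug is first-line for X?", ["A", "B", "C", "D"], agents))
```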
https://arxiv.org/abs/2505.16648
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.
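The low-rank adaptation idea behind the fine-tuning can be sketched from scratch as a frozen linear layer plus a trainable low-rank update; the rank, scaling, and layer sizes below are illustrative and not tied to the Qwen2.5-VL implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.01)
        nn.init.zeros_(self.B.weight)               # update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# Toy usage: only the LoRA parameters receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768)).sum()
out.backward()
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # ['A.weight', 'B.weight']
```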
https://arxiv.org/abs/2505.16647
Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at this https URL.
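The complementarity intuition, that foreground and background probability maps should sum to one, can be sketched as a symmetric consistency term; the MSE form and stop-gradients below are assumptions, not CVBM's exact objective.

```python
import torch
import torch.nn.functional as F

def bidirectional_consistency(fg_logits, bg_logits):
    """Encourage p_fg to approximate 1 - p_bg in both directions for binary segmentation maps."""
    p_fg = torch.sigmoid(fg_logits)
    p_bg = torch.sigmoid(bg_logits)
    return F.mse_loss(p_fg, 1.0 - p_bg.detach()) + F.mse_loss(p_bg, 1.0 - p_fg.detach())

# Toy usage on a pair of 2D prediction maps from two decoder heads.
fg = torch.randn(2, 1, 64, 64, requires_grad=True)
bg = torch.randn(2, 1, 64, 64, requires_grad=True)
loss = bidirectional_consistency(fg, bg)
loss.backward()
print(float(loss))
```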
https://arxiv.org/abs/2505.16625
We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
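A minimal sketch of the two-step design, report generation followed by report-grounded answer generation, is given below with stub callables in place of the actual RG and AG models.

```python
def answer_with_report(images, question, report_generator, answer_generator):
    """Step 1: predict a radiology report from one or two CXRs.
    Step 2: ground the answer generator on that report plus the question."""
    report = report_generator(images)                     # a single CXR or a (prior, current) pair
    prompt = f"Report: {report}\nQuestion: {question}\nAnswer:"
    return answer_generator(images, prompt), report

# Toy usage with stub models.
rg = lambda imgs: "New right pleural effusion compared with the prior study."
ag = lambda imgs, prompt: "The right pleural effusion is new."
answer, evidence = answer_with_report(["cxr_t0.png", "cxr_t1.png"],
                                      "What are the differences between image X and Y?", rg, ag)
print(answer)
```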
https://arxiv.org/abs/2505.16624
Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par with it on the remaining datasets, while maintaining practical resource requirements. Our code is available at this https URL.
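A resource-regularized search in the spirit described, scoring configurations by validation Dice minus a penalty on training cost, can be sketched as follows; the search space, cost model, and penalty weight are invented for illustration and are not Regularized PriorBand itself.

```python
import random

SPACE = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [2, 4, 8],
    "num_pool_ops": [4, 5, 6],       # a stand-in architecture choice
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def regularized_score(val_dice, gpu_hours, lam=0.02):
    """Trade segmentation quality against training cost."""
    return val_dice - lam * gpu_hours

def search(train_and_eval, n_trials=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = sample_config(rng)
        dice, hours = train_and_eval(cfg)           # user-supplied training routine
        score = regularized_score(dice, hours)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best

# Toy objective: bigger batches/architectures help a little but cost more GPU-hours.
def fake_train_and_eval(cfg):
    dice = 0.80 + 0.01 * cfg["num_pool_ops"] + 0.005 * cfg["batch_size"]
    hours = 0.5 * cfg["batch_size"] + 0.3 * cfg["num_pool_ops"]
    return dice, hours

print(search(fake_train_and_eval))
```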
https://arxiv.org/abs/2505.16561
Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-altered version of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.
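The incremental texture emphasis can be sketched as blending each training image with a texture crop at a strength that ramps up over epochs; the linear schedule and convex blend are assumptions rather than TextureSAM's exact augmentation.

```python
import numpy as np

def texture_blend(image, texture, strength):
    """Convex blend of an image with a texture patch; strength in [0, 1]."""
    assert image.shape == texture.shape
    return (1.0 - strength) * image + strength * texture

def strength_schedule(epoch, total_epochs, max_strength=0.6):
    """Ramp texture strength up linearly so early epochs stay close to the originals."""
    return max_strength * min(1.0, epoch / max(1, total_epochs - 1))

# Toy usage on random arrays standing in for ADE20K images and texture crops.
rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
tex = rng.random((256, 256, 3))
for epoch in range(3):
    s = strength_schedule(epoch, total_epochs=3)
    augmented = texture_blend(img, tex, s)
    print(epoch, round(s, 2), augmented.shape)
```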
https://arxiv.org/abs/2505.16540
We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.
https://arxiv.org/abs/2505.16487
Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report can accurately identify the presence or absence of medical conditions, but also assesses whether it can precisely describe each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared to prior works, CLEAR's multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.
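Attribute-level comparison can be sketched as exact-match agreement over a small per-condition attribute dictionary; the field names mirror the abstract, but the matching rule is a simplification of CLEAR's scoring.

```python
ATTRIBUTES = ["present", "first_occurrence", "change", "severity", "location", "recommendation"]

def attribute_agreement(candidate, reference):
    """Per-attribute exact-match agreement over the union of reported conditions.

    `candidate` and `reference` map condition name -> {attribute: value};
    a missing attribute counts as disagreement unless it is missing from both."""
    per_attr = {a: [] for a in ATTRIBUTES}
    for cond in set(candidate) | set(reference):
        c, r = candidate.get(cond, {}), reference.get(cond, {})
        for a in ATTRIBUTES:
            per_attr[a].append(c.get(a) == r.get(a))
    return {a: sum(v) / len(v) for a, v in per_attr.items() if v}

# Toy usage: candidate and reference agree on everything except severity.
cand = {"pleural effusion": {"present": True, "change": "increased", "severity": "moderate",
                             "location": "right base", "recommendation": None, "first_occurrence": False}}
ref = {"pleural effusion": {"present": True, "change": "increased", "severity": "small",
                            "location": "right base", "recommendation": None, "first_occurrence": False}}
print(attribute_agreement(cand, ref))
```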
https://arxiv.org/abs/2505.16325
To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet. The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet. To mitigate domain discrepancies between medical and natural images, a Dynamic Feature Fusion Refiner is designed, which enhances small lesion feature extraction through multi-scale pooling and a dual-path calibration mechanism across channel and spatial dimensions. Furthermore, a Heterogeneous Omni-Attention Convergence Module (HOACM) is introduced, combining global contextual attention with branch-selective emphasis mechanisms to effectively fuse SAM2's local positional semantics and Mamba's long-range dependency modeling capabilities. Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods, particularly in boundary localization for complex pathological structures such as right ventricular anomalies. This work provides an efficient and reliable solution for automated cardiac disease diagnosis, and the code will be open-sourced.
https://arxiv.org/abs/2505.16304
Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To overcome this issue, we propose an efficient prototype consistency learning via joint uncertainty quantification and data augmentation (EPCL-JUDA) to enhance the semantic expression of prototypes based on the framework of Mean-Teacher. The concatenation of original and augmented labeled data is fed into the student network to generate expressive prototypes. Then, a joint uncertainty quantification method is devised to optimize pseudo-labels and generate reliable prototypes for original and augmented unlabeled data separately. High-quality global prototypes for each class are formed by fusing labeled and unlabeled prototypes, which are utilized to generate prototype-to-feature predictions for consistency learning. Notably, a prototype network is proposed to reduce the high memory requirements brought by the introduction of augmented data. Extensive experiments on the Left Atrium, Pancreas-NIH, and Type B Aortic Dissection datasets demonstrate EPCL-JUDA's superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.
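Computing class prototypes by masked average pooling and producing prototype-to-feature predictions by cosine similarity can be sketched as below; the uncertainty weighting and Mean-Teacher machinery of the full method are omitted.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """features: (B, C, H, W); labels: (B, H, W) integer maps.
    Returns a (num_classes, C) matrix of masked-average-pooled prototypes."""
    b, c, h, w = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
    labs = labels.reshape(-1)
    protos = torch.zeros(num_classes, c)
    for k in range(num_classes):
        mask = labs == k
        if mask.any():
            protos[k] = feats[mask].mean(dim=0)
    return protos

def prototype_predictions(features, protos, tau=0.1):
    """Cosine similarity between every pixel feature and every prototype, softmaxed over classes."""
    b, c, h, w = features.shape
    feats = F.normalize(features.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    sims = feats @ F.normalize(protos, dim=1).t() / tau          # (B*H*W, K)
    return sims.softmax(dim=1).reshape(b, h, w, -1).permute(0, 3, 1, 2)

# Toy usage with random features and integer (pseudo-)labels.
feats = torch.randn(2, 16, 32, 32)
labels = torch.randint(0, 3, (2, 32, 32))
protos = class_prototypes(feats, labels, num_classes=3)
print(prototype_predictions(feats, protos).shape)  # torch.Size([2, 3, 32, 32])
```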
https://arxiv.org/abs/2505.16283
Computed Tomography (CT) scans produce 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices), providing detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about anatomical regions on the CT scan, and even automatically generate a radiology report, is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomic complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures the across-slice spatial relationship with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.
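One way to realize a global-local token compression is to pool each slice's visual tokens down to a few local tokens and add a handful of volume-level global tokens; the token counts below are arbitrary and this is not CT-Agent's actual module.

```python
import torch

def global_local_compress(slice_tokens, local_per_slice=4, n_global=8):
    """slice_tokens: (num_slices, tokens_per_slice, dim).
    Returns (num_slices * local_per_slice + n_global, dim)."""
    s, t, d = slice_tokens.shape
    # Local: average-pool each slice's tokens down to `local_per_slice` tokens.
    local = torch.nn.functional.adaptive_avg_pool1d(
        slice_tokens.transpose(1, 2), local_per_slice).transpose(1, 2)   # (S, local_per_slice, D)
    # Global: pool across all slices and tokens to a handful of volume-level tokens.
    flat = slice_tokens.reshape(1, s * t, d).transpose(1, 2)             # (1, D, S*T)
    global_tok = torch.nn.functional.adaptive_avg_pool1d(flat, n_global).transpose(1, 2).squeeze(0)
    return torch.cat([local.reshape(-1, d), global_tok], dim=0)

# Toy usage: 300 slices with 64 visual tokens each.
tokens = torch.randn(300, 64, 512)
compressed = global_local_compress(tokens)
print(tokens.shape, "->", compressed.shape)   # (300*4 + 8, 512) tokens total
```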
https://arxiv.org/abs/2505.16229
Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians' inquiries regarding medical images. Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while the other is overlooked (in MedVQA, questions usually dominate the answer while images are overlooked), thereby failing to learn multimodal knowledge. To overcome the modality preference bias, we propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with the bias present and leverages causal graphs to eliminate it during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which yields acceptable performance even when a model suffers heavily from the modality preference bias. To address this issue, we reconstructed new datasets from existing MedVQA datasets by Changing the Prior dependencies (CP) between questions and their answers across the training and test sets. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE, RadVQA, SLAKE-CP, and RadVQA-CP datasets.
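Counterfactual debiasing at inference can be sketched as subtracting the logits of a question-only (image-blinded) branch from the fused multimodal logits; the branch design and subtraction rule are generic causal-VQA practice, not necessarily MedCFVQA's exact formulation.

```python
import torch

def debiased_logits(fused_logits, question_only_logits, gamma=1.0):
    """Counterfactual inference: remove the language-prior-only effect from the total effect."""
    return fused_logits - gamma * question_only_logits

# Toy usage: the language prior strongly favors answer index 0, the image favors index 2.
fused = torch.tensor([[2.5, 0.3, 2.4]])
q_only = torch.tensor([[2.3, 0.1, 0.2]])
print(debiased_logits(fused, q_only).argmax(dim=1))  # tensor([2])
```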
https://arxiv.org/abs/2505.16209
While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.
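Confidence-gated retrieval can be sketched as answering directly and falling back to retrieval plus regeneration only when confidence drops below a threshold; the `generate` and `retrieve` callables and the fixed threshold are placeholders for bRAGgen's dynamic mechanism.

```python
def adaptive_rag_answer(question, generate, retrieve, threshold=0.75):
    """`generate(question, context)` -> (answer, confidence in [0, 1]);
    `retrieve(question)` -> list of evidence snippets."""
    answer, conf = generate(question, context=None)
    if conf >= threshold:
        return answer, conf, []
    evidence = retrieve(question)                    # fall back to real-time evidence
    answer, conf = generate(question, context="\n".join(evidence))
    return answer, conf, evidence

# Toy usage with stubbed model and retriever.
def fake_generate(q, context):
    return ("Resume a protein-forward diet as tolerated.", 0.9 if context else 0.4)

def fake_retrieve(q):
    return ["Post-op guideline snippet A", "Society recommendation B"]

print(adaptive_rag_answer("When can I start solid foods after surgery?", fake_generate, fake_retrieve))
```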
https://arxiv.org/abs/2505.16102