Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its widespread use in breast cancer screening. The aim of this study was to develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and to assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year breast cancer risk prediction using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95% CI: 0.76-0.81), outperforming both the Meta AI DINOv2 baseline (0.73, 95% CI: 0.70-0.76, p<.001) and DenseNet-121 (0.74, 95% CI: 0.71-0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95% CI: 0.76-0.80) compared to DINOv2's 0.76 (95% CI: 0.74-0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95% CI: 0.60-0.74) than DBT-DINO's 0.62 (95% CI: 0.53-0.71, p=.60), although DBT-DINO performed better on cancerous lesions specifically, with a detection rate of 78.8% versus DINOv2's 77.3%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with the ImageNet-pretrained baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.
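The risk-prediction results above are reported as AUROCs with 95% confidence intervals. The abstract does not state how those intervals were computed; a common choice is a nonparametric percentile bootstrap over exams, sketched below with a rank-based AUROC. The function names and toy data are illustrative, not the paper's evaluation code.

```python
import random

def auroc(labels, scores):
    """Rank-based AUROC: the probability that a random positive case
    outscores a random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUROC:
    resample exams with replacement and recompute AUROC each time."""
    rng = random.Random(seed)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]
```

On a toy set of ten exams the interval is wide; with 106,417 exams, as in the study, it narrows accordingly.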
https://arxiv.org/abs/2512.13608
A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations that are semantically coherent across images.
https://arxiv.org/abs/2512.13534
Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at this https URL.
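The Clinical Reasoning Procedure Reward described above combines Node Coverage, Structural Correctness, and Chain Completeness. The abstract does not give the exact formulas, so the sketch below is a toy version over a reference Critical Evidence Graph and an extracted reasoning trace; the weights and the completeness criterion are assumptions for illustration only.

```python
def procedure_reward(ceg_nodes, ceg_edges, trace_nodes, trace_edges,
                     weights=(0.4, 0.3, 0.3)):
    """Toy CEG-based reasoning reward (weights are illustrative).

    node coverage:          fraction of reference evidence nodes the trace mentions
    structural correctness: fraction of trace edges present in the reference graph
    chain completeness:     1 if the trace covers every reference edge, else 0
    """
    coverage = len(ceg_nodes & trace_nodes) / len(ceg_nodes)
    correctness = len(ceg_edges & trace_edges) / len(trace_edges) if trace_edges else 0.0
    completeness = 1.0 if ceg_edges <= trace_edges else 0.0
    w_cov, w_cor, w_com = weights
    return w_cov * coverage + w_cor * correctness + w_com * completeness
```

A trace that mentions every evidence node and reproduces every reference edge scores 1.0; partial chains score proportionally lower.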
https://arxiv.org/abs/2512.13510
To address the issues that arise due to the manual navigation of guidewires in endovascular interventions, research in medical robotics has taken a strong interest in developing robotically steerable guidewires, which offer the possibility of enhanced maneuverability and navigation, as the tip of the guidewire can be actively steered. The COaxially Aligned STeerable (COAST) guidewire robot can generate a wide variety of motions, including bending motion with different bending lengths, follow-the-leader motion, and feedforward motion. In our past studies, we have explored different designs of the COAST guidewire robot and developed modeling, control, and sensing strategies for it. In this study, the performance of a modified COAST guidewire robot is evaluated by conducting navigation experiments in an anatomical phantom model with pulsatile flow. The modified robot is a simplified version of the COAST guidewire robot, consisting of two tubes rather than three. Through this study, we demonstrate the effectiveness of the modified COAST guidewire robot in navigating the tortuous phantom vasculature.
https://arxiv.org/abs/2512.13477
In medical data analysis, extracting deep insights from complex, multi-modal datasets is essential for improving patient care, increasing diagnostic accuracy, and optimizing healthcare operations. However, there is currently a lack of high-quality datasets specifically designed to evaluate the ability of large multi-modal models (LMMs) to discover medical insights. In this paper, we introduce MedInsightBench, the first benchmark that comprises 332 carefully curated medical cases, each annotated with thoughtfully designed insights. This benchmark is intended to evaluate the ability of LMMs and agent frameworks to analyze multi-modal medical image data, including posing relevant questions, interpreting complex findings, and synthesizing actionable insights and recommendations. Our analysis indicates that existing LMMs exhibit limited performance on MedInsightBench, which is primarily attributed to their challenges in extracting multi-step, deep insights and the absence of medical expertise. Therefore, we propose MedInsightAgent, an automated agent framework for medical data analysis, composed of three modules: Visual Root Finder, Analytical Insight Agent, and Follow-up Question Composer. Experiments on MedInsightBench highlight pervasive challenges and demonstrate that MedInsightAgent can improve the performance of general LMMs in medical data insight discovery.
https://arxiv.org/abs/2512.13297
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g., scanner model). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
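The alignment measurement above can be pictured as correlating classifier weight vectors in weight space. A minimal sketch, assuming cosine similarity between a primary head and auxiliary metadata heads; the function names and toy vectors are hypothetical, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def head_alignment(primary_head, aux_heads):
    """Correlate the primary task's classification-head weights with each
    auxiliary metadata head; a high |cosine| suggests the primary classifier
    re-uses the direction that encodes that metadata, i.e. a potential shortcut."""
    return {name: cosine(primary_head, w) for name, w in aux_heads.items()}
```

In the paper's terms, a trustworthy model shows high alignment with heads for clinically relevant factors and near-zero alignment with acquisition-factor heads.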
https://arxiv.org/abs/2512.13144
Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.
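The DPO training step mentioned above optimizes the student to prefer better questions. For one (chosen, rejected) question pair, the standard DPO objective can be sketched as follows; the log-probabilities and beta are illustrative inputs, and this is the generic loss rather than the paper's training code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where logp_* are summed token log-probs of the chosen (w) and rejected (l)
    questions under the policy and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; increasing the policy's relative preference for the chosen question drives the loss down.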
https://arxiv.org/abs/2512.13102
Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
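The uncertainty-regulated pseudo-label learning above can be illustrated with a simple entropy gate: pseudo-labels from a teacher are trusted only where the teacher's predictive distribution is low-entropy. The hard threshold and gate form below are assumptions for illustration, not UnCoL's exact weighting scheme.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def gated_pseudo_loss(teacher_probs, student_probs, tau=0.5):
    """Cross-entropy against teacher pseudo-labels, with supervision
    suppressed wherever the teacher's entropy exceeds tau (hard gate;
    a soft, uncertainty-weighted gate is equally possible)."""
    total, gated = 0.0, 0.0
    for t, s in zip(teacher_probs, student_probs):
        if entropy(t) > tau:            # unreliable pseudo-label: skip
            continue
        label = max(range(len(t)), key=t.__getitem__)
        total += -math.log(max(s[label], 1e-12))
        gated += 1.0
    return total / max(gated, 1.0)
```

Ambiguous regions (high teacher entropy) thus contribute no gradient, stabilizing the student exactly where supervision is least reliable.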
https://arxiv.org/abs/2512.13101
This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor's experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately CBR was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with CBR achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.
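Case-based reasoning predicts a new case by retrieving the most similar past cases and reusing their outcomes. A minimal nearest-neighbour sketch of that retrieve-and-reuse cycle; the distance metric, k, and the toy cases are illustrative, not the study's actual system.

```python
def cbr_predict(cases, query, k=3):
    """Case-based reasoning as k-nearest-neighbour retrieval:
    retrieve the k most similar past cases (Euclidean distance over
    feature vectors) and reuse their majority outcome.
    cases: list of (feature_vector, label); query: feature_vector."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    nearest = sorted(cases, key=lambda c: dist(c[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```

Real CBR systems add a revise/retain step, updating the case base with each newly solved case, which is what distinguishes them from plain k-NN classification.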
https://arxiv.org/abs/2512.13078
Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our comprehensive medical retrieval database of 18 million multimodal entries derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
https://arxiv.org/abs/2512.13072
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and an interpretable visualization of disease-to-healthy transformations. Experimental results on FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
https://arxiv.org/abs/2512.13008
3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.
https://arxiv.org/abs/2512.12887
Despite their remarkable performance, deep neural networks exhibit a critical vulnerability: small, often imperceptible, adversarial perturbations can lead to drastically altered model predictions. Given the stringent reliability demands of applications such as medical diagnosis and autonomous driving, robust detection of such adversarial attacks is paramount. In this paper, we investigate the geometric properties of a model's input loss landscape. We analyze the Intrinsic Dimensionality (ID) of the model's gradient parameters, which quantifies the minimal number of coordinates required to describe the data points on their underlying manifold. We reveal a distinct and consistent difference in the ID for natural and adversarial data, which forms the basis of our proposed detection method. We validate our approach across two distinct operational scenarios. First, in a batch-wise context for identifying malicious data groups, our method demonstrates high efficacy on datasets like MNIST and SVHN. Second, in the critical individual-sample setting, we establish new state-of-the-art results on challenging benchmarks such as CIFAR-10 and MS COCO. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92% on CIFAR-10. The results underscore the robustness of our geometric approach, highlighting that intrinsic dimensionality is a powerful fingerprint for adversarial detection across diverse datasets and attack strategies.
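Intrinsic dimensionality, as used above, can be estimated in several ways; one standard choice is the Levina-Bickel maximum-likelihood estimator built from nearest-neighbour distances. The sketch below applies it to toy point sets rather than to gradient features, so it illustrates the quantity itself, not the paper's detection pipeline.

```python
import math

def intrinsic_dim_mle(points, k=5):
    """Levina-Bickel MLE of intrinsic dimensionality, averaged over points:
    per point, id_hat = (k - 1) / sum_{j<k} log(T_k / T_j), where T_j is
    the distance to the j-th nearest neighbour."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    estimates = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        t_k = d[-1]
        s = sum(math.log(t_k / t) for t in d[:-1] if t > 0)
        if s > 0:
            estimates.append((k - 1) / s)
    return sum(estimates) / len(estimates)

# Points on a line have lower intrinsic dimensionality than a planar grid,
# even though both live in the same 2D ambient coordinates.
line = [(0.1 * i, 0.2 * i) for i in range(30)]
grid = [(float(x), float(y)) for x in range(6) for y in range(5)]
```

The detector in the paper exploits exactly this kind of gap: natural and adversarial inputs yield consistently different ID estimates.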
https://arxiv.org/abs/2512.12827
Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method relies on weaker supervision, since it only uses the dataset label. We propose a novel framework for CA, adaptable to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image synthesis quality compared to prior methods.
https://arxiv.org/abs/2512.12800
Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Because the opacity of AI algorithms challenges human interaction, explainable AI (XAI) addresses this by providing insight into AI decision-making, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnostic AI model with different XAI explanations to examine how XAI assistance, particularly from multimodal large language models (LLMs), influences diagnostic performance. AI assistance balanced across skin tones improved accuracy and reduced diagnostic disparities. However, LLM explanations yielded divergent effects: lay users showed higher automation bias (accuracy boosted when the AI was correct, reduced when it erred), while experienced PCPs remained resilient, benefiting irrespective of AI accuracy. Presenting AI suggestions first also led to worse outcomes for both groups when the AI was incorrect. These findings highlight XAI's varying impact based on expertise and timing, underscoring LLMs as a "double-edged sword" in medical AI and informing future human-AI collaborative system design.
https://arxiv.org/abs/2512.12500
The development of machine learning (ML) models based on the computed tomography (CT) imaging modality has been a major focus of recent research in the medical imaging domain. Incorporating a robust feature engineering approach can greatly improve the performance of these models. Topological data analysis (TDA), a recent development based on the mathematical field of algebraic topology, examines data from a topological perspective, extracting deeper insights and higher-dimensional structure from the data. Persistent homology (PH), a fundamental tool in TDA, can extract topological features such as connected components, cycles, and voids from the data. A popular approach to constructing PH from 3D CT images is the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach may not always yield the best performance and can suffer from computational complexity with higher-resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric medical imaging data, in particular the CT modality. A wide range of experiments was conducted on several datasets of 3D CT images to comprehensively analyze the performance of the proposed method with various parameters and benchmark it against the 3D cubical complex algorithm. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and time-efficiency. The proposed approach outperformed the cubical complex method, achieving average improvements of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient Python package, Patch-TDA, to facilitate the utilization of the proposed approach.
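The patch-based construction above replaces one filtration over the whole volume with many small ones. A minimal sketch of the patch decomposition step, assuming non-overlapping tiles by default; each resulting patch would then be fed to a cubical-complex persistence routine from a TDA library, which is omitted here to keep the sketch self-contained.

```python
def patch_slices(shape, patch, stride=None):
    """Enumerate ((z0, z1), (y0, y1), (x0, x1)) index ranges tiling a 3D
    volume into patches; persistent homology would then be computed per
    patch on its own small cubical complex, instead of building a single
    filtration over the full-resolution volume."""
    stride = stride or patch
    out = []
    for z in range(0, shape[0] - patch[0] + 1, stride[0]):
        for y in range(0, shape[1] - patch[1] + 1, stride[1]):
            for x in range(0, shape[2] - patch[2] + 1, stride[2]):
                out.append(((z, z + patch[0]), (y, y + patch[1]), (x, x + patch[2])))
    return out
```

Since each patch's complex is far smaller than the full volume's, the per-patch diagrams are cheaper to compute and can be processed in parallel, which is consistent with the time-efficiency gains reported above.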
https://arxiv.org/abs/2512.12108
Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
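The view-informed contrastive loss above encodes the fact that echocardiographic views determine which image-text pairs belong together. As a hedged sketch, the variant below treats every text sharing an image's view label as a positive in an InfoNCE-style objective; the temperature, matrix form, and averaging are assumptions, not the paper's exact loss.

```python
import math

def view_informed_nce(sim, views, tau=0.1):
    """InfoNCE over an image-text similarity matrix where text j counts as
    a positive for image i whenever both carry the same view label, so
    same-view pairs are never pushed apart as negatives."""
    n, total = len(sim), 0.0
    for i in range(n):
        exps = [math.exp(sim[i][j] / tau) for j in range(n)]
        denom = sum(exps)
        positives = [j for j in range(n) if views[j] == views[i]]
        total += -sum(math.log(exps[j] / denom) for j in positives) / len(positives)
    return total / n
```

With distinct view labels this reduces to standard pairwise InfoNCE; shared labels enlarge the positive set, which is the view-dependent structure the loss is meant to capture.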
https://arxiv.org/abs/2512.12107
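The abstract above does not publish the exact form of its negation-aware contrastive loss, but one plausible instantiation is an InfoNCE-style image-to-text loss in which captions flagged as negated findings (e.g. "no pericardial effusion") are up-weighted as hard negatives for images that do show the finding. A minimal sketch under that assumption, with hypothetical weights:

```python
# Hedged sketch of a negation-aware InfoNCE-style contrastive term for one
# image row of an image-text similarity matrix. neg_weights[j] >= 1 boosts
# the contribution of hard (negated) candidate texts; the positive pair
# always receives weight 1.
import math

def infonce_row(sims, pos_idx, neg_weights, temperature=0.07):
    """-log softmax of the positive pair, with weighted negatives."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    weighted = [
        (1.0 if j == pos_idx else neg_weights[j]) * math.exp(l - m)
        for j, l in enumerate(logits)
    ]
    return -math.log(math.exp(logits[pos_idx] - m) / sum(weighted))
```

Doubling the weight of a near-miss negated caption strictly increases the loss, which pushes the encoder to separate "effusion present" from "no effusion" embeddings — the clinically critical distinction the abstract highlights.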
Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37\% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52\% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: this https URL.
https://arxiv.org/abs/2512.11763
Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance must navigate complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at this https URL.
https://arxiv.org/abs/2512.11682
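The tool-retrieval step whose quality the abstract analyzes can be illustrated with a deliberately naive sketch: rank candidate tool descriptions against the agent's query by lexical overlap. The actual ToolUniverse retriever is not specified here, and the registry entries below are hypothetical stand-ins for the FDA/OpenTargets/Monarch tools.

```python
# Toy tool retrieval: score each registered tool description against the
# query with Jaccard overlap of lowercased word sets, return the top k.

def score(query, description):
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / (len(q | d) or 1)  # Jaccard similarity

def retrieve_tools(query, registry, k=2):
    """registry: dict of tool name -> description. Returns k best names."""
    ranked = sorted(registry.items(),
                    key=lambda kv: score(query, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

if __name__ == "__main__":
    registry = {
        "fda_drug_label": "retrieve fda drug label warnings and dosage",
        "open_targets": "query gene disease target associations",
        "monarch_phenotype": "lookup phenotype disease associations",
    }
    print(retrieve_tools("dosage warnings for a drug", registry, k=1))
    # → ['fda_drug_label']
```

The point the abstract makes is exactly that this scoring function matters: if the wrong tool is retrieved here, every downstream reasoning step inherits the error, so improving the retriever (e.g. replacing lexical overlap with embedding similarity) lifts end-to-end performance.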
Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the fact that Gaussians are closed under convolution and thus derive a \textit{closed-form analytical solution} for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$--10$\times$ speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way for translating real-time fetal 3D MRI into clinical routine. Code will be public at {this https URL}.
https://arxiv.org/abs/2512.11624
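The covariance-addition identity $\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$ can be sanity-checked numerically in 1D, where the covariances reduce to scalar variances: convolving a Gaussian of variance 0.5 with a Gaussian of variance 0.3 yields a Gaussian of variance 0.8. The sketch below assumes nothing from the authors' code; it just discretizes both densities and convolves them.

```python
# 1D numeric check: the convolution of two Gaussian densities is a
# Gaussian whose variance is the sum of the input variances.
import math

def gaussian(xs, var):
    return [math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)
            for x in xs]

def variance(xs, ys, dx):
    """Variance of a sampled density via Riemann sums."""
    mass = sum(ys) * dx
    mean = sum(x * y for x, y in zip(xs, ys)) * dx / mass
    return sum((x - mean) ** 2 * y for x, y in zip(xs, ys)) * dx / mass

dx = 0.05
xs = [i * dx for i in range(-160, 161)]  # grid on [-8, 8]; index of 0 is 160
hr = gaussian(xs, 0.5)   # "HR volume" Gaussian, variance 0.5
psf = gaussian(xs, 0.3)  # "PSF" Gaussian, variance 0.3

# Discrete convolution evaluated on the same grid:
# obs(x_i) = sum_j hr(x_j) * psf(x_i - x_j) * dx
n = len(xs)
obs = []
for i in range(n):
    total = 0.0
    for j in range(n):
        k = i - j + 160  # index such that xs[k] == x_i - x_j
        if 0 <= k < n:
            total += hr[j] * psf[k]
    obs.append(total * dx)

print(round(variance(xs, obs, dx), 3))  # ≈ 0.8 (= 0.5 + 0.3)
```

This is exactly what lets the paper replace Monte Carlo PSF sampling with one matrix addition per primitive: the observed blur of each anisotropic Gaussian is known in closed form, so gradients through the forward model are exact rather than stochastic estimates.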