Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
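The core reweighting idea, dividing a correct rollout's reward by the size of its strategy cluster, can be sketched in a few lines; the function name and the (cluster_id, is_correct) rollout encoding below are illustrative assumptions, not the paper's interface:

```python
from collections import Counter

def uniqueness_weighted_rewards(rollouts):
    """Reward = correctness / size of the rollout's strategy cluster.

    `rollouts` is a list of (cluster_id, is_correct) pairs for a single
    problem; cluster ids are assumed to come from an external LLM judge.
    """
    sizes = Counter(cid for cid, _ in rollouts)
    # Correct rollouts in rare clusters earn the largest reward;
    # incorrect rollouts earn nothing regardless of novelty.
    return [1.0 / sizes[cid] if ok else 0.0 for cid, ok in rollouts]

# Three rollouts share strategy "A"; one uses a rare strategy "B".
rewards = uniqueness_weighted_rewards(
    [("A", True), ("A", True), ("A", False), ("B", True)]
)
```

Under this scheme the lone correct "B" rollout receives three times the reward of each correct "A" rollout, which is the pressure toward strategy diversity the abstract describes.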
https://arxiv.org/abs/2601.08763
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such an opaque process offers no reliable basis for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.
https://arxiv.org/abs/2601.08758
Thoracic aortic dissection and aneurysms are the most lethal diseases of the aorta. The major hindrance to treatment lies in the accurate analysis of the medical images. In particular, aortic segmentation of the 3D image is often tedious and difficult. Deep-learning-based segmentation models are an ideal solution, but their inability to deliver usable outputs in difficult cases and their computational cost cause their clinical adoption to stay limited. This study presents an innovative approach for efficient aortic segmentation using targeted region of interest (ROI) detection. In contrast to classical detection models, we propose a simple and efficient detection model that can be widely applied to detect a single ROI. Our detection model is trained as a multi-task model, using an encoder-decoder architecture for segmentation and a fully connected network attached to the bottleneck for detection. We compare a one-step segmentation model (nnU-Net) applied to the complete image against our cascade model, composed of a detection step followed by a segmentation step. We achieve a mean Dice similarity coefficient of 0.944, exceeding 0.9 in every case, while using a third of the computing power. This simple solution achieves state-of-the-art performance while being compact and robust, making it an ideal solution for clinical applications.
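The reported metric, the Dice similarity coefficient, is straightforward to compute for binary masks; the following is the standard textbook definition, not code from the paper:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # 2|A ∩ B| / (|A| + |B|); eps guards against two empty masks.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# 4-pixel prediction vs. 6-pixel ground truth, overlapping on 4 pixels.
pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=np.uint8); gt[1:3, 1:4] = 1
score = dice_coefficient(pred, gt)  # 2*4 / (4 + 6) = 0.8
```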
https://arxiv.org/abs/2601.08683
Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising the Cardiac Reasoning Template (CRT) and CardiacMind to enhance an MLLM's echocardiographic reasoning by introducing a cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning-path construction without the need for costly case-by-case verification. To incentivize the reasoning MLLM under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: the Procedural Quantity Reward (PQtR), the Procedural Quality Reward (PQlR), and the Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning, PQlR promotes the integration of evidence across views and modalities, and ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. A user study on our method's reasoning outputs shows 93.33% clinician agreement with its cardiologist-like reasoning logic. Our code will be made available.
https://arxiv.org/abs/2601.08440
Accurate 3D medical image segmentation is vital for diagnosis and treatment planning, but state-of-the-art models are often too large for clinics with limited computing resources. Lightweight architectures typically suffer significant performance loss. To address these deployment and speed constraints, we propose Region- and Context-aware Knowledge Distillation (ReCo-KD), a training-only framework that transfers both fine-grained anatomical detail and long-range contextual information from a high-capacity teacher to a compact student network. The framework integrates Multi-Scale Structure-Aware Region Distillation (MS-SARD), which applies class-aware masks and scale-normalized weighting to emphasize small but clinically important regions, and Multi-Scale Context Alignment (MS-CA), which aligns teacher-student affinity patterns across feature levels. Implemented on nnU-Net in a backbone-agnostic manner, ReCo-KD requires no custom student design and is easily adapted to other architectures. Experiments on multiple public 3D medical segmentation datasets and a challenging aggregated dataset show that the distilled lightweight model attains accuracy close to the teacher while markedly reducing parameters and inference latency, underscoring its practicality for clinical deployment.
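A class-aware, region-normalized distillation loss of the kind MS-SARD describes can be sketched as follows; the shapes, weighting scheme, and function name are simplifying assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def region_distill_loss(student, teacher, class_mask, class_weights):
    """Class-aware feature distillation with per-region normalization.

    student, teacher: (C, H, W) feature maps at one scale.
    class_mask: (H, W) integer label map; class_weights: {class: weight},
    where small but clinically important classes get larger weights.
    """
    loss = 0.0
    for cls, w in class_weights.items():
        region = class_mask == cls
        if not region.any():
            continue
        # Mean over the region normalizes away its size, so small
        # structures are not drowned out by large background areas.
        loss += w * ((student[:, region] - teacher[:, region]) ** 2).mean()
    return loss

# Toy example: a uniform teacher-student gap of 1 in both regions.
s, t = np.zeros((2, 2, 4)), np.ones((2, 2, 4))
mask = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
loss = region_distill_loss(s, t, mask, {0: 1.0, 1: 2.0})  # 1*1 + 2*1 = 3.0
```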
https://arxiv.org/abs/2601.08301
While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
https://arxiv.org/abs/2601.08267
Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating Vision Transformer (ViT), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Evaluated comprehensively across diverse datasets including APTOS 2019 (trained), Messidor-2, RFMiD, EyePACS, and Messidor-1 (validated), the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods like RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.
https://arxiv.org/abs/2601.08240
Aggregating multi-site brain MRI data can enhance deep learning model training, but also introduces non-biological heterogeneity caused by site-specific variations (e.g., differences in scanner vendors, acquisition parameters, and imaging protocols) that can undermine generalizability. Recent retrospective MRI harmonization seeks to reduce such site effects by standardizing image style (e.g., intensity, contrast, noise patterns) while preserving anatomical content. However, existing methods often rely on limited paired traveling-subject data or fail to effectively disentangle style from anatomy. Furthermore, most current approaches address only single-sequence harmonization, restricting their use in real-world settings where multi-sequence MRI is routinely acquired. To this end, we introduce MMH, a unified framework for multi-site multi-sequence brain MRI harmonization that leverages biomedical semantic priors for sequence-aware style alignment. MMH operates in two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and (2) a target-specific fine-tuner that adapts globally aligned images to desired target domains. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, allowing explicit disentanglement of image styles from anatomy without requiring paired data. Evaluations on 4,163 T1- and T2-weighted MRIs demonstrate MMH's superior harmonization over state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.
https://arxiv.org/abs/2601.08193
Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7 to +2.5 points and mAP50 by +2.5 to +3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: this https URL
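The four-agent control flow can be sketched as a simple loop; the agent functions below are toy stubs (the real agents are VLM calls), and the single "laterality" error mode is only for illustration:

```python
# Toy stand-ins for the four agents; real versions are VLM calls.
def route(image, history):
    return f"describe {image} for a patient with {history}"

def retrieve(prompt):
    # Returns a draft report and one candidate bounding box.
    return "small lft pleural effusion", [(12, 40, 80, 120)]

def reflect(draft, boxes):
    # Flags a single toy error mode: a laterality typo.
    return ["laterality"] if "lft" in draft else []

def repair(draft, boxes, critique):
    return draft.replace("lft", "left"), boxes

def r4_pipeline(image, history, max_rounds=3):
    prompt = route(image, history)                     # Router
    draft, boxes = retrieve(prompt)                    # Retriever
    for _ in range(max_rounds):
        critique = reflect(draft, boxes)               # Reflector
        if not critique:                               # clean draft: stop
            break
        draft, boxes = repair(draft, boxes, critique)  # Repairer
    return draft, boxes

report, boxes = r4_pipeline("cxr.png", "3 days of cough")
```

The bounded reflect-repair loop is the point: each round either fixes a flagged error mode or terminates, so the output is never worse than the first draft under the critic's criteria.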
https://arxiv.org/abs/2601.08192
Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., the image-report and patch-word levels. Specifically, we improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives, and we introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
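The false-negative elimination step can be illustrated with a similarity-thresholded negative mask; the threshold value and function name are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def semantic_negative_mask(report_emb, threshold=0.9):
    """Exclude near-duplicate reports from the contrastive negative set.

    report_emb: (N, D) L2-normalized report embeddings. Returns an (N, N)
    boolean mask that is True where a pair may still serve as a negative.
    """
    sim = report_emb @ report_emb.T      # cosine similarity matrix
    mask = sim < threshold               # similar pairs are false negatives
    np.fill_diagonal(mask, False)        # each sample's own pair is positive
    return mask

# Reports 0 and 1 are semantically identical; report 2 is distinct.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mask = semantic_negative_mask(emb)
```

A contrastive loss would then sum over only the True entries of this mask, so two patients with near-identical reports no longer push each other's representations apart.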
https://arxiv.org/abs/2601.08165
The development of robust artificial intelligence models for histopathology diagnosis is severely constrained by the scarcity of expert-annotated lesion data, particularly for rare pathologies and underrepresented disease subtypes. While data augmentation offers a potential solution, existing methods fail to generate sufficiently realistic lesion morphologies that preserve the complex spatial relationships and cellular architectures characteristic of histopathological tissues. Here we present PathoGen, a diffusion-based generative model that enables controllable, high-fidelity inpainting of lesions into benign histopathology images. Unlike conventional augmentation techniques, PathoGen leverages the iterative refinement process of diffusion models to synthesize lesions with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. We validate PathoGen across four diverse datasets representing distinct diagnostic challenges: kidney, skin, breast, and prostate pathology. Quantitative assessment confirms that PathoGen outperforms state-of-the-art generative baselines, including conditional GAN and Stable Diffusion, in image fidelity and distributional similarity. Crucially, we show that augmenting training sets with PathoGen-synthesized lesions enhances downstream segmentation performance compared to traditional geometric augmentations, particularly in data-scarce regimes. Moreover, by simultaneously generating realistic morphology and pixel-level ground truth, PathoGen effectively overcomes the manual-annotation bottleneck. This approach offers a scalable pathway for developing generalizable medical AI systems despite limited expert-labeled data.
https://arxiv.org/abs/2601.08127
Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on this https URL.
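The idea behind WT-Aug, perturbing the frequency components of extracted features, can be illustrated with a single-level 1-D Haar transform; this hand-rolled transform and scalar perturbation are simplifications for illustration, not the paper's module:

```python
import numpy as np

def haar_augment(feat, high_scale):
    """Rescale the detail (high-frequency) band of a 1-D Haar transform.

    feat: 1-D feature vector of even length; high_scale=1.0 reproduces the
    input exactly, while other values perturb high-frequency content.
    """
    a = (feat[0::2] + feat[1::2]) / np.sqrt(2)   # approximation (low freq.)
    d = (feat[0::2] - feat[1::2]) / np.sqrt(2)   # detail (high freq.)
    d = d * high_scale                           # the frequency perturbation
    out = np.empty_like(feat, dtype=float)       # inverse Haar transform:
    out[0::2] = (a + d) / np.sqrt(2)
    out[1::2] = (a - d) / np.sqrt(2)
    return out

feat = np.array([1.0, 2.0, 3.0, 4.0])
identity = haar_augment(feat, 1.0)   # unchanged round trip
smoothed = haar_augment(feat, 0.0)   # details removed: pairwise means
```

Sampling `high_scale` randomly per feature map would yield the kind of frequency-domain diversity the abstract attributes to WT-Aug.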
https://arxiv.org/abs/2601.08078
Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state-space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language-model-based verification loop that filters generated forgeries based on consistency with the intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state-of-the-art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.
https://arxiv.org/abs/2601.08040
We develop a new statistical ideal observer model that performs holistic visual search (or gist) processing in part by placing thresholds on minimum extractable image features. In this model, the ideal observer reduces the number of free parameters, thereby shrinking the system. The applications of this novel framework are in medical image perception (optimizing imaging systems and algorithms), computer vision, performance benchmarking, and enabling feature selection and evaluation. Other applications include target detection and recognition in defense/security, as well as the evaluation of sensors and detectors.
https://arxiv.org/abs/2601.07982
Recent advances in large language models have enabled their application to a range of healthcare tasks. However, aligning LLMs with the nuanced demands of medical ethics, especially under complex real world scenarios, remains underexplored. In this work, we present MedES, a dynamic, scenario-centric benchmark specifically constructed from 260 authoritative Chinese medical, ethical, and legal sources to reflect the challenges in clinical decision-making. To facilitate model alignment, we introduce a guardian-in-the-loop framework that leverages a dedicated automated evaluator (trained on expert-labeled data and achieving over 97% accuracy within our domain) to generate targeted prompts and provide structured ethical feedback. Using this pipeline, we align a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experimental results, conducted entirely within the Chinese medical ethics context, demonstrate that our aligned model outperforms notably larger baselines on core ethical tasks, with observed improvements in both quality and composite evaluation metrics. Our work offers a practical and adaptable framework for aligning LLMs with medical ethics in the Chinese healthcare domain, and suggests that similar alignment pipelines may be instantiated in other legal and cultural environments through modular replacement of the underlying normative corpus.
https://arxiv.org/abs/2601.07954
The rapid evolution of Large Language Models (LLMs) has shifted focus from general-purpose capabilities to domain-specific expertise. However, adapting LLMs to specialized fields such as medicine presents two challenges: (1) the "Stability-Plasticity Dilemma", where the model must acquire complex clinical knowledge without suffering from catastrophic forgetting of general world knowledge; and (2) "Task Interference", where disparate sub-tasks, such as medical diagnosis, report summarization, and drug-drug interaction prediction, compete for limited low-rank parameter space. In this paper, we propose Med-MoE-LoRA, a novel framework that integrates Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to enable efficient multi-task domain adaptation, especially for medical scenarios. Drawing inspiration from recent advances, our framework employs an asymmetric expert distribution where deeper layers are equipped with a higher density of LoRA experts to capture complex semantic abstractions. We further introduce a "Knowledge-Preservation Plugin", inspired by LoRA MoE, to isolate and protect general-purpose reasoning. By utilizing soft merging with adaptive routing and rank-wise decoupling, Med-MoE-LoRA achieves superior performance on medical benchmarks while reducing interference. Experimental results demonstrate that our approach consistently outperforms standard LoRA and conventional MoE architectures across multiple clinical NLP tasks while retaining the model's general cognitive capabilities.
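A soft-merged LoRA mixture-of-experts forward pass of the kind described can be sketched in NumPy; the tiny dimensions, rank-1 experts, and uniform gating below are illustrative only, not Med-MoE-LoRA's configuration:

```python
import numpy as np

def lora_moe_forward(x, W, experts, gate_logits):
    """y = W @ x + sum_i g_i * B_i @ (A_i @ x), with softmax gate g.

    experts: list of (A, B) low-rank pairs, A: (r, d_in), B: (d_out, r);
    the frozen base weight W stays untouched, as in standard LoRA.
    """
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                      # soft-merge routing weights
    y = W @ x                            # frozen base projection
    for gi, (A, B) in zip(g, experts):
        y = y + gi * (B @ (A @ x))       # weighted low-rank expert update
    return y

# Two rank-1 experts, each specializing in one input dimension.
W = np.eye(2)
experts = [(np.array([[1.0, 0.0]]), np.array([[1.0], [0.0]])),
           (np.array([[0.0, 1.0]]), np.array([[0.0], [1.0]]))]
y = lora_moe_forward(np.array([2.0, 4.0]), W, experts, np.zeros(2))
```

The asymmetric expert distribution the abstract describes would simply instantiate a longer `experts` list at deeper layers while keeping this per-layer computation unchanged.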
https://arxiv.org/abs/2601.07935
The teleoperation of robotic hands is limited by the high costs of depth cameras and sensor gloves, commonly used to estimate relative hand joint positions (XYZ). We present THETA, a novel, cost-effective approach that uses three webcams for triangulation-based tracking to approximate the relative joint angles (theta) of human fingers. We also introduce a modified DexHand, a low-cost robotic hand from TheRobotStudio, to demonstrate THETA's real-time application. Data collection involved 40 distinct hand gestures using three 640x480p webcams arranged at 120-degree intervals, generating over 48,000 RGB images. Joint angles were manually determined by measuring the midpoints of the MCP, PIP, and DIP finger joints. Captured RGB frames were processed using a DeepLabV3 segmentation model with a ResNet-50 backbone for multi-scale hand segmentation. The segmented images were then HSV-filtered and fed into THETA's architecture, consisting of a MobileNetV2-based CNN classifier optimized for hierarchical spatial feature extraction and a 9-channel input tensor encoding multi-perspective hand representations. The classification model maps segmented hand views to discrete joint angles, achieving 97.18% accuracy, 98.72% recall, an F1 score of 0.9274, and a precision of 0.8906. In real-time inference, THETA captures simultaneous frames, segments hand regions, filters them, and compiles a 9-channel tensor for classification. Joint-angle predictions are relayed via serial to an Arduino, enabling the DexHand to replicate hand movements. Future research will increase dataset diversity, integrate wrist tracking, and apply computer vision techniques such as OpenAI-Vision. THETA potentially ensures cost-effective, user-friendly teleoperation for medical, linguistic, and manufacturing applications.
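Given triangulated 3-D positions for the measured joint midpoints, the flexion angle at a joint follows from the standard vector-angle formula; this is a generic sketch of that geometry, not the paper's pipeline:

```python
import numpy as np

def joint_angle(p_mcp, p_pip, p_dip):
    """Flexion angle in degrees at the PIP joint from three 3-D points."""
    u = np.asarray(p_mcp, dtype=float) - np.asarray(p_pip, dtype=float)
    v = np.asarray(p_dip, dtype=float) - np.asarray(p_pip, dtype=float)
    # Angle between the two bone segments meeting at the PIP joint.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

bent = joint_angle((0, 0, 0), (1, 0, 0), (1, 1, 0))      # right-angle flexion
straight = joint_angle((0, 0, 0), (1, 0, 0), (2, 0, 0))  # fully extended
```

The `np.clip` guards against floating-point values of `cos` marginally outside [-1, 1], which would otherwise make `arccos` return NaN for nearly straight fingers.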
https://arxiv.org/abs/2601.07768
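The 9-channel multi-perspective encoding described in the THETA abstract above can be sketched as channel-wise stacking of three synchronized views. This is a hedged illustration of the shape contract only: the tiny CNN below is a stand-in, not the paper's MobileNetV2 classifier (which would need its first convolution adapted to 9 input channels), and `num_angle_bins=40` is an assumption echoing the 40 collected gestures.

```python
import torch
import torch.nn as nn

def nine_channel_tensor(views):
    """Stack three synchronized (3, H, W) RGB views channel-wise into (9, H, W)."""
    assert len(views) == 3 and all(v.shape[0] == 3 for v in views)
    return torch.cat(list(views), dim=0)

class JointAngleClassifier(nn.Module):
    """Stand-in classifier: maps a 9-channel multi-view image to gesture logits."""
    def __init__(self, num_angle_bins=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(9, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling to (batch, 32, 1, 1)
        )
        self.head = nn.Linear(32, num_angle_bins)

    def forward(self, x):              # x: (batch, 9, H, W)
        return self.head(self.features(x).flatten(1))

# Three segmented, HSV-filtered 640x480 webcam frames (random stand-ins here).
views = [torch.rand(3, 480, 640) for _ in range(3)]
x = nine_channel_tensor(views).unsqueeze(0)   # (1, 9, 480, 640)
logits = JointAngleClassifier()(x)
print(logits.shape)  # torch.Size([1, 40])
```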
Fully convolutional networks have become the backbone of modern medical imaging due to their ability to learn multi-scale representations and perform end-to-end inference. Yet their potential for slice-to-volume reconstruction (SVR), the task of jointly estimating 3D anatomy and slice poses from misaligned 2D acquisitions, remains underexplored. We introduce a fast convolutional framework that fuses multiple orthogonal 2D slice stacks to recover coherent 3D structure and refines slice alignment through lightweight model-based optimization. Applied to fetal brain MRI, our approach reconstructs high-quality 3D volumes in under 10s, with 1s slice registration and accuracy on par with state-of-the-art iterative SVR pipelines, offering a substantial speedup over them. The framework uses non-rigid displacement fields to represent transformations, generalizing to other SVR problems like fetal body and placental MRI. Additionally, the fast inference time paves the way for real-time, scanner-side volumetric feedback during MRI acquisition.
https://arxiv.org/abs/2601.07519
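The fusion of orthogonal slice stacks mentioned above can be illustrated with a deliberately naive NumPy sketch: reorient each stack to a common (z, y, x) frame and average. Real SVR additionally estimates per-slice poses or displacement fields; this toy version assumes the stacks are already perfectly aligned, and the axis conventions below are assumptions for the example.

```python
import numpy as np

def fuse_orthogonal_stacks(axial, coronal, sagittal):
    """Fuse three orthogonal slice stacks into one volume by reorienting each
    to a shared (z, y, x) frame and averaging voxel-wise. Assumes stacks with
    slice axis first: axial (z, y, x), coronal (y, z, x), sagittal (x, z, y)."""
    vol_ax = axial                              # already (z, y, x)
    vol_co = np.transpose(coronal, (1, 0, 2))   # (y, z, x) -> (z, y, x)
    vol_sa = np.transpose(sagittal, (1, 2, 0))  # (x, z, y) -> (z, y, x)
    return (vol_ax + vol_co + vol_sa) / 3.0

# Sanity check: slicing a known volume along each axis and fusing recovers it.
vol = np.arange(27, dtype=float).reshape(3, 3, 3)  # synthetic (z, y, x) volume
axial    = vol.copy()                              # slice axis z
coronal  = np.transpose(vol, (2, 0, 1))[...]       # placeholder not used below
coronal  = np.transpose(vol, (1, 0, 2))            # slice axis y first
sagittal = np.transpose(vol, (2, 0, 1))            # slice axis x first
fused = fuse_orthogonal_stacks(axial, coronal, sagittal)
print(np.allclose(fused, vol))  # True
```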
In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher-student roles. Additionally, we introduce a multi-view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state-of-the-art performance on three 3D vessel segmentation benchmarks. The code repository address is this https URL
https://arxiv.org/abs/2601.07377
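The dynamic teacher-student role switching described in the DiCo abstract above can be sketched as follows. This is a hedged, simplified stand-in, not the paper's method: it uses a classification objective rather than 3D vessel segmentation, omits the multi-view integration and adversarial modules, and picks the teacher per step by labeled-batch loss (one plausible switching criterion, assumed here for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dynamic_co_training_step(model_a, model_b, labeled_x, labeled_y, unlabeled_x):
    """One step of dynamic co-training: whichever model currently fits the
    labeled batch better acts as teacher and supplies pseudo-labels on the
    unlabeled batch for the other. Unlike a fixed mean-teacher pairing,
    the roles can swap at every step."""
    with torch.no_grad():
        loss_a = F.cross_entropy(model_a(labeled_x), labeled_y)
        loss_b = F.cross_entropy(model_b(labeled_x), labeled_y)
    teacher, student = (model_a, model_b) if loss_a < loss_b else (model_b, model_a)
    with torch.no_grad():
        pseudo = teacher(unlabeled_x).argmax(dim=1)   # teacher's hard pseudo-labels
    sup_loss = F.cross_entropy(student(labeled_x), labeled_y)
    unsup_loss = F.cross_entropy(student(unlabeled_x), pseudo)
    return sup_loss + 0.5 * unsup_loss, teacher is model_a

torch.manual_seed(0)
net_a, net_b = nn.Linear(8, 2), nn.Linear(8, 2)   # toy stand-ins for the two networks
x_l, y_l = torch.randn(4, 8), torch.randint(0, 2, (4,))
x_u = torch.randn(6, 8)
loss, a_is_teacher = dynamic_co_training_step(net_a, net_b, x_l, y_l, x_u)
```

In the full method the student's combined loss would be backpropagated each step, so the better model at any moment guides the weaker one rather than a fixed teacher imposing its biases.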
When swimming at low Reynolds numbers, inertial effects are negligible and reciprocal movements cannot induce net motion. Instead, symmetry breaking is necessary to achieve net propulsion. Directed swimming can be supported by magnetic fields, which simultaneously provide a versatile means of remote actuation. Thus, we analyze the motion of a straight microswimmer composed of three magnetizable beads connected by two elastic links. The swimming mechanism is based on oriented external magnetic fields that oscillate in magnitude. Through induced reversible hysteretic collapse of the two segments of the swimmer, the two pairs of beads jump into contact and separate nonreciprocally. Due to higher-order hydrodynamic interactions, net displacement results after each cycle. Different microswimmers can be tuned to different driving amplitudes and frequencies, allowing for simultaneous independent control by just one external magnetic field. The swimmer geometry and magnetic field shape are optimized for maximum swimming speed using an evolutionary optimization strategy. Thanks to the simple working principle, an experimental realization of such a microrobot seems feasible and may open new approaches for microinvasive medical interventions such as targeted drug delivery.
https://arxiv.org/abs/2601.07370