Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic leukemia analysis requires White Blood Cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse, multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome these dataset challenges, we present a large-scale WBC dataset named the Large Leukemia Dataset (LLD) and novel methods for detecting WBCs together with their attributes. Our contribution is threefold. First, we present a large-scale leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, using multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnostic explainability and medical-expert acceptance, each leukemia cell is annotated at 100x with seven morphological attributes, ranging from Cell Size to Nuclear Shape. Second, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnostic explainability to overcoming domain-shift challenges, the presented datasets can serve many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: this https URL
https://arxiv.org/abs/2504.02602
In recent years, deep learning methods such as convolutional neural networks (CNNs) and transformers have made significant progress in CT multi-organ segmentation. However, CT multi-organ segmentation methods based on masked image modeling (MIM) remain very limited. Although methods that use MAE for the CT multi-organ segmentation task already exist, we believe they do not identify the areas that are hardest to reconstruct. To this end, we propose a MIM self-training framework with hard patches mining masked autoencoders for CT multi-organ segmentation (selfMedHPM). The method performs ViT self-pretraining on the training set of the target data and introduces an auxiliary loss predictor, which first predicts the patch loss and then determines the location of the next mask. SelfMedHPM outperforms various competitive methods in abdominal CT multi-organ segmentation and whole-body CT multi-organ segmentation. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for abdominal multi-organ segmentation and the SinoMed Whole Body (SMWB) dataset for whole-body multi-organ segmentation.
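The core idea of hard patches mining can be sketched as an auxiliary predictor scoring each patch's expected reconstruction loss and masking the highest-scoring ones next. The function names and the toy variance-based "predictor" below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of hard-patch selection for masked image modeling:
# an auxiliary predictor estimates per-patch reconstruction loss, and the
# patches with the highest predicted loss are chosen for masking.

def select_hard_patches(patch_features, predict_loss, mask_ratio=0.75):
    """Return indices of the patches predicted to be hardest to reconstruct."""
    scores = [predict_loss(f) for f in patch_features]
    n_mask = int(len(patch_features) * mask_ratio)
    # Sort patch indices by predicted loss, descending; mask the hardest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_mask])

# Toy predictor: pretend patch "difficulty" is the feature's variance.
def toy_predictor(feature):
    mean = sum(feature) / len(feature)
    return sum((x - mean) ** 2 for x in feature) / len(feature)

patches = [[0.0, 0.0, 0.0], [1.0, -1.0, 1.0], [0.1, 0.1, 0.1], [2.0, -2.0, 0.0]]
print(select_hard_patches(patches, toy_predictor, mask_ratio=0.5))  # → [1, 3]
```

In a real MIM pipeline the predictor would be a learned head trained against the observed patch losses, replacing the hand-written variance heuristic here.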
https://arxiv.org/abs/2504.02524
Ultrasound is a widely accessible and cost-effective medical imaging tool commonly used for prenatal evaluation of the fetal brain. However, it has limitations, particularly in the third trimester, where the complexity of the fetal brain requires high image quality for extracting quantitative data. In contrast, magnetic resonance imaging (MRI) offers superior image quality and tissue differentiation but is less available, expensive, and requires time-consuming acquisition. Thus, transforming ultrasound images into an MRI-mimicking display may be advantageous and allow better presentation of tissue anatomy. To address this goal, we examined the use of artificial intelligence, implementing a diffusion model renowned for generating high-quality images. The proposed method, termed "Dual Diffusion Imposed Correlation" (DDIC), leverages a diffusion-based translation methodology, assuming a shared latent space between the ultrasound and MRI domains. The model was trained using the "HC18" dataset for ultrasound and the "CRL fetal brain atlas" along with the "FeTA" dataset for MRI. The generated pseudo-MRI images provide notable improvements in visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, characterized by enhanced contrast clarity. Improvements were demonstrated in mutual information, peak signal-to-noise ratio, Fréchet inception distance, and contrast-to-noise ratio. Findings from these evaluations indicate statistically significant superior performance of DDIC compared to other translation methodologies. In addition, a medical opinion test was conducted with five gynecologists; the results demonstrated display improvement in 81% of the tested images. In conclusion, the presented pseudo-MRI images hold the potential for streamlining diagnosis and enhancing clinical outcomes through improved representation.
https://arxiv.org/abs/2504.02408
The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, including continuous pre-training (CPT) and supervised fine-tuning (SFT), leveraging our curated anesthesiology-related dataset. Additionally, we investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code, at this https URL.
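Best-of-N sampling, one of the test-time techniques evaluated above, amounts to drawing several candidate answers and keeping the one a scoring function rates highest. The toy sampler and scorer below are stand-ins, not AnesBench components:

```python
# Minimal sketch of Best-of-N sampling: draw N candidates from a model
# and return the highest-scoring one. "Model" and "scorer" are toys.
import random

def best_of_n(sample_answer, score, n=8, seed=0):
    rng = random.Random(seed)
    candidates = [sample_answer(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy model: picks among canned answers; toy scorer: fixed preferences
# (a real scorer might be a reward model or self-consistency vote).
answers = ["propofol", "ketamine", "midazolam"]
def toy_sampler(rng):
    return rng.choice(answers)

def toy_score(ans):
    return {"propofol": 0.9, "ketamine": 0.4, "midazolam": 0.2}[ans]

print(best_of_n(toy_sampler, toy_score, n=8))
```

With a fixed seed the selection is deterministic; increasing `n` raises the chance that a high-scoring candidate appears among the draws.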
https://arxiv.org/abs/2504.02404
The deployment of foundation models for medical imaging has demonstrated considerable success. However, the training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants of these foundation models have been obtained, their performance is constrained by limited model capacity and suboptimal training strategies. To achieve an improved tradeoff between complexity and performance, we propose a new framework that improves the performance of low-complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal of effectively bridging the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2\% in Dice coefficient compared to simple distillation.
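The multi-teacher distillation idea can be sketched as pulling the student's output toward a (possibly weighted) combination of the teachers' outputs. The MSE objective, the uniform weighting, and the toy vectors standing in for MedSAM/RAD-DINO/MedCLIP features are illustrative assumptions, not the paper's loss:

```python
# Hedged sketch of agglomerating several teachers into one student:
# the student's output is regressed toward a weighted mean of the
# teachers' outputs.

def multi_teacher_distill_loss(student_out, teacher_outs, weights=None):
    """Mean squared error between student output and a weighted teacher mean."""
    if weights is None:
        weights = [1.0 / len(teacher_outs)] * len(teacher_outs)
    target = [
        sum(w * t[i] for w, t in zip(weights, teacher_outs))
        for i in range(len(student_out))
    ]
    return sum((s - g) ** 2 for s, g in zip(student_out, target)) / len(student_out)

student = [0.5, 0.5]
teachers = [[1.0, 0.0], [0.0, 1.0]]  # two "foundation models" that disagree
print(multi_teacher_distill_loss(student, teachers))  # → 0.0
```

A student sitting exactly at the teachers' mean incurs zero loss; in practice, per-task weights would let stronger teachers dominate where they specialize.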
https://arxiv.org/abs/2504.02351
Image segmentation is critical for applications such as medical imaging, augmented reality, and video surveillance. However, segmentation models often lack robustness, making them vulnerable to adversarial perturbations from subtle image distortions. In this work, we propose SegRMT, a metamorphic testing approach that leverages genetic algorithms (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Using the Cityscapes dataset, our method generates adversarial examples that effectively challenge the DeepLabV3 segmentation model. Our experiments show that SegRMT reduces DeepLabV3's mean Intersection over Union (mIoU) to 6.4%, outperforming other adversarial baselines that decrease mIoU to between 8.5% and 21.7%. Furthermore, when used for adversarial training, SegRMT boosts model performance, achieving mIoU improvements up to 73% on dedicated adversarial datasets and increasing cross-adversarial mIoU to 53.8%, compared to only 2%-10% for other methods. These findings demonstrate that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it a valuable tool for ensuring reliable performance in safety-critical applications.
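The fidelity constraint described above can be sketched as a genetic-algorithm fitness that rejects any transformed image whose PSNR falls below the threshold and otherwise rewards mIoU degradation. The fitness form and helper names are assumptions for illustration, not SegRMT's actual code:

```python
# Sketch of a PSNR-constrained adversarial fitness: candidates below the
# fidelity threshold are infeasible; feasible ones score higher the more
# they degrade the segmentation model's mIoU.
import math

def psnr(original, distorted, peak=255.0):
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

def fitness(original, transformed, model_miou, psnr_threshold=30.0):
    """Reject candidates below the PSNR threshold; otherwise reward mIoU drop."""
    if psnr(original, transformed) < psnr_threshold:
        return -1.0  # infeasible individual
    return 1.0 - model_miou(transformed)  # lower mIoU => higher fitness

img = [100.0, 120.0, 140.0]
subtle = [101.0, 121.0, 139.0]      # small perturbation, high PSNR
print(round(psnr(img, subtle), 1))  # → 48.1
print(fitness(img, subtle, model_miou=lambda x: 0.064))  # → 0.936
```

A GA would evolve sequences of spatial/spectral transforms under this fitness; the perturbation stays visually subtle by construction because low-PSNR candidates never survive selection.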
https://arxiv.org/abs/2504.02335
Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self-Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 over our baseline model's AUC of 0.66, corresponding to an approximate 233% relative improvement over random guessing (AUC = 0.5).
https://arxiv.org/abs/2504.02277
Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at \href{this https URL}{this link}.
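The rejection-sampling step of the data-synthesis method can be sketched as: sample candidate reasoning chains and keep only those whose final answer matches the reference. The toy generator below is a stand-in for a model, not GMAI-VL-R1's pipeline:

```python
# Illustrative sketch of reasoning-data synthesis by rejection sampling:
# a sampled chain-of-thought is kept only if its final answer agrees
# with the reference answer.

def rejection_sample(generate, reference_answer, n_tries=16):
    """Return accepted (reasoning, answer) samples out of n_tries draws."""
    accepted = []
    for _ in range(n_tries):
        reasoning, answer = generate()
        if answer == reference_answer:  # reject mismatching chains
            accepted.append((reasoning, answer))
    return accepted

# Toy generator cycling through canned (reasoning, answer) outputs.
outputs = [("step A -> step B", "pneumonia"), ("step C", "fracture"),
           ("step A -> step D", "pneumonia")]
state = {"i": 0}
def toy_generate():
    out = outputs[state["i"] % len(outputs)]
    state["i"] += 1
    return out

kept = rejection_sample(toy_generate, "pneumonia", n_tries=3)
print(len(kept))  # → 2
```

Only answer-consistent chains survive, so the synthesized dataset contains step-by-step reasoning that at least reaches the verified conclusion.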
https://arxiv.org/abs/2504.01886
Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely the Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via vision-language models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision-language models to generate accurate and proper prompts for each dermoscopic image, which helps generate synthetic images that improve the representation of underrepresented groups (patients, diseases, etc.) in the highly imbalanced datasets used for clinical diagnosis. Our extensive experimentation shows that large vision-language models provide much more insightful representations, enabling DermDiT to generate high-quality images. Our code is available at this https URL
https://arxiv.org/abs/2504.01838
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at this https URL.
https://arxiv.org/abs/2504.01561
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at this https URL
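The label-selection rule in the co-training loop described above can be sketched as: use the ground-truth mask when one exists, otherwise fall back to the teacher's pseudo-label. The dict-based "masks" are toy stand-ins for segmentation maps, not the paper's data structures:

```python
# Minimal sketch of semi-supervised target selection in a teacher-student
# framework: real annotations take priority; unlabeled images get the
# teacher's pseudo-label instead.

def training_target(image_id, ground_truth, teacher_predict):
    """Prefer a real annotation; otherwise use the teacher's pseudo-label."""
    if image_id in ground_truth:
        return ground_truth[image_id], "label"
    return teacher_predict(image_id), "pseudo-label"

ground_truth = {"img_001": [0, 1, 1, 0]}      # only one image is annotated
teacher_predict = lambda _id: [0, 1, 0, 0]    # the teacher's current guess

print(training_target("img_001", ground_truth, teacher_predict))  # → ([0, 1, 1, 0], 'label')
print(training_target("img_002", ground_truth, teacher_predict))  # → ([0, 1, 0, 0], 'pseudo-label')
```

In the multi-round strategy, the teacher's `teacher_predict` would be refreshed between rounds so pseudo-labels improve iteratively.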
https://arxiv.org/abs/2504.01547
Using nearest neighbor interpolation to avoid the risk of undefined categorical labels overlooks the risk of exacerbating pixel-level annotation errors during data augmentation. To avoid both risks simultaneously, the author modified the data transformation functions of convolutional neural networks: a modified geometric transformation function improves the quality of augmented data by removing the reliance on nearest neighbor interpolation, and a mean-based class filtering mechanism handles undefined categorical labels under alternative interpolation algorithms. Experiments on semantic segmentation tasks using three medical image datasets demonstrated both qualitative and quantitative improvements with the alternative interpolation algorithms.
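The underlying idea can be sketched as: interpolate per-class indicator channels with a smooth method and assign each output pixel the class with the highest interpolated weight, so no undefined fractional label ever appears. The one-dimensional linear interpolation and argmax-style filtering below are simplifying assumptions, not the paper's exact mechanism:

```python
# Hedged sketch: smooth (linear) interpolation of one-hot label channels,
# followed by picking the class with the greatest weight, keeps resized
# label maps categorical without nearest-neighbor artifacts.

def linear_sample(channel, x):
    """Linearly interpolate a 1-D channel at fractional position x."""
    lo = int(x)
    hi = min(lo + 1, len(channel) - 1)
    frac = x - lo
    return channel[lo] * (1 - frac) + channel[hi] * frac

def resize_labels(labels, new_len, num_classes):
    onehot = [[1.0 if l == c else 0.0 for l in labels] for c in range(num_classes)]
    scale = (len(labels) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        x = i * scale
        weights = [linear_sample(ch, x) for ch in onehot]
        out.append(max(range(num_classes), key=lambda c: weights[c]))
    return out

row = [0, 0, 1, 1]                  # a row of categorical labels
print(resize_labels(row, 7, 2))     # → [0, 0, 0, 0, 1, 1, 1]
```

Every output value is a valid class index, whereas naively applying linear interpolation to the raw label row would yield undefined fractional labels such as 0.5.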
https://arxiv.org/abs/2504.01527
Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.
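The basic mask-to-box step that WeakBox automates can be sketched as taking the bounding rectangle of a coarse binary mask's foreground (the actual MM2B transformation is more elaborate; this shows only the rough conversion):

```python
# Generic sketch of converting a coarse binary mask into a box prompt
# of the form (x_min, y_min, x_max, y_max).

def mask_to_box(mask):
    """Return the bounding box of the foreground pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not xs:
        return None  # empty mask: no box prompt
    return (min(xs), min(ys), max(xs), max(ys))

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(mask_to_box(mask))  # → (1, 1, 2, 2)
```

Such a box can then serve as an automatic prompt for SAM, sidestepping manual box annotation for each image.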
https://arxiv.org/abs/2504.01452
Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully consider the characteristics of domain-specific data (e.g., the fine-grained nature of biological data) and so limits model capability, while mostly losing CLIP's original ability in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between the feature distributions of image-text pairs instead of the original [cls] token, which can capture the rich yet effective information inherent in image-text pairs as powerful representations and so better cope with the fine-grained nature of biological data. In particular, DALIP efficiently approximates feature distributions via their first- and second-order statistics, and presents a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (i.e., specific data in the biological domain) comprising 10M plant samples together with 3M general-domain samples (namely PlantMix-13M), constructed according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain, while generalizing well to the remote sensing and medical imaging domains. Besides, our PlantMix-13M dataset further boosts DALIP's performance in the plant domain, while preserving the model's ability in the general domain.
https://arxiv.org/abs/2504.01386
Foundation medical segmentation models, with MedSAM being the most popular, have achieved promising performance across organs and lesions. However, MedSAM still suffers from compromised performance on specific lesions with intricate structures and appearance, as well as from perturbations induced by bounding-box prompts. Although current test-time adaptation (TTA) methods for medical image segmentation may tackle this issue, partial (e.g., batch normalization) or whole parametric updates restrict their effectiveness due to limited update signals or catastrophic forgetting in large models. Meanwhile, these approaches ignore the computational complexity during adaptation, which is particularly significant for modern foundation models. To this end, our theoretical analyses reveal that directly refining image embeddings is feasible for approaching the same goal as parametric updates under the MedSAM architecture, which enables us to achieve high computational efficiency and segmentation performance without the risk of catastrophic forgetting. Under this framework, we propose to maximize the factorized conditional probabilities of the posterior prediction probability using a distribution-approximated latent conditional random field loss combined with an entropy minimization loss. Experiments show that we achieve about 3\% Dice score improvements across three datasets while reducing computational complexity by over 7 times.
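The entropy-minimization component mentioned above penalizes uncertain predictions: adaptation drives the entropy of each predicted class distribution down, pushing predictions toward confidence. A pure-Python illustration (the paper pairs this with the latent conditional random field loss, which is not shown here):

```python
# Shannon entropy of a discrete prediction; entropy minimization uses
# this quantity (summed over pixels) as a test-time adaptation loss.
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.98, 0.01, 0.01]
uncertain = [0.34, 0.33, 0.33]
print(round(entropy(confident), 3))
print(round(entropy(uncertain), 3))  # higher entropy => stronger penalty
```

In the embedding-refinement setting, gradients of this loss would flow into the image embedding itself rather than into model parameters, which is what avoids catastrophic forgetting.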
https://arxiv.org/abs/2504.02008
In Question Answering (QA), Retrieval Augmented Generation (RAG) has revolutionized performance in various domains. However, how to effectively capture multi-document relationships, particularly critical for biomedical tasks, remains an open question. In this work, we propose a novel method that utilizes propositional claims to construct a local knowledge graph from retrieved documents. Summaries are then derived via layerwise summarization from the knowledge graph to contextualize a small language model to perform QA. We achieved comparable or superior performance with our method over RAG baselines on several biomedical QA benchmarks. We also evaluated each individual step of our methodology over a targeted set of metrics, demonstrating its effectiveness.
https://arxiv.org/abs/2504.01309
According to different studies, the identification of dermatological disease is an important problem in Mexico. Several works in the literature use datasets from different repositories without studying the behavior of the data, especially in the medical imaging domain. In this work, we propose a methodology to preprocess the dermaMNIST dataset in order to improve its quality for the classification stage, where we use lightweight convolutional neural networks. In our results, we reduce the number of instances used for neural network training while obtaining performance similar to models such as ResNet.
https://arxiv.org/abs/2504.01208
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess LLMs' ability to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
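The evaluation MedDistractQA performs can be sketched as scoring a model on clean questions and on the same questions with a distracting sentence appended, then comparing accuracies. The keyword-matching "model" below exists only to make the harness runnable and is not a real LLM:

```python
# Toy sketch of measuring accuracy degradation under distraction:
# the same questions are scored with and without an appended
# distractor sentence.

def accuracy(model, questions):
    return sum(model(q["text"]) == q["answer"] for q in questions) / len(questions)

def with_distraction(q, distractor):
    return {"text": q["text"] + " " + distractor, "answer": q["answer"]}

def toy_model(text):
    # Naive model that latches onto the last clinical keyword it sees.
    last = None
    for word in text.split():
        if word in ("anemia", "sepsis"):
            last = word
    return last

clean = [
    {"text": "fatigue pallor low hemoglobin suggests anemia", "answer": "anemia"},
    {"text": "fever hypotension infection suggests sepsis", "answer": "sepsis"},
]
distractor = "her brother once read an article about sepsis"
noisy = [with_distraction(q, distractor) for q in clean]

print(accuracy(toy_model, clean))  # → 1.0
print(accuracy(toy_model, noisy))  # → 0.5
```

The distractor mentions a clinical term in a non-clinical context, exactly the confusion pattern the benchmark injects; the accuracy gap quantifies the model's (lack of) resilience.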
https://arxiv.org/abs/2504.01201
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or ``thinking paths'', which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code will be publicly available.
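Tracing a "thinking path" through a knowledge graph can be sketched as a breadth-first search from a question entity to the answer entity. The tiny graph and its clinical relations below are invented examples, not MedReason's KG:

```python
# Hedged sketch of KG path tracing: BFS returns one shortest entity
# path connecting a question element to the answer.
from collections import deque

def trace_path(graph, start, goal):
    """Return one shortest entity path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = {
    "chest pain": ["myocardial infarction", "gerd"],
    "myocardial infarction": ["troponin elevation"],
    "gerd": ["antacid response"],
    "troponin elevation": ["acute coronary syndrome"],
}
print(trace_path(graph, "chest pain", "acute coronary syndrome"))
# → ['chest pain', 'myocardial infarction', 'troponin elevation', 'acute coronary syndrome']
```

Each recovered path is a candidate reasoning chain, which the pipeline would then validate against clinical logic before turning it into a step-by-step explanation.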
https://arxiv.org/abs/2504.00993
The scarcity of accessible, compliant, and ethically sourced data presents a considerable challenge to the adoption of artificial intelligence (AI) in sensitive fields like healthcare, finance, and biomedical research. Furthermore, access to unrestricted public datasets is increasingly constrained due to rising concerns over privacy, copyright, and competition. Synthetic data has emerged as a promising alternative, and diffusion models -- a cutting-edge generative AI technology -- provide an effective solution for generating high-quality and diverse synthetic data. In this paper, we introduce a novel federated learning framework for training diffusion models on decentralized private datasets. Our framework leverages personalization and the inherent noise in the forward diffusion process to produce high-quality samples while ensuring robust differential privacy guarantees. Our experiments show that our framework outperforms non-collaborative training methods, particularly in settings with high data heterogeneity, and effectively reduces biases and imbalances in synthetic data, resulting in fairer downstream models.
https://arxiv.org/abs/2504.00952