Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved classification accuracy on rotated images across all 16 test datasets, in both 2D and 3D, all while increasing efficiency with fewer parameters and a reduced memory footprint. The code is available at this https URL.
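The kernel idea lends itself to a compact sketch: if weights depend only on the distance from the kernel center, the kernel is (discretely) rotation-invariant and is parameterized by one value per radial ring, which is also where the parameter savings come from. A minimal NumPy illustration, not the authors' implementation; the nearest-ring binning is an assumption:

```python
import numpy as np

def symmetric_kernel(ring_weights):
    """Expand per-ring parameters into a (2r+1) x (2r+1) kernel."""
    r = len(ring_weights) - 1
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    dist = np.sqrt(xx**2 + yy**2)
    ring_idx = np.clip(np.round(dist).astype(int), 0, r)  # nearest-ring binning
    return np.take(np.asarray(ring_weights), ring_idx)

K = symmetric_kernel([1.0, 0.5, -0.2])   # 3 parameters -> 5x5 kernel
assert np.allclose(K, np.rot90(K))       # unchanged by 90-degree rotation
print(K)
```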
https://arxiv.org/abs/2501.09753
The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
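One reading of the synonym-marginalization step: score a candidate HPO term by averaging a similarity over all of its synonyms rather than matching only a single preferred label. A toy sketch under that assumption; the miniature ontology and string similarity stand in for the pipeline's real lexicon and scorer:

```python
from difflib import SequenceMatcher

HPO = {  # hypothetical miniature ontology: term id -> synonyms
    "HP:0001945": ["fever", "pyrexia", "hyperthermia"],
    "HP:0002315": ["headache", "cephalalgia", "head pain"],
}

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def normalize(mention):
    # Marginalize the match score over each term's synonyms.
    scores = {
        term: sum(sim(mention, s) for s in syns) / len(syns)
        for term, syns in HPO.items()
    }
    return max(scores, key=scores.get)

print(normalize("high fever"))  # -> HP:0001945
```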
https://arxiv.org/abs/2501.09744
As artificial intelligence (AI) becomes increasingly embedded in healthcare delivery, this chapter explores the critical aspects of developing reliable and ethical Clinical Decision Support Systems (CDSS). Beginning with the fundamental transition from traditional statistical models to sophisticated machine learning approaches, this work examines rigorous validation strategies and performance assessment methods, including the crucial role of model calibration and decision curve analysis. The chapter emphasizes that creating trustworthy AI systems in healthcare requires more than just technical accuracy; it demands careful consideration of fairness, explainability, and privacy. The challenge of ensuring equitable healthcare delivery through AI is stressed, discussing methods to identify and mitigate bias in clinical predictive models. The chapter then delves into explainability as a cornerstone of human-centered CDSS. This focus reflects the understanding that healthcare professionals must not only trust AI recommendations but also comprehend their underlying reasoning. The discussion then advances to an analysis of privacy vulnerabilities in medical AI systems, from data leakage in deep learning models to sophisticated attacks against model explanations. The text explores privacy-preservation strategies such as differential privacy and federated learning, while acknowledging the inherent trade-offs between privacy protection and model performance. This progression, from technical validation to ethical considerations, reflects the multifaceted challenges of developing AI systems that can be seamlessly and reliably integrated into daily clinical practice while maintaining the highest standards of patient care and data protection.
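Decision curve analysis, mentioned above as a validation tool, reduces to the net benefit formula NB(p_t) = TP/N - FP/N * p_t / (1 - p_t), evaluated across threshold probabilities p_t. A small sketch on synthetic predictions (illustrative, not tied to the chapter's data):

```python
import numpy as np

def net_benefit(y_true, y_prob, p_t):
    """Net benefit of treating patients whose predicted risk exceeds p_t."""
    pred = y_prob >= p_t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = len(y_true)
    return tp / n - fp / n * p_t / (1 - p_t)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
prob = np.clip(y * 0.3 + rng.uniform(0, 0.7, 500), 0, 1)  # weakly informative model
for p_t in (0.1, 0.3, 0.5):
    print(f"p_t={p_t}: net benefit = {net_benefit(y, prob, p_t):.3f}")
```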
https://arxiv.org/abs/2501.09628
De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection, comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging roles of vision and language models within the pipeline, we evaluate the performance and recommend the best setup for the PHI detection task.
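The three pipeline components map naturally onto a skeleton like the following; the detector and extractor are stubbed out here (in the study these roles are filled by interchangeable vision and language models), and the PHI regexes are illustrative:

```python
import re
from typing import List

def detect_text_regions(image) -> List[tuple]:
    # Stub: a real system would run a text-detection model here.
    return [(10, 10, 200, 40)]

def extract_text(image, region) -> str:
    # Stub: a real system would run text recognition on the cropped region.
    return "DOE, JOHN 1985-03-12 MRN 0012345"

PHI_PATTERNS = {  # illustrative patterns, far from exhaustive
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "mrn": re.compile(r"\bMRN\s*\d+\b"),
    "name": re.compile(r"\b[A-Z]{2,},\s*[A-Z]{2,}\b"),
}

def analyze_phi(text: str) -> dict:
    return {kind: pat.findall(text) for kind, pat in PHI_PATTERNS.items()}

image = None  # placeholder for pixel data
for region in detect_text_regions(image):
    hits = analyze_phi(extract_text(image, region))
    print(region, {k: v for k, v in hits.items() if v})
```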
https://arxiv.org/abs/2501.09552
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy of 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiries, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated a significant potential to transform OMC. However, most studies have primarily focused on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the "inquiry" phase of the consultation process. This lack of focus has left the relationship between "inquiry" and "diagnosis" insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor-patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By inputting medical records into our patient simulator to simulate patient responses, we conduct extensive experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to Liebig's law: poor inquiry quality limits the effectiveness of diagnosis, regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across the four types for different models to explore the reasons behind their significant performance differences. We plan to open-source the weights and related code of our patient simulator at this https URL.
https://arxiv.org/abs/2501.09484
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
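Generating millions of negated captions is, at its core, a templating exercise over object annotations. A hedged sketch of the kind of construction involved; the templates below are illustrative, not the paper's:

```python
def negated_captions(present, absent):
    """Build captions asserting `present` objects while negating `absent` ones."""
    caps = []
    for a in present:
        for b in absent:
            caps.append(f"a photo of a {a} but no {b}")
            caps.append(f"a photo that includes a {a} and does not include a {b}")
    return caps

print(negated_captions(present=["dog"], absent=["leash", "frisbee"]))
```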
https://arxiv.org/abs/2501.09425
Electronic Health Record (EHR) tables pose unique challenges, among which is the presence of hidden contextual dependencies between medical features, compounded by high data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore the impact of prompt structure, instruction, context, and demonstrations on the task performance of two backbone LLMs, Llama2 and Meditron. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our study findings, we propose guidelines that we believe would help the design of LLM-based models to support health search.
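The two levers the study measures, feature selection and serialization, plus demonstration selection for in-context learning, can be sketched as a prompt builder. Field names and the template below are hypothetical:

```python
def serialize_record(record, selected_features):
    """Serialize only the selected EHR features into a bulleted block."""
    lines = [f"- {k}: {record[k]}" for k in selected_features if k in record]
    return "Patient record:\n" + "\n".join(lines)

def build_prompt(record, question, demos=()):
    # Optional in-context demonstrations: (record, question, answer) triples.
    parts = [f"{serialize_record(r, r.keys())}\nQ: {q}\nA: {a}" for r, q, a in demos]
    parts.append(
        f"{serialize_record(record, ['age', 'diagnosis', 'hemoglobin'])}"
        f"\nQ: {question}\nA:"
    )
    return "\n\n".join(parts)

rec = {"age": 63, "sex": "F", "diagnosis": "sepsis", "hemoglobin": 9.1}
print(build_prompt(rec, "What is the hemoglobin value?"))
```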
https://arxiv.org/abs/2501.09384
Medicinal plants have been a key component in producing traditional and modern medicines, especially in the field of Ayurveda, an ancient Indian medical system. Producing these medicines and collecting and extracting the right plants is a crucial step, as some plants are visually similar. Separating medicinal plants from non-medicinal ones requires human expert intervention. To solve the issue of accurate plant identification and reduce the need for a human expert in the collection process, employing computer vision methods would be efficient and beneficial. In this paper, we propose a model that solves such issues. The proposed model is a custom convolutional neural network (CNN) architecture with 6 convolution layers, max-pooling layers, and dense layers. The model was tested on three different datasets: the Indian Medicinal Leaves Image Dataset, the MED117 Medicinal Plant Leaf Dataset, and a dataset self-curated by the authors. The proposed model achieved respective accuracies of 99.5%, 98.4%, and 99.7% using various optimizers, including Adam, RMSprop, and SGD with momentum.
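A sketch of a CNN matching the stated architecture (6 convolution layers with max-pooling, followed by dense layers); channel widths, input resolution, and class count are assumptions, as the abstract does not specify them:

```python
import torch
import torch.nn as nn

class PlantCNN(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        chans = [3, 32, 64, 128, 128, 256, 256]   # 6 conv layers
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 3 * 3, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = PlantCNN()
logits = model(torch.randn(2, 3, 224, 224))  # 224 halved six times -> 3x3 maps
print(logits.shape)                          # torch.Size([2, 40])
```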
https://arxiv.org/abs/2501.09363
Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
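Hierarchical contrastive alignment can be illustrated as a standard InfoNCE loss applied at several feature levels and summed with level weights. How HiCA actually forms its levels is specific to the paper; this only sketches the multi-level objective:

```python
import torch
import torch.nn.functional as F

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def hierarchical_loss(img_levels, txt_levels, weights=(1.0, 0.5, 0.25)):
    # Align visual and textual representations at each level, then combine.
    return sum(w * info_nce(i, t)
               for w, i, t in zip(weights, img_levels, txt_levels))

imgs = [torch.randn(8, 256) for _ in range(3)]   # e.g., coarse-to-fine levels
txts = [torch.randn(8, 256) for _ in range(3)]
print(hierarchical_loss(imgs, txts))
```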
https://arxiv.org/abs/2501.09294
Vision Transformers (ViTs) are increasingly being adopted in various sensitive vision applications such as medical diagnosis and facial recognition. To improve the interpretability of such models, many approaches attempt to forward-align them with carefully annotated abstract, human-understandable semantic entities - concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose a novel Concept Representation Alignment Module (CRAM) which learns both scale and position-aware representations from multi-scale feature pyramids and patch representations respectively. CRAM further aligns these representations with concept annotations through an attention matrix. The proposed CRAM module improves the predictive performance of ViT architectures and also provides accurate and robust concept explanations, as demonstrated on five datasets - including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and two real-world datasets (AWA2, KITS).
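The alignment step the abstract describes, an attention matrix between patch representations and concept annotations, might look roughly like this; the dimensions and the binary concept supervision are assumptions, and the multi-scale pyramid branch is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, C, D = 4, 196, 10, 384                 # batch, patches, concepts, embed dim
patches = torch.randn(B, N, D)                # ViT patch representations
concepts = nn.Parameter(torch.randn(C, D))    # learnable concept embeddings

scores = patches @ concepts.t() / D**0.5      # (B, N, C) patch-concept affinities
attn = torch.softmax(scores, dim=1)           # attention over patches per concept
concept_logits = (attn * scores).sum(dim=1)   # (B, C) pooled concept evidence

targets = torch.randint(0, 2, (B, C)).float() # per-image concept annotations
loss = F.binary_cross_entropy_with_logits(concept_logits, targets)
print(loss)
```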
https://arxiv.org/abs/2501.09221
Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the advanced reasoning required for complex clinical scenarios, such as differential diagnosis or personalized treatment suggestions. We proposed FineMedLM-o1, which leverages high-quality synthetic medical data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduced Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also proposed a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.
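The DPO stage has a standard closed form: -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w and y_l are the preferred and rejected responses. A minimal sketch with precomputed sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization on sequence log-probabilities."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Hypothetical log-probs of chosen (w) and rejected (l) answers under the
# policy being tuned and the frozen reference model.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss)
```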
https://arxiv.org/abs/2501.09213
Vision foundation models have achieved remarkable progress across various image analysis tasks. In the image segmentation task, foundation models like the Segment Anything Model (SAM) enable generalizable zero-shot segmentation through user-provided prompts. However, SAM, primarily trained on natural images, lacks the domain-specific expertise required for medical imaging. This limitation poses challenges when applying SAM to medical image segmentation, including the need for extensive fine-tuning on specialized medical datasets and a dependency on manual prompts, which are both labor-intensive and require intervention from medical experts. This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM), a novel method designed to adapt the advanced Segment Anything Model 2 (SAM2) for 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of SAM2 to enable few-shot adaptation, leveraging a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation, without requiring model fine-tuning. To handle the volumetric nature of medical images, we incorporate a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging datasets and compare it with supervised learning methods, zero-shot SAM approaches, and fine-tuned medical SAM methods. Results show that FATE-SAM delivers robust and accurate segmentation while eliminating the need for large annotated datasets and expert intervention. FATE-SAM provides a practical, efficient solution for medical image segmentation, making it more accessible for clinical applications.
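The Volumetric Consistency mechanism operates inside the model; purely as an illustration of the underlying idea, here is post-hoc smoothing of slice-wise probability maps along the depth axis:

```python
import numpy as np

def volumetric_smooth(prob_volume, window=3):
    """Average each slice's foreground probabilities with its depth neighbors."""
    d = prob_volume.shape[0]
    out = np.empty_like(prob_volume)
    for z in range(d):
        lo, hi = max(0, z - window // 2), min(d, z + window // 2 + 1)
        out[z] = prob_volume[lo:hi].mean(axis=0)
    return out

vol = np.random.rand(32, 64, 64)          # (slices, H, W) probabilities
mask = volumetric_smooth(vol) > 0.5       # spatially more coherent 3D mask
print(mask.shape, mask.dtype)
```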
https://arxiv.org/abs/2501.09138
Medical images and reports offer invaluable insights into patient health. The heterogeneity and complexity of these data hinder effective analysis. To bridge this gap, we investigate contrastive learning models for cross-domain retrieval, which associates medical images with their corresponding clinical reports. This study benchmarks the robustness of four state-of-the-art contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We introduce an occlusion retrieval task to evaluate model performance under varying levels of image corruption. Our findings reveal that all evaluated models are highly sensitive to out-of-distribution data, as evidenced by the proportional decrease in performance with increasing occlusion levels. While MedCLIP exhibits slightly more robustness, its overall performance remains significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a general-purpose dataset, struggles with medical image-report retrieval, highlighting the importance of domain-specific training data. The evaluation of this work suggests that more effort needs to be spent on improving the robustness of these models. By addressing these limitations, we can develop more reliable cross-domain retrieval models for medical applications.
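The occlusion retrieval task can be sketched as follows: mask a growing fraction of each image, re-embed, and track recall@1 against the paired reports. The stand-in embeddings below would be replaced by any of the evaluated models' encoders:

```python
import numpy as np

def occlude(img, frac, rng):
    """Zero out a random square covering `frac` of the image area."""
    h, w = img.shape[:2]
    side = int((frac * h * w) ** 0.5)
    out = img.copy()
    if side > 0:
        y = rng.integers(0, h - side + 1)
        x = rng.integers(0, w - side + 1)
        out[y:y + side, x:x + side] = 0
    return out

def recall_at_1(img_emb, txt_emb):
    """Fraction of images whose top-ranked report is the paired one."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img_emb @ txt_emb.T
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32))
reports = images.reshape(10, -1)          # stand-in report embeddings
for frac in (0.0, 0.25, 0.5):
    batch = np.stack([occlude(im, frac, rng) for im in images])
    # In the real benchmark these come from model.encode_image / encode_text.
    print(f"occlusion {frac:.0%}: recall@1 = "
          f"{recall_at_1(batch.reshape(10, -1), reports):.2f}")
```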
https://arxiv.org/abs/2501.09134
Small object segmentation, like tumor segmentation, is a difficult and critical task in the field of medical image analysis. Although deep learning based methods have achieved promising performance, they are restricted to the use of binary segmentation masks. Inspired by the rigorous mapping between a binary segmentation mask and its distance map, we adopt the distance map as a novel ground truth and employ a network to compute it. Specifically, we propose a new segmentation framework that incorporates an existing binary segmentation network and a lightweight regression network (dubbed LR-Net). The LR-Net thus converts distance map computation into a regression task and leverages the rich information of distance maps. Additionally, we derive a shape-aware loss by employing distance maps as penalty maps to infer the complete shape of an object. We evaluated our approach on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset. Experimental results show that our approach outperforms classification-based methods as well as other existing state-of-the-art methods.
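The "rigorous mapping" is directly computable: a binary mask determines its (signed) distance map via the Euclidean distance transform. A sketch of deriving the regression target and a distance-weighted penalty; the exact shape-aware loss in the paper may differ:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:45] = True                       # toy "tumor"

# Signed distance map: positive inside the object, negative outside.
dist_map = distance_transform_edt(mask) - distance_transform_edt(~mask)

def shape_aware_loss(pred_prob, target_mask, dmap):
    # Errors far from the boundary (large |distance|) are penalized more.
    return float(np.mean(np.abs(dmap) * (pred_prob - target_mask) ** 2))

pred = np.clip(mask.astype(float) + 0.1 * np.random.rand(64, 64), 0, 1)
print(shape_aware_loss(pred, mask.astype(float), dist_map))
```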
https://arxiv.org/abs/2501.09116
Medical image anonymization aims to protect patient privacy by removing identifying information, while preserving the data utility to solve downstream tasks. In this paper, we address the medical image anonymization problem with a two-stage solution: latent code projection and optimization. In the projection stage, we design a streamlined encoder to project input images into a latent space and propose a co-training scheme to enhance the projection process. In the optimization stage, we refine the latent code using two deep loss functions designed to address the trade-off between identity protection and data utility specific to medical images. Through a comprehensive set of qualitative and quantitative experiments, we showcase the effectiveness of our approach on the MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that can serve as a training set for detecting lung pathologies. Source code is available at this https URL.
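The optimization stage can be sketched as gradient descent on the latent code under a weighted sum of the two losses. The two loss functions below are placeholders for the paper's learned components:

```python
import torch

latent = torch.randn(1, 128, requires_grad=True)  # output of the projection stage
opt = torch.optim.Adam([latent], lr=0.01)
lam = 0.5                                         # identity/utility trade-off weight

def identity_loss(z):   # placeholder: penalize re-identifiable content
    return z.pow(2).mean()

def utility_loss(z):    # placeholder: preserve diagnostically useful content
    return (z.sum() - 1.0).pow(2)

for _ in range(100):
    opt.zero_grad()
    loss = identity_loss(latent) + lam * utility_loss(latent)
    loss.backward()
    opt.step()
print(float(loss))
```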
https://arxiv.org/abs/2501.09114
The Masked Autoencoder (MAE) has recently demonstrated effectiveness in pre-training Vision Transformers (ViT) for analyzing natural images. By reconstructing complete images from partially masked inputs, the ViT encoder gathers contextual information to predict the missing regions. This capability to aggregate context is especially important in medical imaging, where anatomical structures are functionally and mechanically linked to surrounding regions. However, current methods do not consider variations in the number of input images, which is typically the case in real-world Magnetic Resonance (MR) studies. To address this limitation, we propose a 3D Adaptive Masked Autoencoder (AMAE) architecture that accommodates a variable number of 3D input contrasts per subject. A magnetic resonance imaging (MRI) dataset of 45,364 subjects was used for pretraining, and a subset of 1,648 training, 193 validation, and 215 test subjects was used for finetuning. The results demonstrate that self pre-training of this adaptive masked autoencoder can enhance infarct segmentation performance by 2.8%-3.7% for ViT-based segmentation models.
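One way to accommodate a variable number of contrasts is to tokenize whatever volumes a subject has into a single patch sequence before masking; this sketch is an assumption about the mechanism, not the AMAE code:

```python
import torch

def tokenize(volume, patch=8):
    """Cut a (D, H, W) volume into flattened non-overlapping 3D patches."""
    D, H, W = volume.shape
    p = volume.reshape(D // patch, patch, H // patch, patch, W // patch, patch)
    return p.permute(0, 2, 4, 1, 3, 5).reshape(-1, patch ** 3)

# A subject with a random number (1-3) of available 3D contrasts.
subject = [torch.randn(32, 64, 64) for _ in range(torch.randint(1, 4, ()).item())]
tokens = torch.cat([tokenize(v) for v in subject])   # variable-length sequence
keep = torch.rand(len(tokens)) > 0.75                # mask 75% of patches
visible = tokens[keep]                               # encoder sees only these
print(len(subject), tokens.shape, visible.shape)
```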
https://arxiv.org/abs/2501.09096
Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
https://arxiv.org/abs/2501.09001
Mobile robot fleets are currently used in different scenarios such as medical environments or logistics. The management of these systems provides different challenges, varying from the control of the movement of each robot to the allocation of tasks to be performed. The Task Allocation (TA) problem is a key topic for the proper management of mobile robot fleets, aiming to minimize energy consumption and the number of robots required. Solutions in this area are essential to achieving the economic and environmental sustainability of robot fleets, mainly in industrial applications such as warehouse logistics. Minimizing energy consumption casts TA as an optimization problem, which has been treated in recent studies. This work focuses on the analysis of current trends in solving TA for mobile robot fleets. The main TA optimization algorithms are presented, including novel methods based on Artificial Intelligence (AI). Additionally, this work showcases the most important results extracted from simulations, including the frameworks utilized for their development. Finally, conclusions are drawn from the analysis to target gaps that must be addressed in the future.
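In its simplest form, energy-aware task allocation can be sketched as a greedy assignment of each task to the robot that serves it with the least added energy; travel distance below stands in for a real energy model, and the surveyed literature covers far richer formulations, including AI-based optimizers:

```python
import math

robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}   # current robot positions
tasks = [(2.0, 1.0), (9.0, 3.0), (5.0, 5.0)]     # task locations

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

assignment, energy = {}, {r: 0.0 for r in robots}
for t in tasks:
    # Pick the robot whose accumulated "energy" plus travel cost is lowest.
    best = min(robots, key=lambda r: energy[r] + dist(robots[r], t))
    energy[best] += dist(robots[best], t)
    robots[best] = t                              # robot moves to the task
    assignment.setdefault(best, []).append(t)

print(assignment, energy)
```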
https://arxiv.org/abs/2501.08726
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors paired with traditional machine learning classifiers often fail to provide sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The quality of the features extracted by MIAFEx is compared against that of classical feature extractors used with traditional and hybrid classifiers. Their performance is also compared against modern CNN and ViT models on classification tasks, demonstrating superior accuracy and robustness across multiple complex medical imaging classification datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at this https URL
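As described, the refinement mechanism adjusts the classification token with learned weights before classification. The elementwise form below is an assumption for illustration; the actual MIAFEx mechanism may be more elaborate:

```python
import torch
import torch.nn as nn

class RefinedCLSHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.refine = nn.Parameter(torch.ones(dim))  # learnable refinement weights
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, cls_token):
        # Reweight the Transformer's classification token, then classify.
        return self.fc(cls_token * self.refine)

head = RefinedCLSHead(dim=768, num_classes=5)
print(head(torch.randn(2, 768)).shape)   # torch.Size([2, 5])
```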
https://arxiv.org/abs/2501.08562