Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
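The selection step described above can be sketched in a few lines: score the misaligned reference units, turn each test unit's predicted score into a conformal p-value, and run a Benjamini-Hochberg pass so that, on average, at most an alpha fraction of the selected units are misaligned. This is a minimal sketch of that generic recipe, not the paper's exact algorithm; the function name and data layout are illustrative.

```python
import numpy as np

def conformal_select(cal_scores, cal_aligned, test_scores, alpha=0.1):
    """Select test units whose predicted alignment score clears a
    data-dependent threshold with an FDR-style guarantee.

    cal_scores  : predicted alignment scores for held-out reference units
    cal_aligned : boolean ground-truth alignment status for those units
    test_scores : predicted scores for new units
    """
    # Conformal p-value: how plausible is it that this unit is misaligned,
    # judged against the scores of known-misaligned calibration units?
    neg = np.sort(cal_scores[~cal_aligned])
    n_neg = len(neg)
    # number of misaligned calibration scores >= each test score
    ge = n_neg - np.searchsorted(neg, test_scores, side="left")
    pvals = (ge + 1) / (n_neg + 1)

    # Benjamini-Hochberg on the conformal p-values: find the largest rank k
    # with p_(k) <= alpha * k / m and select those k units.
    m = len(pvals)
    order = np.argsort(pvals)
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            k = rank
    selected = np.zeros(m, dtype=bool)
    selected[order[:k]] = True
    return selected
```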
https://arxiv.org/abs/2405.10301
Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
https://arxiv.org/abs/2405.10075
Deformable image registration (alignment) is highly sought after in numerous clinical applications, such as computer-aided diagnosis and disease progression analysis. Deep Convolutional Neural Network (DCNN)-based image registration methods have demonstrated advantages in terms of registration accuracy and computational speed. However, while most methods excel at global alignment, they often perform worse in aligning local regions. To address this challenge, this paper proposes a mask-guided encoder-decoder DCNN-based image registration method, named MrRegNet. This approach employs a multi-resolution encoder for feature extraction and subsequently estimates multi-resolution displacement fields in the decoder to handle the substantial deformation of images. Furthermore, segmentation masks are employed to direct the model's attention toward aligning local regions. The results show that the proposed method outperforms traditional methods like Demons and a well-known deep learning method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D brain MRI dataset with large deformations. Importantly, the image alignment accuracies are significantly improved at local regions guided by segmentation masks. GitHub link: this https URL.
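As a toy illustration of what an estimated displacement field does (this is not the MrRegNet architecture itself, just the resampling step common to deformable registration), the sketch below warps a 2D image with a dense displacement field using linear interpolation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_image(moving, displacement):
    """Warp a 2D moving image with a dense displacement field.

    moving       : (H, W) array
    displacement : (2, H, W) array; displacement[0] holds row offsets and
                   displacement[1] holds column offsets at each pixel.
    """
    h, w = moving.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # sample the moving image at the displaced locations
    coords = np.stack([rows + displacement[0], cols + displacement[1]])
    return map_coordinates(moving, coords, order=1, mode="nearest")
```

A zero displacement field reproduces the input; a constant column offset shifts the image, and a learned multi-resolution field would encode the local deformations the paper targets.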
https://arxiv.org/abs/2405.10068
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 images newly added to PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models and evaluation of deep learning models for multi-task learning.
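For the multi-label classification route, the UMLS concepts attached to each image can be encoded as multi-hot target vectors. A minimal sketch (the concept IDs below are merely illustrative, not drawn from the dataset):

```python
import numpy as np

def encode_concepts(image_concepts, vocab):
    """Map per-image lists of UMLS concept IDs to multi-hot label vectors
    suitable for training a multi-label classifier."""
    index = {c: i for i, c in enumerate(vocab)}
    labels = np.zeros((len(image_concepts), len(vocab)), dtype=np.float32)
    for row, concepts in enumerate(image_concepts):
        for c in concepts:
            labels[row, index[c]] = 1.0
    return labels
```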
https://arxiv.org/abs/2405.10004
Multi-task learning (MTL) is a learning paradigm that enables the simultaneous training of multiple communicating algorithms. Although MTL has been successfully applied to either regression or classification tasks alone, incorporating mixed types of tasks into a unified MTL framework remains challenging, primarily due to variations in the magnitudes of losses associated with different tasks. This challenge, particularly evident in MTL applications with joint feature selection, often results in biased selections. To overcome this obstacle, we propose a provable loss weighting scheme that analytically determines the optimal weights for balancing regression and classification tasks. This scheme significantly mitigates the otherwise biased feature selection. Building upon this scheme, we introduce MTLComb, an MTL algorithm and software package encompassing optimization procedures, training protocols, and hyperparameter estimation procedures. MTLComb is designed for learning shared predictors among tasks of mixed types. To showcase the efficacy of MTLComb, we conduct tests on both simulated data and biomedical studies pertaining to sepsis and schizophrenia.
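MTLComb derives its weights analytically; as a stand-in illustration of the balancing idea only (not the paper's derivation), the sketch below weights each loss by its inverse magnitude so that the regression and classification terms contribute equally to the combined objective:

```python
def balanced_weights(reg_loss, cls_loss):
    """Inverse-magnitude weights, normalized to sum to 1, so the weighted
    regression and classification losses contribute equally."""
    w_reg, w_cls = 1.0 / reg_loss, 1.0 / cls_loss
    total = w_reg + w_cls
    return w_reg / total, w_cls / total
```

With these weights, a combined loss `w_reg * reg_loss + w_cls * cls_loss` no longer lets the larger-magnitude task dominate joint feature selection.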
https://arxiv.org/abs/2405.09886
We introduce the Riskman ontology & shapes for representing and analysing information about risk management for medical devices. Risk management is concerned with taking necessary precautions so a medical device does not cause harm to users or the environment. To date, risk management documentation is submitted to notified bodies (for certification) in the form of semi-structured natural language text. We propose to use classes from the Riskman ontology to logically model risk management documentation and to use the included SHACL constraints to check for syntactic completeness and conformity to relevant standards. In particular, the ontology is modelled after ISO 14971 and the recently published VDE Spec 90025. Our proposed methodology has the potential to save many person-hours for both manufacturers (when creating risk management documentation) and notified bodies (when assessing submitted applications for certification), and thus offers considerable benefits for healthcare and, by extension, society as a whole.
https://arxiv.org/abs/2405.09875
In weakly supervised medical image segmentation, the absence of structural priors and the discreteness of class feature distribution present a challenge, i.e., how to accurately propagate supervision signals from local to global regions without excessively spreading them to other irrelevant regions? To address this, we propose a novel weakly supervised medical image segmentation framework named PCLMix, comprising dynamic mix augmentation, pixel-level contrastive learning, and consistency regularization strategies. Specifically, PCLMix is built upon a heterogeneous dual-decoder backbone, addressing the absence of structural priors through a strategy of dynamic mix augmentation during training. To handle the discrete distribution of class features, PCLMix incorporates pixel-level contrastive learning based on prediction uncertainty, effectively enhancing the model's ability to differentiate inter-class pixel differences and intra-class consistency. Furthermore, to reinforce segmentation consistency and robustness, PCLMix employs an auxiliary decoder for dual consistency regularization. In the inference phase, the auxiliary decoder is dropped, so no computational complexity is added. Extensive experiments on the ACDC dataset demonstrate that PCLMix appropriately propagates local supervision signals to the global scale, further narrowing the gap between weakly supervised and fully supervised segmentation methods. Our code is available at this https URL.
https://arxiv.org/abs/2405.06288
Automated region of interest detection in histopathological image analysis is a challenging and important topic with tremendous potential impact on clinical practice. The deep-learning methods used in computational pathology may help us to reduce costs and increase the speed and accuracy of cancer diagnosis. We started with the UNC Melanocytic Tumor Dataset cohort that contains 160 hematoxylin and eosin whole-slide images of primary melanomas (86) and nevi (74). We randomly assigned 80% (134) as a training set and built an in-house deep-learning method to allow for classification, at the slide level, of nevi and melanomas. The proposed method performed well on the other 20% (26) test dataset; the accuracy of the slide classification task was 92.3% and our model also performed well in terms of predicting the region of interest annotated by the pathologists, showing excellent performance of our model on melanocytic skin tumors. Even though we tested the experiments on the skin tumor dataset, our work could also be extended to other medical image detection problems to benefit the clinical evaluation and diagnosis of different tumors.
https://arxiv.org/abs/2405.09851
Diffusion models have recently gained significant traction due to their ability to generate high-fidelity and diverse images and videos conditioned on text prompts. In medicine, this application promises to address the critical challenge of data scarcity, a consequence of barriers in data sharing, stringent patient privacy regulations, and disparities in patient population and demographics. By generating realistic and varying medical 2D and 3D images, these models offer a rich, privacy-respecting resource for algorithmic training and research. To this end, we introduce MediSyn, a pair of instruction-tuned text-guided latent diffusion models with the ability to generate high-fidelity and diverse medical 2D and 3D images across specialties and modalities. Through established metrics, we show significant improvement in broad medical image and video synthesis guided by text prompts.
https://arxiv.org/abs/2405.09806
Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstration examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at this https URL.
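Batching queries simply means assembling one prompt that carries the demonstrations once, followed by several numbered queries answered in a single API call. A minimal sketch of that idea, with the prompt wording and helper name invented purely for illustration:

```python
def build_batched_prompt(demos, queries):
    """Assemble one prompt holding many demonstration examples plus a batch
    of queries, so a single model call answers several queries at once.

    demos   : list of (input_text, label) pairs
    queries : list of input texts to classify in this call
    """
    lines = ["Classify each input. Examples:"]
    for x, y in demos:
        lines.append(f"Input: {x}\nLabel: {y}")
    lines.append("Now answer each numbered query with its label only.")
    for i, q in enumerate(queries, 1):
        lines.append(f"Query {i}: {q}")
    return "\n\n".join(lines)
```

Since the (potentially very long) demonstration block is paid for once per call rather than once per query, per-query cost and latency drop roughly in proportion to the batch size.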
https://arxiv.org/abs/2405.09798
This paper investigates an extremely challenging problem, barely-supervised medical image segmentation (BSS), where the training dataset comprises limited labeled data with only single-slice annotations and numerous unlabeled images. Currently, state-of-the-art (SOTA) BSS methods utilize a registration-based paradigm, depending on image registration to propagate single-slice annotations into volumetric pseudo labels for constructing a complete labeled set. However, this paradigm has a critical limitation: the pseudo labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: training a model using only single-annotated slices as the labeled set without relying on image registration. To this end, we formulate BSS as an unsupervised domain adaptation (UDA) problem. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis, which may result in a side effect: domain shifts between the synthesized images and the original images. Then, a frequency and spatial mix-up strategy (FSX) is further introduced to mitigate the domain shifts for UDA. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method with only one labeled slice achieves an 80.77% dice score on left atrial segmentation, outperforming the SOTA by 61.28%. The code will be released upon the publication of this paper.
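The paper's exact FSX formulation is not reproduced here, but a common frequency-domain mix-up recipe (in the spirit of Fourier-style domain adaptation) blends the amplitude spectra of two images while keeping the source phase, which transfers appearance while preserving structure; a sketch under that assumption:

```python
import numpy as np

def frequency_mixup(src, tgt, lam=0.5):
    """Mix the Fourier amplitude spectra of two images while keeping the
    source phase, a common way to reduce appearance-level domain shift."""
    f_src = np.fft.fft2(src)
    f_tgt = np.fft.fft2(tgt)
    # interpolate amplitudes; lam=0 keeps the source image unchanged
    amp = (1 - lam) * np.abs(f_src) + lam * np.abs(f_tgt)
    mixed = amp * np.exp(1j * np.angle(f_src))
    return np.real(np.fft.ifft2(mixed))
```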
https://arxiv.org/abs/2405.09777
Correspondence-based statistical shape modeling (SSM) stands as a powerful technology for morphometric analysis in clinical research. SSM facilitates population-level characterization and quantification of anatomical shapes such as bones and organs, aiding in pathology and disease diagnostics and treatment planning. Despite its potential, SSM remains under-utilized in medical research due to the significant overhead associated with automatic construction methods, which demand complete, aligned shape surface representations. Additionally, optimization-based techniques rely on bias-inducing assumptions or templates and have prolonged inference times as the entire cohort is simultaneously optimized. To overcome these challenges, we introduce Point2SSM++, a principled, self-supervised deep learning approach that directly learns correspondence points from point cloud representations of anatomical shapes. Point2SSM++ is robust to misaligned and inconsistent input, providing SSM that accurately samples individual shape surfaces while effectively capturing population-level statistics. Additionally, we present principled extensions of Point2SSM++ to adapt it for dynamic spatiotemporal and multi-anatomy use cases, demonstrating the broad versatility of the Point2SSM++ framework. Through extensive validation across diverse anatomies, evaluation metrics, and clinically relevant downstream tasks, we demonstrate Point2SSM++'s superiority over existing state-of-the-art deep learning models and traditional approaches. Point2SSM++ substantially enhances the feasibility of SSM generation and significantly broadens its array of potential clinical applications.
https://arxiv.org/abs/2405.09707
Anatomical shape analysis plays a pivotal role in clinical research and hypothesis testing, where the relationship between form and function is paramount. Correspondence-based statistical shape modeling (SSM) facilitates population-level morphometrics but requires a cumbersome, potentially bias-inducing construction pipeline. Recent advancements in deep learning have streamlined this process in inference by providing SSM prediction directly from unsegmented medical images. However, the proposed approaches are fully supervised and require utilizing a traditional SSM construction pipeline to create training data, thus inheriting the associated burdens and limitations. To address these challenges, we introduce a weakly supervised deep learning approach to predict SSM from images using point cloud supervision. Specifically, we propose reducing the supervision associated with the state-of-the-art fully Bayesian variational information bottleneck DeepSSM (BVIB-DeepSSM) model. BVIB-DeepSSM is an effective, principled framework for predicting probabilistic anatomical shapes from images with quantification of both aleatoric and epistemic uncertainties. Whereas the original BVIB-DeepSSM method requires strong supervision in the form of ground truth correspondence points, the proposed approach utilizes weak supervision via point cloud surface representations, which are more readily obtainable. Furthermore, the proposed approach learns correspondence in a completely data-driven manner without prior assumptions about the expected variability in the shape cohort. Our experiments demonstrate that this approach yields similar accuracy and uncertainty estimation to the fully supervised scenario while substantially enhancing the feasibility of model training for SSM construction.
https://arxiv.org/abs/2405.09697
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from pre-trained supervised models on medical images against embeddings derived from pre-trained unsupervised models on non-medical images for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
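A late interaction scorer in the text-matching (ColBERT-like) spirit takes, for each query-side embedding, its best cosine match among a candidate's embeddings and sums the maxima; candidates are then re-ranked by this score. The sketch below is a generic illustration of that scheme, not the paper's implementation:

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim-style late interaction: for every query-side vector take its
    best cosine match among the candidate's vectors, then sum the maxima."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_vecs, n_doc_vecs)
    return sim.max(axis=1).sum()

def rerank(query_vecs, candidates):
    """Order candidate volumes/regions by descending late-interaction score."""
    scores = [late_interaction_score(query_vecs, c) for c in candidates]
    return np.argsort(scores)[::-1]
```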
https://arxiv.org/abs/2405.09334
Heterogeneous, interconnected, systems-level, molecular data have become increasingly available and key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding of multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods map nodes to points in low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing limited omics data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with the performance depending on the underlying topology-function network biology hypotheses, the biomedical applications and evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power call for the creation and training of efficient, explainable and controllable models, free of potentially dangerous, unexpected behaviour, that can make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to an efficient and scalable software implementation, and to apply it to biomedical informatics. It will lead to a paradigm shift in computational and biomedical understanding of data and diseases that will open up ways to solving some of the major bottlenecks in precision medicine and other domains.
https://arxiv.org/abs/2405.09595
Medical image interpretation using deep learning has shown promise but often requires extensive expert-annotated datasets. To reduce this annotation burden, we develop an Image-Graph Contrastive Learning framework that pairs chest X-rays with structured report knowledge graphs automatically extracted from radiology notes. Our approach uniquely encodes the disconnected graph components via a relational graph convolution network and transformer attention. In experiments on the CheXpert dataset, this novel graph encoding strategy enabled the framework to outperform existing methods that use image-text contrastive learning in 1% linear evaluation and few-shot settings, while achieving comparable performance to radiologists. By exploiting unlabeled paired images and text, our framework demonstrates the potential of structured clinical insights to enhance contrastive learning for medical images. This work points toward reducing demands on medical experts for annotations, improving diagnostic precision, and advancing patient care through robust medical image understanding.
https://arxiv.org/abs/2405.09594
The automation of writing imaging reports is a valuable tool for alleviating the workload of radiologists. Crucial steps in this process involve the cross-modal alignment between medical images and reports, as well as the retrieval of similar historical cases. However, the presence of presentation-style vocabulary (e.g., sentence structure and grammar) in reports poses challenges for cross-modal alignment. Additionally, existing methods for retrieving similar historical cases suffer from suboptimal performance owing to the modal gap issue. In response, this paper introduces a novel method, named Factual Serialization Enhancement (FSE), for chest X-ray report generation. FSE begins with the structural entities approach to eliminate presentation-style vocabulary in reports, providing specific input for our model. Then, uni-modal features are learned through cross-modal alignment between images and factual serialization in reports. Subsequently, we present a novel approach to retrieve similar historical cases from the training set, leveraging aligned image features. These features implicitly preserve semantic similarity with their corresponding reference reports, enabling us to calculate similarity solely among aligned features. This effectively eliminates the modal gap issue for knowledge retrieval without the requirement for disease labels. Finally, the cross-modal fusion network is employed to query valuable information from these cases, enriching image features and aiding the text decoder in generating high-quality reports. Experiments on the MIMIC-CXR and IU X-ray datasets, covering both specific and general scenarios, demonstrate the superiority of FSE over state-of-the-art approaches in both natural language generation and clinical efficacy metrics.
https://arxiv.org/abs/2405.09586
This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale. Leveraging the colon cancer dataset from the Temporal Histories of Your Medical Events (THYME) corpus, we adapted a state-of-the-art AMR parser utilizing continuous training. Our approach incorporates data augmentation techniques to enhance the accuracy of AMR structure predictions. Notably, through this learning strategy, our parser achieved an impressive F1 score of 88% on the THYME corpus's colon cancer dataset. Moreover, our research delved into the efficacy of data required for domain adaptation within the realm of clinical notes, presenting domain adaptation data requirements for AMR parsing. This exploration not only underscores the parser's robust performance but also highlights its potential in facilitating a deeper understanding of clinical narratives through structured semantic representations.
https://arxiv.org/abs/2405.09153
In medical image segmentation tasks, diffusion models have shown significant potential. However, mainstream diffusion models suffer from drawbacks such as requiring many sampling steps and producing predictions slowly. Recently, consistency models, as a standalone generative network, have resolved this issue. Compared to diffusion models, consistency models can reduce sampling to a single step, not only achieving similar generative effects but also significantly speeding up training and prediction. However, they are not suitable for image segmentation tasks, and their application in the medical imaging field has not yet been explored. Therefore, this paper applies the consistency model to medical image segmentation tasks, designing multi-scale feature signal supervision modes and loss function guidance to achieve model convergence. Experiments verify that the CTS model can obtain better medical image segmentation results with a single sampling step during the test phase.
https://arxiv.org/abs/2405.09056
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest on bone radiography. A practical processing pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks, compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
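Self-supervised alignment of visual and textual embedding spaces is typically trained with a symmetric InfoNCE objective over paired batches, where matched image-report pairs sit on the diagonal of the similarity matrix. A generic sketch of such a loss (not necessarily the paper's exact objective; the temperature value is illustrative):

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/report embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):
        # cross-entropy of each row against its diagonal (matched) entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each X-ray embedding toward its own report and pushes it away from the other reports in the batch, which is what lets the image encoder transfer to the downstream tasks above.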
https://arxiv.org/abs/2405.08932