Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
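The abstract does not spell out the form of the compositional loss, so the following is only a minimal PyTorch sketch of one way a part-to-object composition constraint could look: per-pixel part probabilities are summed into object probabilities via an assumed part-to-object mapping and supervised with object-level labels. All tensor names and the aggregation choice are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def compositional_loss(part_logits, object_labels, part_to_object):
    """Hypothetical compositional constraint: compose per-pixel part probabilities into
    object probabilities and supervise them with (cheaper) object-level labels.
    part_logits: (B, P, H, W); object_labels: (B, H, W) with values in [0, O);
    part_to_object: (P,) long tensor mapping each part index to its parent object."""
    num_objects = int(part_to_object.max().item()) + 1
    part_probs = part_logits.softmax(dim=1)                                   # per-pixel part distribution
    membership = F.one_hot(part_to_object, num_classes=num_objects).float()   # (P, O)
    object_probs = torch.einsum("bphw,po->bohw", part_probs, membership)      # sum parts into objects
    return F.nll_loss(torch.log(object_probs.clamp_min(1e-8)), object_labels)
```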
https://arxiv.org/abs/2501.09688
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets are not only extremely expensive but also face legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns to fuse the content of various images, guided by a segmentation-based mixing mask in feature space, and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme enables training models from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
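As a rough illustration only (not the HydraMix architecture, which learns the fusion adversarially), a segmentation-guided mixing of same-class feature maps could be sketched like this; the soft masks and tensor shapes are assumptions.

```python
import torch

def mix_same_class_features(features, mix_masks):
    """Illustrative feature-space mixing: combine feature maps of N same-class images
    using soft, segmentation-derived masks normalized to sum to one per pixel.
    features: (N, C, H, W); mix_masks: (N, 1, H, W). Returns one mixed map (C, H, W)."""
    weights = mix_masks / mix_masks.sum(dim=0, keepdim=True).clamp_min(1e-8)
    return (weights * features).sum(dim=0)

# Example: compose three views of one class into a new sample representation.
feats = torch.randn(3, 64, 32, 32)
masks = torch.rand(3, 1, 32, 32)
mixed = mix_same_class_features(feats, masks)   # (64, 32, 32)
```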
https://arxiv.org/abs/2501.09504
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill pre-trained 2D image representations into a 3D model. However, the other parts of the design have been surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, yet they have been overlooked in prior works. In this work, we show that simple fixes to these designs notably outperform existing methods, improving downstream task performance by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with a commonly deployed sparse convolution input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting itself to the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
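To make the spatial-quantization point concrete, here is a small, self-contained sketch (not the paper's code) of cylindrical voxelization: with a fixed angular bin size, the arc length per bin grows with range, so distant points land in coarser cells once mapped to a sparse-convolution-style integer grid. The voxel sizes below are arbitrary placeholders.

```python
import numpy as np

def cylindrical_voxel_index(points, voxel_size=(0.1, 0.02, 0.1)):
    """Illustrative only: quantize LiDAR points (N, 3) in cylindrical coordinates
    (rho, phi, z). With a fixed angular bin d_phi, the arc length rho * d_phi grows
    with range, so distant points are quantized more coarsely than nearby ones --
    the kind of side effect the abstract refers to."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)
    phi = np.arctan2(y, x)                       # angular bins may be negative; fine for illustration
    coords = np.stack([rho, phi, z], axis=1)
    return np.floor(coords / np.asarray(voxel_size)).astype(np.int64)
```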
https://arxiv.org/abs/2501.09485
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: this https URL.
https://arxiv.org/abs/2501.09436
Image segmentation, a key task in computer vision, has traditionally relied on convolutional neural networks (CNNs), yet these models struggle to capture complex spatial dependencies and contextual information, to handle objects at varying scales, and to avoid manually crafted architecture components. This paper examines the shortcomings of CNN-based models and the shift towards transformer architectures to overcome those limitations. This work reviews state-of-the-art transformer-based segmentation models, addressing segmentation-specific challenges and their solutions. The paper discusses current challenges in transformer-based segmentation and outlines promising future trends, such as lightweight architectures and enhanced data efficiency. This survey serves as a guide for understanding the impact of transformers in advancing segmentation capabilities and overcoming the limitations of traditional models.
https://arxiv.org/abs/2501.09372
Nowadays, more and more images are available. Annotation and retrieval of these images pose classification problems, where each class is defined as the group of database images labelled with a common semantic label. Various systems have been proposed for content-based retrieval, as well as for image classification and indexing. In this paper, a hierarchical classification framework is proposed for bridging the semantic gap effectively and achieving multi-category image classification. A well-known pre-processing and post-processing method was applied to three problems: image segmentation, object identification, and image classification. The method was used to classify single-object images from the Amazon and Google datasets. The classification was tested with four different classifiers: Bayes Network (BN), Random Forest (RF), Bagging, and Vote. The estimated classification accuracies ranged from 20% to 99% (using 10-fold cross-validation). The Bagging classifier delivers the best performance, followed by the Random Forest classifier.
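The evaluation protocol described above (10-fold cross-validation over several classifiers) can be reproduced in spirit with scikit-learn. This is only an illustrative stand-in: scikit-learn has no Bayes-network classifier, so Gaussian naive Bayes substitutes for the paper's BN model, and a toy dataset substitutes for the Amazon/Google image features.

```python
# Illustrative reproduction of the evaluation protocol (not the paper's code).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)          # placeholder for extracted image features
classifiers = {
    "NaiveBayes (stand-in for BayesNet)": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "Vote": VotingClassifier(
        [("nb", GaussianNB()),
         ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
        voting="soft"),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```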
https://arxiv.org/abs/2501.09311
Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies applied simultaneously: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80%, while maintaining strong image restoration performance.
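The exact losses are defined in the paper; purely as an illustration of the dual-teacher idea, a combined objective might look like the sketch below, where the student encoder mimics Teacher A's features, the student decoder mimics Teacher B's output, and a reconstruction term anchors the restored image. The weightings and the L1 distance are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_teacher_distillation_loss(student_enc_feat, teacher_a_feat,
                                   student_dec_out, teacher_b_out,
                                   restored, clean, alpha=1.0, beta=1.0):
    """Illustrative dual-teacher distillation objective (not the authors' code)."""
    drl = F.l1_loss(student_enc_feat, teacher_a_feat.detach())   # degradation-removal learning
    irl = F.l1_loss(student_dec_out, teacher_b_out.detach())     # image-reconstruction learning
    rec = F.l1_loss(restored, clean)                             # supervised restoration term
    return rec + alpha * drl + beta * irl
```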
https://arxiv.org/abs/2501.09268
Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with the GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/µL. Furthermore, it improves the model's transparency through detailed explanations and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.
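Independent of the paper's models, the absolute-quantification step in ddPCR conventionally relies on Poisson statistics over droplet counts; the sketch below shows that standard estimate. The 0.85 nL droplet volume is a typical instrument value, used here only as an illustrative default.

```python
import math

def ddpcr_concentration(n_positive, n_total, droplet_volume_ul=0.00085):
    """Standard Poisson estimate used in ddPCR: copies/uL = -ln(1 - p) / droplet volume,
    where p is the fraction of positive droplets (assumed droplet volume ~0.85 nL)."""
    p = n_positive / n_total
    return -math.log(1.0 - p) / droplet_volume_ul

# e.g., 320 positive droplets out of 18,000 analyzed droplets
print(f"{ddpcr_concentration(320, 18000):.1f} copies/uL")
```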
https://arxiv.org/abs/2501.09218
Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements using the Segment Anything Model (SAM) foundation model, we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were used together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at real-world 3D scale. Furthermore, crack geometric attributes were measured automatically and directly within the dense 3D point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.
https://arxiv.org/abs/2501.09203
Large-scale text-to-image (T2I) diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from natural language text captions. Multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this work, we present ObjectDiffusion, a model that takes inspiration from cutting-edge image generative frameworks to seamlessly condition T2I models with new bounding-box capabilities. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pretrained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets on all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple objects of different sizes and locations.
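As a loose, hypothetical sketch of GLIGEN-style grounding (not the ObjectDiffusion implementation), each bounding box can be paired with a phrase embedding and mapped to a grounding token that downstream attention layers could attend to. The dimensions and the simple MLP encoder are placeholders.

```python
import torch
import torch.nn as nn

class GroundingTokenizer(nn.Module):
    """Rough sketch of bounding-box conditioning: each (box, phrase-embedding) pair is
    mapped to a grounding token injectable into a diffusion model's attention layers."""
    def __init__(self, text_dim=768, token_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim + 4, token_dim), nn.SiLU(),
                                 nn.Linear(token_dim, token_dim))

    def forward(self, boxes, phrase_emb):
        # boxes: (B, K, 4) normalized to [0, 1]; phrase_emb: (B, K, text_dim)
        return self.mlp(torch.cat([phrase_emb, boxes], dim=-1))   # (B, K, token_dim)
```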
https://arxiv.org/abs/2501.09194
Prostate cancer (PCa) is the most prevalent cancer among men in the United States, accounting for nearly 300,000 cases, 29% of all diagnoses and 35,000 total deaths in 2024. Traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI) have been pivotal in diagnosis, but have faced limitations in specificity and generalizability. In this paper, we explore the potential of enhancing PCa lesion segmentation using a novel MRI modality called synthetic correlated diffusion imaging (CDI$^s$). We employ several state-of-the-art deep learning models, including U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet, to segment PCa lesions from a 200 CDI$^s$ patient cohort. We find that SegResNet achieved superior segmentation performance with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$. Notably, the Attention U-Net, while slightly less accurate (DSC $74.82 \pm 2.0$), offered a favorable balance between accuracy and computational efficiency. Our findings demonstrate the potential of deep learning models in improving PCa lesion segmentation using CDI$^s$ to enhance PCa management and clinical support.
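For reference, the reported metric is the Dice-Sorensen coefficient between predicted and ground-truth lesion masks; a standard NumPy implementation (not tied to any of the listed models) is shown below, with the abstract's values corresponding to this quantity expressed as a percentage.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice-Sorensen coefficient between two binary masks (arrays of 0/1 or bool)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```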
https://arxiv.org/abs/2501.09185
Vision foundation models have achieved remarkable progress across various image analysis tasks. In the image segmentation task, foundation models like the Segment Anything Model (SAM) enable generalizable zero-shot segmentation through user-provided prompts. However, SAM, primarily trained on natural images, lacks the domain-specific expertise of medical imaging. This limitation poses challenges when applying SAM to medical image segmentation, including the need for extensive fine-tuning on specialized medical datasets and a dependency on manual prompts, which are both labor-intensive and require intervention from medical experts. This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM), a novel method designed to adapt the advanced Segment Anything Model 2 (SAM2) for 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of SAM2 to enable few-shot adaptation, leveraging a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation, without requiring model fine-tuning. To handle the volumetric nature of medical images, we incorporate a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging datasets and compare it with supervised learning methods, zero-shot SAM approaches, and fine-tuned medical SAM methods. Results show that FATE-SAM delivers robust and accurate segmentation while eliminating the need for large annotated datasets and expert intervention. FATE-SAM provides a practical, efficient solution for medical image segmentation, making it more accessible for clinical applications.
https://arxiv.org/abs/2501.09138
Small object segmentation, like tumor segmentation, is a difficult and critical task in the field of medical image analysis. Although deep learning based methods have achieved promising performance, they are restricted to the use of binary segmentation masks. Inspired by the rigorous mapping between binary segmentation masks and distance maps, we adopt the distance map as a novel ground truth and employ a network to carry out the computation of the distance map. Specifically, we propose a new segmentation framework that incorporates an existing binary segmentation network and a lightweight regression network (dubbed LR-Net). Thus, the LR-Net converts the distance map computation into a regression task and leverages the rich information of distance maps. Additionally, we derive a shape-aware loss by employing distance maps as penalty maps to infer the complete shape of an object. We evaluated our approach on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset. Experimental results show that our approach outperforms classification-based methods as well as other existing state-of-the-art methods.
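The mapping between a binary mask and a distance map can be made concrete with a Euclidean distance transform; the sketch below builds a signed distance-map target and one plausible shape-aware weighting, but the paper's exact regression target and penalty formulation may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """One common distance-map target derived from a binary mask: positive distances
    inside the object, negative outside (an assumed, not the paper's, formulation)."""
    mask = mask.astype(bool)
    inside = distance_transform_edt(mask)      # distance to nearest background pixel
    outside = distance_transform_edt(~mask)    # distance to nearest foreground pixel
    return inside - outside

def shape_aware_weighted_bce(prob, target_mask, dist_map):
    """Illustrative shape-aware penalty: pixel-wise binary cross-entropy weighted by
    |distance|, so errors far from the object boundary cost more."""
    eps = 1e-7
    bce = -(target_mask * np.log(prob + eps) + (1 - target_mask) * np.log(1 - prob + eps))
    weights = 1.0 + np.abs(dist_map)
    return float((weights * bce).mean())
```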
https://arxiv.org/abs/2501.09116
Towards clinical interpretation, this paper presents a new "output-with-confidence" segmentation neural network with multiple input images, multiple output segmentation maps, and their pairwise relations. A confidence score for a test image without ground truth can be estimated from the differences among the estimated relation maps. We build the method on the widely used vanilla U-Net for segmentation; the new model, named Relation U-Net, can output segmentation maps of the input images as well as an estimated confidence score for the test image without ground truth. Experimental results on four public datasets show that Relation U-Net not only provides better accuracy than vanilla U-Net but also estimates a confidence score that is linearly correlated with the segmentation accuracy on test images.
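The abstract describes the confidence score only qualitatively; a minimal sketch of the idea, using the disagreement among several estimated relation maps as an inverse confidence signal, might look like this. The specific statistic used by Relation U-Net may differ.

```python
import torch

def confidence_from_relations(relation_maps):
    """Illustrative confidence estimate: given K estimated relation maps for the same
    test image (tensor of shape (K, H, W)), use their disagreement as an inverse
    confidence signal -- no ground truth needed."""
    disagreement = relation_maps.float().std(dim=0).mean()
    return 1.0 / (1.0 + disagreement)

# Example with three pairwise relation maps.
rels = torch.rand(3, 128, 128)
print(confidence_from_relations(rels))
```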
https://arxiv.org/abs/2501.09101
The Masked Autoencoder (MAE) has recently demonstrated effectiveness in pre-training Vision Transformers (ViT) for analyzing natural images. By reconstructing complete images from partially masked inputs, the ViT encoder gathers contextual information to predict the missing regions. This capability to aggregate context is especially important in medical imaging, where anatomical structures are functionally and mechanically linked to surrounding regions. However, current methods do not consider variations in the number of input images, which is typically the case in real-world Magnetic Resonance (MR) studies. To address this limitation, we propose a 3D Adaptive Masked Autoencoder (AMAE) architecture that accommodates a variable number of 3D input contrasts per subject. A magnetic resonance imaging (MRI) dataset of 45,364 subjects was used for pretraining, and a subset of 1,648 training, 193 validation, and 215 test subjects was used for finetuning. The results demonstrate that self pre-training of this adaptive masked autoencoder can enhance infarct segmentation performance by 2.8%-3.7% for ViT-based segmentation models.
https://arxiv.org/abs/2501.09096
Acquiring and annotating surgical data is often resource-intensive, ethically constrained, and requires significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and a Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest that generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.
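The canonical Fibonacci lattice mentioned above is a standard construction for near-uniform points; one plausible way to turn it into well-separated mask colors (not necessarily the paper's exact mapping) is to place class points on the unit sphere and rescale them into RGB:

```python
import numpy as np

def fibonacci_lattice_colors(n_classes):
    """Illustrative use of a canonical Fibonacci lattice to pick near-uniform,
    well-separated class colors: points on the unit sphere rescaled to [0, 255]^3."""
    golden = (1 + 5 ** 0.5) / 2
    i = np.arange(n_classes)
    theta = 2 * np.pi * i / golden                 # longitudes from the golden ratio
    z = 1 - (2 * i + 1) / n_classes                # evenly spaced latitudes
    r = np.sqrt(1 - z ** 2)
    xyz = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
    return np.round((xyz + 1) / 2 * 255).astype(np.uint8)

print(fibonacci_lattice_colors(8))
```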
https://arxiv.org/abs/2501.09008
Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
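The paper defines its own label-agnostic contrastive objective; as a generic illustration of that family of losses, a SimCLR-style NT-Xent over two augmented views of the same scan could be written as follows.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Generic label-agnostic contrastive loss (SimCLR-style NT-Xent), shown only to
    illustrate the kind of objective used for such pretraining. z1, z2: (N, D)
    embeddings of two augmented views of the same N samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                              # (2N, D)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                           # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```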
https://arxiv.org/abs/2501.09001
Predicting future brain states is crucial for understanding healthy aging and neurodegenerative diseases. Longitudinal brain MRI registration, a cornerstone for such analyses, has long been limited by its inability to forecast future developments, reliance on extensive, dense longitudinal data, and the need to balance registration accuracy with temporal smoothness. In this work, we present TimeFlow, a novel framework for longitudinal brain MRI registration that overcomes all these challenges. Leveraging a U-Net architecture with temporal conditioning inspired by diffusion models, TimeFlow enables accurate longitudinal registration and facilitates prospective analyses through future image prediction. Unlike traditional methods that depend on explicit smoothness regularizers and dense sequential data, TimeFlow achieves temporal consistency and continuity without these constraints. Experimental results highlight its superior performance in both future timepoint prediction and registration accuracy compared to state-of-the-art methods. Additionally, TimeFlow supports novel biological brain aging analyses, effectively differentiating neurodegenerative conditions from healthy aging. It eliminates the need for segmentation, thereby avoiding the challenges of non-trivial annotation and inconsistent segmentation errors. TimeFlow paves the way for accurate, data-efficient, and annotation-free prospective analyses of brain aging and chronic diseases.
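The diffusion-inspired temporal conditioning presumably involves embedding the time gap between scans; a standard sinusoidal embedding of that kind is sketched below, though how TimeFlow injects it into its U-Net is specific to the paper.

```python
import math
import torch

def sinusoidal_time_embedding(t, dim=128):
    """Diffusion-style sinusoidal embedding of a batch of (possibly fractional) time
    gaps t with shape (B,), returning (B, dim) features for conditioning."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

emb = sinusoidal_time_embedding(torch.tensor([0.5, 1.0, 2.5]))   # (3, 128)
```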
https://arxiv.org/abs/2501.08667
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at \url{this https URL}.
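A toy sketch of the dense-interconnection idea (not the DETRIS code): each low-rank adapter consumes the concatenation of the block input and all previous adapter outputs, so visual features propagate across every layer. The dimensions and the MLP adapter form are assumptions.

```python
import torch
import torch.nn as nn

class DenselyConnectedAdapters(nn.Module):
    """Each low-rank adapter receives the concatenation of the input and all preceding
    adapter outputs, establishing dense interconnections between layers."""
    def __init__(self, dim, rank, num_layers):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim * (i + 1), rank), nn.GELU(), nn.Linear(rank, dim))
            for i in range(num_layers)
        )

    def forward(self, x):                       # x: (B, N, dim) token features
        states = [x]
        for adapter in self.adapters:
            states.append(adapter(torch.cat(states, dim=-1)))
        return states[-1]
```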
https://arxiv.org/abs/2501.08580
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real-world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
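PGPC's exact loss is given in the paper; as a minimal sketch under assumptions, a pseudo-label-guided pixel contrastive term can be written as a supervised contrastive loss over sampled pixel embeddings, where pixels sharing a pseudo-label act as positives.

```python
import torch
import torch.nn.functional as F

def pixel_contrast_with_pseudo_labels(embeddings, pseudo_labels, temperature=0.1):
    """Minimal sketch (not the PGPC implementation): supervised pixel-level contrastive
    loss where sampled pixels sharing a pseudo-label are positives.
    embeddings: (N, D) pixel features; pseudo_labels: (N,) predicted classes."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                                   # (N, N) similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (pseudo_labels[:, None] == pseudo_labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp_min(1)
    return -(pos_mask.float() * log_prob).sum(dim=1).div(pos_count).mean()
```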
https://arxiv.org/abs/2501.09040