Digital pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose $\text{CellViT}^{\scriptscriptstyle ++}$, a framework for generalized cell segmentation in digital pathology. $\text{CellViT}^{\scriptscriptstyle ++}$ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach that requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that $\text{CellViT}^{\scriptscriptstyle ++}$ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, $\text{CellViT}^{\scriptscriptstyle ++}$ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available at this https URL.
https://arxiv.org/abs/2501.05269
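The data-efficient cell-type classification described in the abstract — adapting to new cell types by fitting a lightweight head on frozen deep cell features — can be illustrated with a minimal nearest-centroid sketch. The embeddings and classifier below are hypothetical stand-ins; the abstract does not specify CellViT++'s actual classification head.

```python
import numpy as np

def fit_centroids(features, labels):
    """Compute one prototype (mean embedding) per cell type."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(features, classes, centroids):
    """Assign each cell embedding to its nearest class prototype."""
    # Pairwise Euclidean distances, shape (n_cells, n_classes)
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]

# Toy 2-D "embeddings" for two cell types (stand-ins for deep cell features)
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(1, 0.1, (5, 2))])
train_y = np.array([0] * 5 + [1] * 5)
classes, cents = fit_centroids(train_x, train_y)
pred = classify(np.array([[0.05, -0.02], [0.98, 1.01]]), classes, cents)
print(pred)  # → [0 1]
```

With the encoder frozen, such a head needs only a handful of labeled cells per class, which is what makes this style of adaptation computationally cheap.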
Semantic segmentation for autonomous driving becomes even more challenging under adverse driving conditions. Standard models trained on data recorded under ideal conditions show deteriorated performance in unfavorable weather or illumination conditions. Fine-tuning on the new task or condition would overwrite the previously learned information, resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaptation methods improves performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out by a collection of convolutional autoencoders. We extensively evaluate our proposed approach on several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
https://arxiv.org/abs/2501.05246
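The domain-inference step in PSS — a collection of autoencoders routing each input to its domain-specific segmentation module — can be sketched with tiny PCA-style linear autoencoders standing in for the convolutional ones. All data and dimensions below are illustrative, not the paper's setup.

```python
import numpy as np

class LinearAE:
    """PCA-style linear autoencoder, a tiny stand-in for the convolutional
    autoencoders that PSS uses for domain inference."""
    def __init__(self, data, k=1):
        self.mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
        self.w = vt[:k]                         # top-k principal directions

    def recon_error(self, x):
        z = (x - self.mean) @ self.w.T          # encode
        x_hat = z @ self.w + self.mean          # decode
        return float(np.mean((x - x_hat) ** 2))

def select_domain(x, autoencoders):
    """Route the input to the domain whose autoencoder reconstructs it best;
    the matching domain-specific segmentation module would then be applied."""
    return int(np.argmin([ae.recon_error(x) for ae in autoencoders]))

rng = np.random.default_rng(1)
# Two synthetic "domains" with different dominant feature directions
clear = rng.normal(0, 1, (200, 1)) @ np.array([[1.0, 0.1]]) + rng.normal(0, 0.05, (200, 2))
foggy = rng.normal(0, 1, (200, 1)) @ np.array([[0.1, 1.0]]) + rng.normal(0, 0.05, (200, 2))
aes = [LinearAE(clear), LinearAE(foggy)]
print(select_domain(np.array([[2.0, 0.2]]), aes))  # → 0 (clear-like input)
```

The design choice mirrors the abstract: because each domain gets its own module and the router is trained per domain, adding a new condition grows the collection instead of overwriting old weights.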
Late gadolinium enhancement MRI (LGE MRI) is the gold standard for the detection of myocardial scars following myocardial infarction (MI). LGE MRI requires the injection of a contrast agent, which carries potential side effects and increases scanning time and patient discomfort. To address these issues, we propose a novel framework that combines cardiac motion observed in cine MRI with image texture information to segment the myocardium and scar tissue in the left ventricle. Cardiac motion tracking can be formulated as a full cardiac image cycle registration problem, which can be solved via deep neural networks. Experimental results prove that the proposed method can achieve scar segmentation based on non-contrasted cine images with comparable accuracy to LGE MRI. This demonstrates its potential as an alternative to contrast-enhanced techniques for scar detection.
https://arxiv.org/abs/2501.05241
Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.
https://arxiv.org/abs/2501.05238
External cervical resorption (ECR) is a resorptive process affecting teeth. While in some patients, active resorption ceases and gets replaced by osseous tissue, in other cases, the resorption progresses and ultimately results in tooth loss. For proper ECR assessment, cone-beam computed tomography (CBCT) is the recommended imaging modality, enabling a 3-D characterization of these lesions. While it is possible to manually identify and measure ECR resorption in CBCT scans, this process can be time-intensive and highly prone to human error. Therefore, there is an urgent need to develop an automated method to identify and quantify the severity of ECR resorption using CBCT. Here, we present a method for ECR lesion segmentation that is based on automatic, binary classification of locally extracted voxel-wise texture features. We evaluate our method on 6 longitudinal CBCT datasets and show that certain texture features can be used to accurately detect subtle CBCT signal changes due to ECR. We also present preliminary analyses clustering texture features within a lesion to stratify the defects and identify patterns indicative of calcification. These methods are important steps in developing prognostic biomarkers to predict whether ECR will continue to progress or cease, ultimately informing treatment decisions.
https://arxiv.org/abs/2501.05236
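A minimal sketch of the pipeline above: locally extracted voxel-wise texture features followed by binary classification. Local mean and variance are hypothetical stand-ins for the paper's (unspecified) texture features, and a plain variance threshold stands in for the trained classifier.

```python
import numpy as np

def local_texture_features(vol, r=1):
    """Local mean and variance over a (2r+1)^3 neighborhood for every voxel."""
    pad = np.pad(vol.astype(float), r, mode="reflect")
    n = (2 * r + 1) ** 3
    s = np.zeros(vol.shape)
    s2 = np.zeros(vol.shape)
    for dz in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                sh = pad[r + dz:r + dz + vol.shape[0],
                         r + dy:r + dy + vol.shape[1],
                         r + dx:r + dx + vol.shape[2]]
                s += sh
                s2 += sh ** 2
    mean = s / n
    var = s2 / n - mean ** 2
    return mean, var

def segment_lesion(vol, var_thresh):
    """Voxel-wise binary classification: high local variance -> candidate voxel."""
    _, var = local_texture_features(vol)
    return var > var_thresh

vol = np.zeros((8, 8, 8))
vol[3:6, 3:6, 3:6] = np.random.default_rng(2).normal(0, 1, (3, 3, 3))  # textured region
mask = segment_lesion(vol, var_thresh=0.05)
print(bool(mask.sum() > 0), bool(mask[0, 0, 0]))  # → True False
```

The same per-voxel feature vectors could then be clustered within a detected lesion, as the preliminary calcification analysis describes.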
Tumor volume segmentation on MRI is a challenging and time-consuming process that is performed manually in typical clinical settings. This work presents an approach to automated delineation of head and neck tumors on MRI scans, developed in the context of the MICCAI Head and Neck Tumor Segmentation for MR-Guided Applications (HNTS-MRG) 2024 Challenge. Rather than designing a new, task-specific convolutional neural network, the focus of this research was to propose improvements to the configuration commonly used in medical segmentation tasks, relying solely on the traditional U-Net architecture. The empirical results presented in this article suggest the superiority of patch-wise normalization used for both training and sliding window inference. They also indicate that the performance of segmentation models can be enhanced by applying a scheduled data augmentation policy during training. Finally, it is shown that a small improvement in quality can be achieved by using Gaussian weighting to combine predictions for individual patches during sliding window inference. The model with the best configuration obtained an aggregated Dice Similarity Coefficient (DSCagg) of 0.749 in Task 1 and 0.710 in Task 2 on five cross-validation folds. The ensemble of five models (one best model per validation fold) showed consistent results on a private test set of 50 patients with a DSCagg of 0.752 in Task 1 and 0.718 in Task 2 (team name: this http URL). The source code and model weights are freely available at this http URL.
https://arxiv.org/abs/2501.05120
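The Gaussian weighting of patch predictions during sliding-window inference can be sketched as follows. Patch size, stride, and the sigma scale are illustrative choices, not the challenge submission's values.

```python
import numpy as np

def gaussian_weight(size, sigma_scale=0.25):
    """2-D Gaussian weight map peaking at the patch center."""
    x = np.arange(size) - (size - 1) / 2
    g = np.exp(-(x ** 2) / (2 * (sigma_scale * size) ** 2))
    return np.outer(g, g)

def sliding_window_predict(image, patch, stride, predict):
    """Blend overlapping patch predictions with Gaussian weighting so that
    patch-center predictions dominate over less reliable patch borders."""
    H, W = image.shape
    out = np.zeros((H, W))
    norm = np.zeros((H, W))
    wmap = gaussian_weight(patch)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out[y:y + patch, x:x + patch] += predict(image[y:y + patch, x:x + patch]) * wmap
            norm[y:y + patch, x:x + patch] += wmap
    return out / np.maximum(norm, 1e-12)

# Sanity check with an identity "model": blending must reproduce the input
img = np.random.default_rng(3).random((8, 8))
rec = sliding_window_predict(img, patch=4, stride=2, predict=lambda t: t)
print(bool(np.allclose(rec, img)))  # → True
```

Patch-wise normalization would slot in naturally here by z-scoring each tile inside `predict`, matching the abstract's finding that it helps at both training and inference time.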
The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at \url{this https URL}.
https://arxiv.org/abs/2501.05095
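The geospatial sampling idea — selecting tiles so that diverse land-cover classes are represented rather than drawing uniformly at random — might look roughly like the stratified quota rule below. The equal-per-class quota is an assumption for illustration, not the paper's exact procedure (which also uses digital elevation models).

```python
import numpy as np

def geospatial_sample(tile_landcover, n_total, rng):
    """Pick tile indices stratified by land-cover class, so rare classes are
    represented instead of being swamped by the dominant class."""
    classes = np.unique(tile_landcover)
    per_class = max(1, n_total // len(classes))
    chosen = []
    for c in classes:
        idx = np.flatnonzero(tile_landcover == c)
        chosen.extend(rng.choice(idx, size=min(per_class, idx.size), replace=False))
    return np.array(chosen)

rng = np.random.default_rng(4)
# 100 tiles with imbalanced land cover: 90 forest (0), 8 urban (1), 2 water (2)
landcover = np.array([0] * 90 + [1] * 8 + [2] * 2)
sample = geospatial_sample(landcover, n_total=12, rng=rng)
print(np.unique(landcover[sample]))  # → [0 1 2]  (all classes represented)
```

Uniform random sampling with the same budget would draw only ~0.24 water tiles in expectation and often miss the class entirely, which is consistent with the paper's observation that random sampling fails to yield the same downstream gains.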
Contactless fingerprint recognition systems offer a hygienic, user-friendly, and efficient alternative to traditional contact-based methods. However, their accuracy heavily relies on precise fingertip detection and segmentation, particularly under challenging background conditions. This paper introduces TipSegNet, a novel deep learning model that achieves state-of-the-art performance in segmenting fingertips directly from grayscale hand images. TipSegNet leverages a ResNeXt-101 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) for multi-scale representation, enabling accurate segmentation across varying finger poses and image qualities. Furthermore, we employ an extensive data augmentation strategy to enhance the model's generalizability and robustness. TipSegNet outperforms existing methods, achieving a mean Intersection over Union (mIoU) of 0.987 and an accuracy of 0.999, representing a significant advancement in contactless fingerprint segmentation. This enhanced accuracy has the potential to substantially improve the reliability and effectiveness of contactless biometric systems in real-world applications.
https://arxiv.org/abs/2501.05076
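For reference, the reported mIoU metric is the per-class intersection over union averaged across classes; a minimal implementation:

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean Intersection over Union across classes, the mIoU reported above."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                  # ignore classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy masks: 0 = background, 1 = fingertip
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred   = np.array([[0, 0, 1, 1],
                   [0, 0, 0, 1]])
print(round(mean_iou(pred, target, n_classes=2), 3))  # → 0.775
```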
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy which applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
https://arxiv.org/abs/2501.05068
3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and 3D-GRES tasks, respectively.
https://arxiv.org/abs/2501.04995
Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multi-context. We applied MTCM to four different models, increasing performance across all of them and notably achieving 47.6 J&F on MeViS. Code is available at this https URL.
https://arxiv.org/abs/2501.04939
Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image's tokens in a vanilla next-token paradigm. To enhance the text-to-image alignment, we further propose to distill the knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods. Project page: this https URL
https://arxiv.org/abs/2501.04699
We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of supervised counterparts on highly domain-specific datasets. We focus on segmentation-specific test-time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding aggregation operations specific to preserving spatial structure. Our resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains. Our Seg-TTO demonstrates clear performance improvements across these tasks, establishing a new state of the art. Code: this https URL.
https://arxiv.org/abs/2501.04696
Federated and continual learning have been established as approaches to enable privacy-aware learning on continuously changing data, as required for deploying AI systems on histopathology images. However, data shifts can occur in a dynamic world, spatially between institutions and temporally, as data change over time. This leads to two issues: Client Drift, where the central model degrades when aggregating updates from clients trained on shifted data, and Catastrophic Forgetting, caused by temporal shifts such as changes in patient populations. Both degrade the model's performance on previously seen data or in spatially distributed training. Although both problems arise from the same underlying cause, data shifts, existing research addresses them only individually. In this work, we introduce a method that jointly alleviates Client Drift and Catastrophic Forgetting using our proposed Dynamic Barlow Continuity, which evaluates client updates on a public reference dataset and uses this evaluation to guide the training process toward a spatially and temporally shift-invariant model. We evaluate our approach on the histopathology datasets BCSS and Semicol and show our method to be highly effective, jointly improving the Dice score from 15.8% to 71.6% under Client Drift and from 42.5% to 62.8% under Catastrophic Forgetting. This enables Dynamic Learning by establishing spatio-temporal shift-invariance.
https://arxiv.org/abs/2501.04588
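A simplified sketch of the core idea above — scoring client updates on a public reference dataset and letting the scores steer aggregation — with a linear classifier standing in for a segmentation network. The score-proportional weighting rule is an illustrative stand-in for Dynamic Barlow Continuity, not the paper's exact mechanism.

```python
import numpy as np

def ref_score(w, X_ref, y_ref):
    """Accuracy of a linear classifier on the public reference dataset."""
    return float(((X_ref @ w > 0).astype(int) == y_ref).mean())

def guided_aggregate(client_ws, X_ref, y_ref):
    """Weight client updates by their reference-set score so that clients
    trained on shifted data contribute less to the aggregated model."""
    scores = np.array([ref_score(w, X_ref, y_ref) for w in client_ws])
    alpha = scores / scores.sum()
    return sum(a * w for a, w in zip(alpha, client_ws))

rng = np.random.default_rng(5)
X_ref = rng.normal(0, 1, (200, 2))              # public reference dataset
true_w = np.array([1.0, -1.0])
y_ref = (X_ref @ true_w > 0).astype(int)
good = true_w + rng.normal(0, 0.05, 2)          # client on in-distribution data
drifted = -true_w                               # client whose data has shifted
agg = guided_aggregate([good, drifted], X_ref, y_ref)
print(bool(ref_score(agg, X_ref, y_ref) > 0.9))  # → True
```

Because the reference set is public and fixed, the same gate applies to both failure modes: drifted spatial clients and temporally shifted updates are down-weighted by the same evaluation.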
Endovascular navigation is a crucial aspect of minimally invasive procedures, where precise control of curvilinear instruments like guidewires is critical for successful interventions. A key challenge in this task is accurately predicting the evolving shape of the guidewire as it navigates through the vasculature, which presents complex deformations due to interactions with the vessel walls. Traditional segmentation methods often fail to provide accurate real-time shape predictions, limiting their effectiveness in highly dynamic environments. To address this, we propose SplineFormer, a new transformer-based architecture, designed specifically to predict the continuous, smooth shape of the guidewire in an explainable way. By leveraging the transformer architecture, our network effectively captures the intricate bending and twisting of the guidewire, representing it as a spline for greater accuracy and smoothness. We integrate our SplineFormer into an end-to-end robot navigation system by leveraging the condensed information. The experimental results demonstrate that our SplineFormer is able to perform endovascular navigation autonomously and achieves a 50% success rate when cannulating the brachiocephalic artery on the real robot.
https://arxiv.org/abs/2501.04515
Despite widespread adoption of deep learning models to address a variety of computer vision tasks, planetary science has yet to see extensive utilization of such tools to address its unique problems. On Titan, the largest moon of Saturn, tracking seasonal trends and weather patterns of clouds provides crucial insights into one of the most complex climates in the Solar System, yet much of the available image data are still analyzed in a conventional way. In this work, we apply a Mask R-CNN trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft - a previously unexplored approach to a big data problem in planetary science. We demonstrate that an automated technique can provide quantitative measures for clouds, such as areas and centroids, that may otherwise be prohibitively time-intensive to produce by human mapping. Furthermore, despite Titan specific challenges, our approach yields accuracy comparable to contemporary cloud identification studies on Earth and other worlds. We compare the efficiencies of human-driven versus algorithmic approaches, showing that transfer learning provides speed-ups that may open new horizons for data investigation for Titan. Moreover, we suggest that such approaches have broad potential for application to similar problems in planetary science where they are currently under-utilized. Future planned missions to the planets and remote sensing initiatives for the Earth promise to provide a deluge of image data in the coming years that will benefit strongly from leveraging machine learning approaches to perform the analysis.
https://arxiv.org/abs/2501.04459
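The quantitative cloud measures mentioned above — areas and centroids — fall out directly from an instance mask once segmentation is automated. A minimal sketch on a toy instance-ID mask:

```python
import numpy as np

def cloud_measures(instance_mask):
    """Per-cloud area (pixel count) and centroid from an instance-ID mask,
    the kind of measurement that is prohibitively slow to produce by hand."""
    measures = {}
    for cloud_id in np.unique(instance_mask):
        if cloud_id == 0:               # 0 = background
            continue
        ys, xs = np.nonzero(instance_mask == cloud_id)
        measures[int(cloud_id)] = {
            "area": int(ys.size),
            "centroid": (float(ys.mean()), float(xs.mean())),
        }
    return measures

mask = np.zeros((6, 6), dtype=int)
mask[1:3, 1:3] = 1          # cloud 1: 2x2 patch
mask[4, 4:6] = 2            # cloud 2: 1x2 patch
m = cloud_measures(mask)
print(m[1]["area"], m[1]["centroid"])  # → 4 (1.5, 1.5)
print(m[2]["area"], m[2]["centroid"])  # → 2 (4.0, 4.5)
```

In the paper's setting the instance masks would come from the transfer-learned Mask R-CNN rather than being hand-drawn.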
This study presents an open-source toolkit to address critical challenges in preprocessing data for self-supervised learning (SSL) for 3D medical imaging, focusing on data privacy and computational efficiency. The toolkit comprises two main components: a segmentation network that delineates foreground regions to optimize data sampling and thus reduce training time, and a segmentation network that identifies anonymized regions, preventing erroneous supervision in reconstruction-based SSL methods. Experimental results demonstrate high robustness, with mean Dice scores exceeding 98.5 across all anonymization methods and surpassing 99.5 for foreground segmentation tasks, highlighting the efficacy of the toolkit in supporting SSL applications in 3D medical imaging for both CT and MRI images. The weights and code are available at this https URL.
https://arxiv.org/abs/2501.04361
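For reference, the Dice similarity coefficient used for the evaluation above (reported on a 0-100 scale in the abstract) can be computed as:

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float(2 * inter / (pred.sum() + target.sum() + eps))

target = np.zeros((10, 10), dtype=int); target[2:8, 2:8] = 1   # 36 voxels
pred = np.zeros((10, 10), dtype=int);   pred[3:8, 2:8] = 1     # 30 voxels, all inside
print(round(dice(pred, target), 4))  # → 0.9091
```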
While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: 1) an adaptive compression mechanism that adaptively selects several subsets of latent features to balance the optimizations for multiple machine vision tasks (e.g., segmentation and detection) and human vision; and 2) a task-specific adapter that uses a parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. By using these two modules, we can optimize the bit-rate costs and improve machine vision performance. In general, our proposed EAC can seamlessly integrate with existing NIC (i.e., Ballé2018 and Cheng2020) and NVC (i.e., DVC and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.
https://arxiv.org/abs/2501.04329
Nanoparticle superlattices consisting of ordered arrangements of nanoparticles exhibit unique optical, magnetic, and electronic properties arising from nanoparticle characteristics as well as their collective behaviors. Understanding how processing conditions influence the nanoscale arrangement and microstructure is critical for engineering materials with desired macroscopic properties. Microstructural features such as grain boundaries, lattice defects, and pores significantly affect these properties but are challenging to quantify using traditional manual analyses as they are labor-intensive and prone to errors. In this work, we present a machine learning workflow for automating grain segmentation in scanning electron microscopy (SEM) images of nanoparticle superlattices. This workflow integrates signal processing techniques, such as Radon transforms, with unsupervised learning methods like agglomerative hierarchical clustering to identify and segment grains without requiring manually annotated data. In the workflow, we transform the raw pixel data into an explainable numerical representation of superlattice orientations for clustering. Benchmarking results demonstrate the workflow's robustness against noisy images and edge cases, with a processing speed of four images per minute on standard computational hardware. This efficiency makes the workflow scalable to large datasets and makes it a valuable tool for integrating data-driven models into decision-making processes for material design and analysis. For example, one can use this workflow to quantify grain size distributions at varying processing conditions like temperature and pressure and, using that knowledge, adjust processing conditions to achieve desired superlattice orientations and grain sizes.
https://arxiv.org/abs/2501.04172
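A toy version of the pipeline above, with a structure-tensor orientation estimate standing in for the Radon-transform features and a greedy neighbor merge standing in for agglomerative hierarchical clustering (both substitutions are deliberate simplifications):

```python
import numpy as np

def patch_orientation(patch):
    """Dominant fringe orientation (radians) from the structure tensor — a
    lightweight stand-in for the workflow's Radon-transform features."""
    gy, gx = np.gradient(patch.astype(float))
    jxx, jyy, jxy = (gx * gx).sum(), (gy * gy).sum(), (gx * gy).sum()
    return 0.5 * np.arctan2(2 * jxy, jxx - jyy)

def segment_grains(image, patch=8, tol=0.2):
    """Greedy agglomerative labeling: neighboring patches whose orientations
    differ by less than `tol` radians are merged into the same grain."""
    ph, pw = image.shape[0] // patch, image.shape[1] // patch
    theta = np.array([[patch_orientation(
        image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch])
        for j in range(pw)] for i in range(ph)])
    labels = -np.ones((ph, pw), dtype=int)
    nxt = 0
    for i in range(ph):
        for j in range(pw):
            if j > 0 and abs(theta[i, j] - theta[i, j - 1]) < tol:
                labels[i, j] = labels[i, j - 1]
            elif i > 0 and abs(theta[i, j] - theta[i - 1, j]) < tol:
                labels[i, j] = labels[i - 1, j]
            else:
                labels[i, j] = nxt
                nxt += 1
    return labels

# Synthetic "SEM image": vertical fringes on the left, horizontal on the right
img = np.zeros((32, 32))
img[:, :16] = np.sin(np.arange(16.0))[None, :]   # vertical fringes
img[:, 16:] = np.sin(np.arange(32.0))[:, None]   # horizontal fringes
lab = segment_grains(img)
print(len(np.unique(lab)))  # → 2
```

The key property carries over from the paper: the clustering operates on an explainable numerical representation of local orientation, not on raw pixels, so each grain label can be traced back to a measured angle.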
Autonomous robot navigation in complex environments requires robust perception as well as high-level scene understanding due to perceptual challenges, such as occlusions, and uncertainty introduced by robot movement. For example, a robot climbing a cluttered staircase can misinterpret clutter as a step, misrepresenting the state and compromising safety. This requires robust state estimation methods capable of inferring the underlying structure of the environment even from incomplete sensor data. In this paper, we introduce a novel method for robust state estimation of staircases. To address the challenge of perceiving occluded staircases extending beyond the robot's field-of-view, our approach combines an infinite-width staircase representation with a finite endpoint state to capture the overall staircase structure. This representation is integrated into a Bayesian inference framework to fuse noisy measurements, enabling accurate estimation of staircase location even with partial observations and occlusions. Additionally, we present a segmentation algorithm that works in conjunction with the staircase estimation pipeline to accurately identify clutter-free regions on a staircase. Our method is extensively evaluated on a real robot across diverse staircases, demonstrating significant improvements in estimation accuracy and segmentation performance compared to baseline approaches.
https://arxiv.org/abs/2501.04170
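The Bayesian fusion of noisy, partial measurements can be illustrated with a scalar Gaussian update on a single staircase parameter. The state here is deliberately reduced to one dimension (step rise); the paper's infinite-width-plus-endpoint representation is richer, and all numbers below are illustrative.

```python
import numpy as np

def fuse(mu, var, z, meas_var):
    """One Gaussian Bayesian update: fold a noisy measurement z of a staircase
    parameter (here, the step rise) into the running estimate."""
    k = var / (var + meas_var)          # Kalman gain
    return mu + k * (z - mu), (1 - k) * var

rng = np.random.default_rng(6)
true_rise = 0.17                        # meters (illustrative ground truth)
mu, var = 0.20, 0.05 ** 2               # rough prior on the step rise
for z in true_rise + rng.normal(0, 0.02, 25):   # noisy partial observations
    mu, var = fuse(mu, var, z, meas_var=0.02 ** 2)
print(bool(abs(mu - true_rise) < 0.02), bool(var < 1e-4))  # → True True
```

Each occluded or partial view still contributes a measurement, so the estimate (and its shrinking variance) improves even when no single observation covers the whole staircase.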