When employing deep neural networks (DNNs) for semantic segmentation in safety-critical applications like automotive perception or medical imaging, it is important to estimate their performance at runtime, e.g. via uncertainty estimates or prediction quality estimates. Previous works mostly performed uncertainty estimation at the pixel level. One line of research takes a connected-component-wise (segment-wise) perspective, approaching uncertainty estimation at the object level by performing so-called meta classification and meta regression to estimate uncertainty and prediction quality, respectively. In those works, each predicted segment is considered individually to estimate its uncertainty or prediction quality. However, the neighboring segments may provide additional hints on whether a given predicted segment is of high quality, which we study in the present work. On the basis of uncertainty-indicating metrics at the segment level, we use graph neural networks (GNNs) to model a given segment's quality as a function of its own metrics as well as those of its neighboring segments. We compare different GNN architectures and achieve a notable performance improvement.
https://arxiv.org/abs/2409.11373
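To make the segment-graph idea above concrete, here is a minimal sketch (not the authors' code): each predicted segment becomes a node carrying its segment-level uncertainty metrics, edges connect neighboring segments, and a small GNN outputs a per-segment quality estimate (e.g. the meta-classification probability that the segment overlaps the ground truth). The two-layer mean-aggregation architecture, hidden size, and toy adjacency below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleGraphLayer(nn.Module):
    """One round of mean aggregation over neighboring segments (illustrative)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)
        self.lin_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_segments, in_dim), adj: (num_segments, num_segments) 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh_mean = adj @ x / deg           # average metrics of the neighboring segments
        return torch.relu(self.lin_self(x) + self.lin_neigh(neigh_mean))

class SegmentMetaClassifier(nn.Module):
    """Estimates per-segment prediction quality from its own metrics and its neighbors'."""
    def __init__(self, num_metrics, hidden=64):
        super().__init__()
        self.g1 = SimpleGraphLayer(num_metrics, hidden)
        self.g2 = SimpleGraphLayer(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, metrics, adj):
        h = self.g2(self.g1(metrics, adj), adj)
        return torch.sigmoid(self.head(h)).squeeze(-1)

# Toy usage: 5 segments, 8 hand-crafted uncertainty metrics each (e.g. mean entropy, size).
metrics = torch.randn(5, 8)
adj = torch.tensor([[0, 1, 0, 0, 1],
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [1, 0, 0, 1, 0]], dtype=torch.float)
model = SegmentMetaClassifier(num_metrics=8)
print(model(metrics, adj))   # per-segment quality estimates in [0, 1]
```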
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. this https URL
https://arxiv.org/abs/2409.11316
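The abstract's spatial transformer decoder and contextual mask generation module are not specified in detail, so the sketch below only illustrates the generic support-query relation that few-shot segmentation methods build on: a class prototype obtained by masked average pooling of support features, correlated with query features to form a coarse foreground map. The function names and shapes are assumptions for illustration, not the paper's actual modules.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(support_feats, support_mask):
    """Average support features over the annotated foreground region.
    support_feats: (C, H, W), support_mask: (H, W) binary."""
    mask = support_mask.unsqueeze(0).float()            # (1, H, W)
    fg = (support_feats * mask).sum(dim=(1, 2))
    return fg / mask.sum().clamp(min=1.0)               # (C,) class prototype

def prototype_similarity_map(query_feats, prototype):
    """Cosine similarity between every query location and the support prototype.
    query_feats: (C, H, W) -> coarse foreground score map (H, W)."""
    q = F.normalize(query_feats, dim=0)
    p = F.normalize(prototype, dim=0)
    return torch.einsum('chw,c->hw', q, p)

# Toy usage with random backbone features (C=64, 32x32 feature maps);
# a decoder would refine this coarse map into the final segmentation.
support_feats = torch.randn(64, 32, 32)
support_mask = (torch.rand(32, 32) > 0.7).long()
query_feats = torch.randn(64, 32, 32)
proto = masked_average_pooling(support_feats, support_mask)
coarse_map = prototype_similarity_map(query_feats, proto)
print(coarse_map.shape)   # torch.Size([32, 32])
```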
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that encourages deep learning models to learn novel classes not seen during training from only a few labelled examples. The generalized few-shot segmentation setting poses an additional challenge: models must not only adapt to the novel classes but also maintain strong performance on the training base classes. While previous datasets and benchmarks discussed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release a dataset augmenting OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset was released during the OpenEarthMap land cover mapping generalized few-shot challenge at the L3D-IVU workshop held in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details and provide benchmark results on the two phases of the challenge for the validation and test sets.
https://arxiv.org/abs/2409.11227
Semantic segmentation is an essential step for many vision applications in order to understand a scene and the objects within. Recent progress in hyperspectral imaging technology enables its application in driving scenarios, and the hope is that the devices' perceptive abilities provide an advantage over RGB cameras. Even though some datasets exist, there is no standard benchmark available to systematically measure progress on this task and evaluate the benefit of hyperspectral data. In this paper, we work towards closing this gap by providing the HyperSpectral Semantic Segmentation benchmark (HS3-Bench). It combines annotated hyperspectral images from three driving scenario datasets and provides standardized metrics, implementations, and evaluation protocols. We use the benchmark to derive two strong baseline models that surpass the previous state-of-the-art performances both with and without pre-training on the individual datasets. Further, our results indicate that the existing learning-based methods benefit more from leveraging additional RGB training data than from leveraging the additional hyperspectral channels. This poses important questions for future research on hyperspectral imaging for semantic segmentation in driving scenarios. Code to run the benchmark and the strong baseline approaches is available under this https URL.
https://arxiv.org/abs/2409.11205
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
https://arxiv.org/abs/2409.10362
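As a rough illustration of the compression-inspired adaptive masking, the sketch below keeps the highest-magnitude Fourier coefficients of each channel and masks out the rest; the keep_ratio parameter and the top-k selection rule are assumptions, since the abstract does not state the exact selection mechanism used by FOLK.

```python
import torch

def adaptive_frequency_filter(image, keep_ratio=0.25):
    """Keep only the highest-magnitude frequency coefficients of each channel
    (compression-style), masking out the rest; returns the filtered image.
    image: (C, H, W) float tensor."""
    spec = torch.fft.fft2(image)                       # per-channel 2D spectrum
    mag = spec.abs().flatten(1)                        # (C, H*W)
    k = max(1, int(keep_ratio * mag.shape[1]))
    thresh = mag.topk(k, dim=1).values[:, -1:]         # per-channel magnitude cutoff
    mask = (spec.abs() >= thresh.view(-1, 1, 1)).to(image.dtype)
    return torch.fft.ifft2(spec * mask).real           # image with low-response freqs removed

# Toy usage: the filtered view would be fed to one branch while the original image
# goes to the other branch for self-knowledge distillation.
img = torch.rand(3, 64, 64)
filtered_view = adaptive_frequency_filter(img, keep_ratio=0.25)
print(filtered_view.shape)   # torch.Size([3, 64, 64])
```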
Large-scale semantic segmentation networks often achieve high performance, but their application can be challenging when faced with limited sample sizes and computational resources. In scenarios with restricted network size and computational complexity, models encounter significant challenges in capturing long-range dependencies and recovering detailed information in images. We propose a lightweight bilateral semantic segmentation network called the bilateral attention fusion network (BAFNet) to efficiently segment high-resolution urban remote sensing images. The model consists of two paths, namely a dependency path and a remote-local path. The dependency path utilizes large kernel attention to acquire long-range dependencies in the image, while multi-scale local attention and efficient remote attention are designed to construct the remote-local path. Finally, a feature aggregation module is designed to effectively utilize the different features of the two paths. Our proposed method was tested on the public high-resolution urban remote sensing datasets Vaihingen and Potsdam, with mIoU reaching 83.20% and 86.53%, respectively. As a lightweight semantic segmentation model, BAFNet not only outperforms advanced lightweight models in accuracy but also demonstrates performance comparable to non-lightweight state-of-the-art methods on the two datasets, despite a tenfold difference in floating-point operations and a fifteenfold difference in network parameters.
https://arxiv.org/abs/2409.10269
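The dependency path's large kernel attention can be sketched with the common decomposition of a large kernel into a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution; whether BAFNet uses exactly this decomposition and these kernel sizes is an assumption, made here only to illustrate how long-range dependencies are captured cheaply.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Decomposed large-kernel attention: a 5x5 depthwise conv, a 7x7 dilated
    depthwise conv (dilation 3, ~19x19 receptive field), and a 1x1 conv produce
    an attention map that reweights the input features."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn          # long-range context applied as multiplicative attention

# Toy usage on a 64-channel feature map.
feat = torch.randn(1, 64, 128, 128)
print(LargeKernelAttention(64)(feat).shape)   # torch.Size([1, 64, 128, 128])
```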
This paper presents a 2D lidar semantic segmentation dataset to enhance semantic scene understanding for mobile robots in different indoor robotics applications. While most existing lidar semantic datasets focus on 3D lidar sensors and autonomous driving scenarios, the proposed dataset is the first public dataset for 2D lidar sensors and mobile robots. It contains data collected in six different indoor environments and covers nine categories of typical indoor objects. A novel semi-automatic semantic labeling framework is proposed to provide point-wise annotation for the dataset with minimal human effort. Based on this 2D lidar dataset, a hardware-friendly stochastic semantic segmentation benchmark is proposed to give 2D lidar sensors semantic scene understanding capabilities. A series of segmentation tests demonstrate that the proposed learning-based segmentation benchmark achieves more accurate and richer segmentation for each lidar point than traditional geometry-based extraction algorithms.
https://arxiv.org/abs/2409.09899
Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and class "face" in another will require multilabel handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach by integrating language-based embeddings of class names and label space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.
https://arxiv.org/abs/2409.09893
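A minimal sketch of the language-embedding idea: pixel (or mask) embeddings are scored against text embeddings of the class names in whichever label space is active, so overlapping classes from different datasets (e.g. "person" and "face") can coexist without a fixed classifier head. The cosine-similarity scoring, temperature, and shapes are illustrative assumptions; the label space-specific query embeddings of the actual method are not shown.

```python
import torch
import torch.nn.functional as F

def classify_with_text_embeddings(pixel_embeds, class_name_embeds, temperature=0.07):
    """Score every pixel against language embeddings of the active label space.
    pixel_embeds: (D, H, W) from the segmentation decoder,
    class_name_embeds: (num_classes, D), e.g. from a frozen text encoder.
    Returns per-class logits of shape (num_classes, H, W)."""
    p = F.normalize(pixel_embeds, dim=0)
    t = F.normalize(class_name_embeds, dim=1)
    return torch.einsum('nd,dhw->nhw', t, p) / temperature

# Toy usage: the same pixel embeddings can be scored against different, possibly
# inconsistent label spaces at inference time without retraining the head.
pixel_embeds = torch.randn(256, 64, 64)
cityscapes_text = torch.randn(19, 256)     # stand-in for encoded class names
ade_text = torch.randn(150, 256)
print(classify_with_text_embeddings(pixel_embeds, cityscapes_text).shape)  # (19, 64, 64)
print(classify_with_text_embeddings(pixel_embeds, ade_text).shape)         # (150, 64, 64)
```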
Along with the rapid growth of autonomous vehicles (AVs), ever more is demanded of environment perception technology. Among others, HD mapping has taken on one of the more prominent roles in helping the vehicle accomplish essential tasks such as localization and path planning. While increasing research efforts have been directed toward HD map development, a comprehensive overview of the overall HD map mapping and update framework is still lacking. This article introduces the development and current state of the algorithms involved in creating and maintaining HD maps. As part of this study, the primary data preprocessing approaches that turn raw data into information ready for mapping and update purposes, as well as semantic segmentation and localization, are also briefly reviewed. Moreover, map taxonomy, ontology, and quality assessment are extensively discussed, the general representation methods of map data are presented, and mapping algorithms ranging from SLAM to transformer-based learning approaches are also discussed. The development of HD map update algorithms, from change detection to the update methods themselves, is also presented. Finally, the authors discuss possible future developments and the remaining challenges in HD map mapping and update technology. This paper simultaneously serves as a position paper and a tutorial for those new to the HD map mapping and update domains.
https://arxiv.org/abs/2409.09726
Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi-scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi-scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi-scale sparse groups of these scale-specific prototypical parts. This provides a deeper understanding of the interactions between multi-scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype-based methods, and narrows the performance gap with the non-interpretable counterpart models. Code is available at this http URL.
https://arxiv.org/abs/2409.09497
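A prototype layer of the kind described above can be sketched as cosine similarity between dense features and a bank of learnable prototype vectors, evaluated at several feature scales; the number of prototypes, the normalization, and the two-scale usage below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLayer(nn.Module):
    """Dense prototype activations: cosine similarity between every spatial
    feature and a bank of prototype vectors (one bank per scale in practice)."""
    def __init__(self, num_prototypes, dim):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, feats):
        # feats: (B, D, H, W) -> activations: (B, num_prototypes, H, W)
        f = F.normalize(feats, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        return torch.einsum('bdhw,pd->bphw', f, p)

# Toy usage: activations at two scales; a sparse grouping/classification head on top
# would turn them into the dense prediction while remaining inspectable, since each
# prototype can be traced back to a real training patch.
layer = PrototypeLayer(num_prototypes=20, dim=256)
act_fine = layer(torch.randn(1, 256, 64, 64))
act_coarse = layer(torch.randn(1, 256, 32, 32))
print(act_fine.shape, act_coarse.shape)
```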
Class Incremental Semantic Segmentation (CISS) aims to mitigate catastrophic forgetting by maintaining a balance between previously learned and newly introduced knowledge. Existing methods, primarily based on regularization techniques like knowledge distillation, help preserve old knowledge but often face challenges in effectively integrating new knowledge, resulting in limited overall improvement. The Endpoints Weight Fusion (EWF) method, while simple, effectively addresses some of these limitations by dynamically fusing the model weights from previous steps with those from the current step, using a fusion parameter alpha determined by the relative number of previously known classes and newly introduced classes. However, the simplicity of the alpha calculation may limit its ability to fully capture the complexities of different task scenarios, potentially leading to suboptimal fusion outcomes. In this paper, we propose an enhanced approach called Adaptive Weight Fusion (AWF), which introduces an alternating training strategy for the fusion parameter, allowing for more flexible and adaptive weight integration. AWF achieves superior performance by better balancing the retention of old knowledge with the learning of new classes, significantly improving results on benchmark CISS tasks compared to the original EWF. Our experiment code will be released on GitHub.
https://arxiv.org/abs/2409.08516
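The endpoint fusion itself is a simple parameter-wise interpolation, sketched below. The abstract does not give the exact alpha formula, so the class-count ratio used here is an assumption, and AWF's alternating training of alpha is only hinted at by making it a learnable parameter.

```python
import copy
import torch
import torch.nn as nn

def fuse_endpoint_weights(old_model, new_model, alpha):
    """EWF-style fusion: parameter-wise interpolation between the model before
    the incremental step (old) and after it (new)."""
    fused = copy.deepcopy(new_model)
    old_params = dict(old_model.named_parameters())
    with torch.no_grad():
        for name, p in fused.named_parameters():
            p.copy_(alpha * old_params[name] + (1.0 - alpha) * p)
    return fused

# In EWF, alpha is computed from class counts; the exact formula is an assumption here.
n_old, n_new = 15, 5
alpha_ewf = n_old / (n_old + n_new)          # fixed, e.g. 0.75

# In AWF, alpha would instead be a learnable parameter optimized in an
# alternating schedule with the segmentation network (shown only as a scalar here).
alpha_awf = nn.Parameter(torch.tensor(0.75))

old_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, n_old + 1, 1))
new_net = copy.deepcopy(old_net)             # stand-in for the model after the new step
fused_net = fuse_endpoint_weights(old_net, new_net, alpha_ewf)
```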
We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers, which simplify the model architecture and remove the need to interpolate temporal and spatial codes, an operation that can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models while requiring only 8% of the floating-point operations when using MHSA and 11% when using NA, and it also uses fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark, while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.
https://arxiv.org/abs/2409.08461
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at this http URL.
https://arxiv.org/abs/2409.08102
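A minimal sketch of the pseudo-label generation and uncertainty filtering step: stochastic inference is approximated with Monte Carlo dropout (an assumption; the paper may use a different stochastic mechanism), and points whose predictive entropy exceeds a threshold are discarded before self-training. The per-point MLP, feature size, and threshold are illustrative only.

```python
import torch
import torch.nn as nn

def mc_dropout_pseudo_labels(model, points, num_passes=10, entropy_thresh=0.5):
    """Generate pseudo-labels for unlabeled points and keep only the confident ones.
    points: (N, F); model maps (N, F) -> (N, num_classes) logits."""
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([model(points).softmax(dim=-1) for _ in range(num_passes)])
    mean_probs = probs.mean(dim=0)                          # (N, num_classes)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-8).log()).sum(dim=-1)
    pseudo_labels = mean_probs.argmax(dim=-1)
    keep = entropy < entropy_thresh                         # filter by point-wise uncertainty
    return pseudo_labels, keep

# Toy usage with a small per-point MLP over 3D point features.
net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 20))
pts = torch.randn(1000, 9)
labels, keep_mask = mc_dropout_pseudo_labels(net, pts)
print(labels.shape, keep_mask.float().mean())   # fraction of points kept for self-training
```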
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features for meeting real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
https://arxiv.org/abs/2409.07995
Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods which are annotation greedy and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to a $\sim 2\%$ improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.
https://arxiv.org/abs/2409.07801
Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with a Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: \url{this https URL}.
https://arxiv.org/abs/2409.07793
Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window-based self-attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusion of local and global contextual information, crucial for segmenting microtumors and miniature organs. To address this limitation, we propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer architecture that effectively integrates local and global features for precise medical image segmentation. ASSNet comprises a transformer-based U-shaped encoder-decoder network. The encoder utilizes shifted window self-attention across five resolutions to extract multi-scale features, which are then propagated to the decoder through skip connections. We introduce an augmented multi-layer perceptron within the encoder to explicitly model long-range dependencies during feature extraction. Recognizing the constraints of conventional symmetrical encoder-decoder designs, we propose an Adaptive Feature Fusion (AFF) decoder to complement our encoder. This decoder incorporates three key components: the Long Range Dependencies (LRD) block, the Multi-Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC) block. These components synergistically facilitate the effective fusion of multi-scale features extracted by the decoder while capturing long-range dependencies and refining object boundaries. Comprehensive experiments on diverse medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, demonstrate that ASSNet achieves state-of-the-art results. Code and models are available at: \url{this https URL}.
https://arxiv.org/abs/2409.07779
Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from these distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate that our proposed method achieves state-of-the-art performance. All code and datasets are available at this https URL.
https://arxiv.org/abs/2409.07683
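The rotation-aggregative similarity computation can be sketched by evaluating class-similarity maps on several rotated copies of the image, rotating the maps back, and aggregating them. The use of 90-degree rotations, max aggregation, and the stand-in encoder below are assumptions about how such a module might look, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rotation_aggregated_similarity(image_encoder, image, text_embeds):
    """Compute class-similarity maps for four 90-degree rotations of the image,
    rotate the maps back, and aggregate by element-wise max (an assumption).
    image: (3, H, W); text_embeds: (num_classes, D);
    image_encoder: maps (1, 3, H, W) -> (1, D, h, w) dense embeddings."""
    sims = []
    for k in range(4):
        rotated = torch.rot90(image, k, dims=(1, 2)).unsqueeze(0)
        feats = F.normalize(image_encoder(rotated), dim=1)            # (1, D, h, w)
        t = F.normalize(text_embeds, dim=1)
        sim = torch.einsum('nd,bdhw->bnhw', t, feats)                 # (1, C, h, w)
        sims.append(torch.rot90(sim, -k, dims=(2, 3)))                # undo the rotation
    return torch.stack(sims).max(dim=0).values                        # orientation-adaptive maps

# Toy usage with a stand-in encoder (a frozen vision-language backbone in practice).
encoder = torch.nn.Conv2d(3, 128, kernel_size=8, stride=8)
init_maps = rotation_aggregated_similarity(encoder, torch.rand(3, 224, 224),
                                            torch.randn(12, 128))
print(init_maps.shape)   # torch.Size([1, 12, 28, 28])
```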
We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through the encoder blocks and read from and write to the memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms), with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
https://arxiv.org/abs/2409.07613
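The read-write mechanism between the few process tokens and the larger memory can be sketched with standard cross-attention (an assumption; the actual read and write operators may differ): process tokens read from memory before the heavy encoder computation and write their result back afterwards. Token counts and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class MemoryReadWriteBlock(nn.Module):
    """One encoder block where a few process tokens read from and write to a larger
    set of memory tokens via cross-attention (standard MHA used as an assumption)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

    def forward(self, process_tokens, memory_tokens):
        # Read: process tokens attend to memory to retrieve stored information.
        read_out, _ = self.read(process_tokens, memory_tokens, memory_tokens)
        process_tokens = self.process(process_tokens + read_out)
        # Write: memory tokens attend to the updated process tokens to store information.
        write_out, _ = self.write(memory_tokens, process_tokens, process_tokens)
        return process_tokens, memory_tokens + write_out

# Toy usage: 16 process tokens (cheap to push through the block) vs. 196 memory tokens.
block = MemoryReadWriteBlock(dim=192)
proc = torch.randn(2, 16, 192)
mem = torch.randn(2, 196, 192)
proc, mem = block(proc, mem)
print(proc.shape, mem.shape)
```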
Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high-quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC-IND, a self-supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.
https://arxiv.org/abs/2409.07171
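A sketch of the scalar-to-distribution change described above: the INR maps a coordinate to a categorical distribution over material categories, and a compact estimator holds one attenuation coefficient per category, so the attenuation at a point is the expectation under that distribution. The MLP size, coordinate encoding, and initial coefficient values are assumptions; AC-IND initializes them from a rough reconstruction and fast segmentation, which is only stubbed here.

```python
import torch
import torch.nn as nn

class DistributionINR(nn.Module):
    """Implicit neural representation mapping a 2D coordinate to a categorical
    distribution over K material categories instead of a single scalar."""
    def __init__(self, num_materials, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_materials))

    def forward(self, coords):                    # coords: (N, 2) in [-1, 1]
        return self.mlp(coords).softmax(dim=-1)   # (N, K) per-point material probabilities

class AttenuationEstimator(nn.Module):
    """Compact per-category attenuation coefficients; in AC-IND these would be
    initialized from a rough reconstruction and fast segmentation."""
    def __init__(self, init_mu):
        super().__init__()
        self.mu = nn.Parameter(torch.as_tensor(init_mu, dtype=torch.float))

    def forward(self, material_probs):            # expected attenuation per point
        return material_probs @ self.mu           # (N,)

# Toy usage: the attenuation values would enter a differentiable CT forward model and
# be compared against the sparse measured projections; an argmax over the material
# probabilities yields a semantic segmentation map as a by-product.
inr = DistributionINR(num_materials=3)
estimator = AttenuationEstimator(init_mu=[0.0, 0.2, 0.5])
coords = torch.rand(4096, 2) * 2 - 1
atten = estimator(inr(coords))
print(atten.shape)   # torch.Size([4096])
```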