This work presents a novel RGB-D-inertial dynamic SLAM method that enables accurate localisation when the majority of the camera view is occluded by multiple dynamic objects over a long period of time. Most dynamic SLAM approaches either remove dynamic objects as outliers when they account for a minor proportion of the visual input, or detect dynamic objects using semantic segmentation before camera tracking. Dynamic objects that cause large occlusions are therefore difficult to detect without prior information, and the visual information remaining from the static background is not enough to support localisation when the large occlusion lasts for a long period. To overcome these problems, our framework presents a robust visual-inertial bundle adjustment that simultaneously tracks the camera, estimates a cluster-wise dense segmentation of dynamic objects, and maintains a static sparse map by combining dense and sparse features. The experimental results demonstrate that our method achieves promising localisation and object segmentation performance compared to other state-of-the-art methods in the scenario of long-term large occlusion.
https://arxiv.org/abs/2303.13316
Knowledge distillation is a popular technique for transferring the knowledge from a large teacher model to a smaller student model by mimicking. However, distillation by directly aligning the feature maps between teacher and student may enforce overly strict constraints on the student and thus degrade the performance of the student model. To alleviate this feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student with pixel-wise transformations. In this paper, we find that aligning the feature maps between teacher and student along the channel-wise dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model. Based on it, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements in various computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), which demonstrates the effectiveness and the versatility of the proposed method. The code will be made publicly available.
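To make the idea concrete, here is a minimal PyTorch sketch of a learnable nonlinear channel-wise transform and the single-hyper-parameter combination of task and distillation losses; the module, its MLP shape, and the loss names are illustrative assumptions rather than the authors' released implementation, and the student and teacher feature maps are assumed to share spatial size.

```python
# Sketch: channel-wise feature alignment for distillation (illustrative, not official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAlign(nn.Module):
    """Maps student channels to teacher channels with a small nonlinear MLP."""
    def __init__(self, c_student: int, c_teacher: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_student, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, c_teacher),
        )

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        # f_s: (B, C_s, H, W) -> apply the transform along the channel axis at every location.
        b, c, h, w = f_s.shape
        x = f_s.permute(0, 2, 3, 1).reshape(-1, c)       # (B*H*W, C_s)
        x = self.mlp(x)                                  # (B*H*W, C_t)
        return x.reshape(b, h, w, -1).permute(0, 3, 1, 2)

def distillation_step(f_s, f_t, logits, targets, align: ChannelAlign, alpha: float = 1.0):
    """One hyper-parameter `alpha` balances the task loss and the feature-distillation loss."""
    task_loss = F.cross_entropy(logits, targets)
    distill_loss = F.mse_loss(align(f_s), f_t.detach())  # assumes matching spatial size
    return task_loss + alpha * distill_loss
```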
https://arxiv.org/abs/2303.13212
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
https://arxiv.org/abs/2303.13043
LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at this https URL.
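A rough sketch of the two geometric ingredients described here, radial window assignment and exponential radial splitting, might look as follows; the bin counts, radius range, and exact splitting rule are assumptions for illustration, not the paper's configuration.

```python
# Sketch: radial window partitioning and exponential radial binning (illustrative).
import math
import torch

def radial_window_index(xyz: torch.Tensor, n_theta: int = 64, n_phi: int = 32) -> torch.Tensor:
    """Assign each point to a narrow, long radial window defined by angular bins;
    self-attention would then run among points sharing the same window id."""
    x, y, z = xyz.unbind(dim=-1)
    r = xyz.norm(dim=-1).clamp(min=1e-6)
    theta = torch.atan2(y, x)                         # azimuth in (-pi, pi]
    phi = torch.asin((z / r).clamp(-1, 1))            # elevation
    t_bin = ((theta + torch.pi) / (2 * torch.pi) * n_theta).long().clamp(max=n_theta - 1)
    p_bin = ((phi + torch.pi / 2) / torch.pi * n_phi).long().clamp(max=n_phi - 1)
    return t_bin * n_phi + p_bin

def exponential_radial_bin(r: torch.Tensor, n_bins: int = 24,
                           r_min: float = 0.5, r_max: float = 80.0) -> torch.Tensor:
    """Exponential splitting of the radial axis: fine bins near the sensor where points
    are dense, coarser bins far away, used here as a fine-grained position-encoding index."""
    t = (r.clamp(min=r_min, max=r_max).log() - math.log(r_min)) / (math.log(r_max) - math.log(r_min))
    return (t * n_bins).long().clamp(max=n_bins - 1)
```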
https://arxiv.org/abs/2303.12766
Agricultural robots have the prospect of enabling more efficient and sustainable agricultural production of food, feed, and fiber. Perception of crops and weeds is a central component of agricultural robots that aim to monitor fields and assess the plants as well as their growth stage in an automatic manner. Semantic perception mostly relies on deep learning using supervised approaches, which require time and qualified workers to label fairly large amounts of data. In this paper, we look into the problem of reducing the amount of labels without compromising the final segmentation performance. For robots operating in the field, pre-training networks in a supervised way is already a popular method to reduce the number of required labeled images. We investigate the possibility of pre-training in a self-supervised fashion using data from the target domain. To better exploit this data, we propose a set of domain-specific augmentation strategies. We evaluate our pre-training on semantic segmentation and leaf instance segmentation, two important tasks in our domain. The experimental results suggest that pre-training with domain-specific data paired with our data augmentation strategy leads to superior performance compared to commonly used pre-trainings. Furthermore, the pre-trained networks obtain performance similar to that of fully supervised training while using less labeled data.
https://arxiv.org/abs/2303.12499
There is a recent trend in the LiDAR perception field towards unifying multiple tasks in a single strong network with improved performance, as opposed to using separate networks for each task. In this paper, we introduce a new LiDAR multi-task learning paradigm based on the transformer. The proposed LiDARFormer utilizes cross-space global contextual feature information and exploits cross-task synergy to boost the performance of LiDAR perception tasks across multiple large-scale datasets and benchmarks. Our novel transformer-based framework includes a cross-space transformer module that learns attentive features between the 2D dense Bird's Eye View (BEV) and 3D sparse voxel feature maps. Additionally, we propose a transformer decoder for the segmentation task to dynamically adjust the learned features by leveraging the categorical feature representations. Furthermore, we combine the segmentation and detection features in a shared transformer decoder with cross-task attention layers to enhance and integrate the object-level and class-level features. LiDARFormer is evaluated on the large-scale nuScenes and the Waymo Open datasets for both 3D detection and semantic segmentation tasks, and it outperforms all previously published methods on both tasks. Notably, LiDARFormer achieves the state-of-the-art performance of 76.4% L2 mAPH and 74.3% NDS on the challenging Waymo and nuScenes detection benchmarks for a single model LiDAR-only method.
https://arxiv.org/abs/2303.12194
When a small number of poisoned samples are injected into the training dataset of a deep neural network, the network can be induced to exhibit malicious behavior during inference, which poses potential threats to real-world applications. While such backdoor attacks have been intensively studied in classification, they have been largely overlooked for semantic segmentation. Unlike classification, semantic segmentation aims to classify every pixel within a given image. In this work, we explore backdoor attacks on segmentation models that misclassify all pixels of a victim class by injecting a specific trigger on non-victim pixels during inference, which we dub the Influencer Backdoor Attack (IBA). IBA is expected to maintain the classification accuracy of non-victim pixels while misleading the classification of all victim pixels in every single inference. Specifically, we consider two types of IBA scenarios, i.e., 1) Free-position IBA: the trigger can be positioned freely except on pixels of the victim class, and 2) Long-distance IBA: the trigger can only be positioned somewhere far from victim pixels, given possible practical constraints. Based on the context aggregation ability of segmentation models, we propose techniques to improve IBA for both scenarios. Concretely, for free-position IBA, we propose a simple yet effective Nearest Neighbor trigger injection strategy for poisoned sample creation. For long-distance IBA, we propose a novel Pixel Random Labeling strategy. Our extensive experiments reveal that current segmentation models do suffer from backdoor attacks, and verify that our proposed techniques can further increase attack performance.
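As a concrete illustration, a poisoned training sample for the free-position variant could be built roughly as sketched below; the trigger content, target label, and nearest-neighbour placement rule are illustrative stand-ins for the paper's procedure.

```python
# Sketch: creating one poisoned sample with a nearest-neighbour trigger (illustrative).
import numpy as np
from scipy import ndimage

def poison_sample(image, label, trigger, victim_cls, target_cls):
    """image: (H, W, 3) uint8, label: (H, W) int, trigger: (h, w, 3) uint8."""
    H, W = label.shape
    h, w = trigger.shape[:2]
    victim = label == victim_cls
    if not victim.any():
        return image, label
    # Distance of every non-victim pixel to the victim region; the trigger goes on the
    # nearest non-victim pixel so that context aggregation carries its effect to the victim.
    # (For the long-distance variant, one would instead require this distance to exceed a minimum.)
    dist = ndimage.distance_transform_edt(~victim)
    dist[victim] = np.inf
    y, x = np.unravel_index(np.argmin(dist), dist.shape)
    y0, x0 = np.clip(y, 0, H - h), np.clip(x, 0, W - w)
    poisoned_img = image.copy()
    poisoned_img[y0:y0 + h, x0:x0 + w] = trigger
    poisoned_lbl = label.copy()
    poisoned_lbl[victim] = target_cls          # all victim pixels are mislabelled
    return poisoned_img, poisoned_lbl
```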
https://arxiv.org/abs/2303.12054
Seeing only a tiny part of the whole is not knowing the full circumstance. Bird's-eye-view (BEV) perception, a process of obtaining allocentric maps from egocentric views, is restricted when using a narrow Field of View (FoV) alone. In this work, mapping from 360° panoramas to BEV semantics, the 360BEV task, is established for the first time to achieve holistic representations of indoor scenes in a top-down view. Instead of relying on narrow-FoV image sequences, a panoramic image with depth information is sufficient to generate a holistic BEV semantic map. To benchmark 360BEV, we present two indoor datasets, 360BEV-Matterport and 360BEV-Stanford, both of which include egocentric panoramic images and semantic segmentation labels, as well as allocentric semantic maps. Besides delving deep into different mapping paradigms, we propose a dedicated solution for panoramic semantic mapping, namely 360Mapper. Through extensive experiments, our methods achieve 44.32% and 45.78% in mIoU on both datasets respectively, surpassing previous counterparts with gains of +7.60% and +9.70% in mIoU. Code and datasets will be available at: \url{this https URL}.
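The core mapping step can be sketched compactly: back-project an equirectangular depth map to 3D and vote per-pixel semantics into a top-down grid. The grid extent, cell size, and majority-vote rule below are assumptions for illustration, not the 360Mapper design.

```python
# Sketch: projecting an equirectangular panorama with depth into a BEV semantic grid.
import numpy as np

def panorama_to_bev(depth, semantics, grid_m=10.0, cell_m=0.05, n_classes=21):
    """depth: (H, W) metres, semantics: (H, W) int class ids (equirectangular)."""
    H, W = depth.shape
    # Pixel -> spherical angles for an equirectangular image.
    lon = (np.arange(W) / W - 0.5) * 2 * np.pi            # azimuth
    lat = (0.5 - np.arange(H) / H) * np.pi                 # elevation
    lon, lat = np.meshgrid(lon, lat)
    # Back-project to egocentric 3D; only the ground-plane coordinates matter for BEV.
    x = depth * np.cos(lat) * np.sin(lon)
    z = depth * np.cos(lat) * np.cos(lon)
    # Top-down grid: count class votes per cell, keep the majority class.
    size = int(2 * grid_m / cell_m)
    u = ((x + grid_m) / cell_m).astype(int)
    v = ((z + grid_m) / cell_m).astype(int)
    valid = ((u >= 0) & (u < size) & (v >= 0) & (v < size) & (depth > 0)
             & (semantics >= 0) & (semantics < n_classes))
    votes = np.zeros((size, size, n_classes), dtype=np.int32)
    np.add.at(votes, (v[valid], u[valid], semantics[valid]), 1)
    bev = votes.argmax(-1)
    bev[votes.sum(-1) == 0] = 255                          # unobserved cells
    return bev
```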
https://arxiv.org/abs/2303.11910
With the rapid development of pattern recognition and computer vision technologies, tasks like object detection and semantic segmentation have achieved accuracy surpassing that of human beings. Based on these solid foundations, autonomous driving is becoming an important research direction, aiming to revolutionize the future of transportation and mobility. Sensors are critical to the security and feasibility of autonomous driving, as they are how the vehicle perceives the surrounding environment. Multi-sensor fusion has become a current research hot spot because of its potential for multidimensional perception and integration. In this paper, we propose a novel feature-level multi-sensor fusion technology for end-to-end autonomous driving navigation with imitation learning. Our paper mainly focuses on fusion technologies for LiDAR and RGB information. We also provide a brand-new penalty-based imitation learning method to reinforce the model's compliance with traffic rules and unify the objective of imitation learning with the metrics of autonomous driving.
https://arxiv.org/abs/2303.11888
Existing works on open-vocabulary semantic segmentation have utilized large-scale vision-language models, such as CLIP, to leverage their exceptional open-vocabulary recognition capabilities. However, the problem of transferring these capabilities learned from image-level supervision to the pixel-level task of segmentation and addressing arbitrary unseen categories at inference makes this task challenging. To address these issues, we aim to attentively relate objects within an image to given categories by leveraging relational information among class categories and visual semantics through aggregation, while also adapting the CLIP representations to the pixel-level task. However, we observe that direct optimization of the CLIP embeddings can harm its open-vocabulary capabilities. In this regard, we propose an alternative approach to optimize the image-text similarity map, i.e. the cost map, using a novel cost aggregation-based method. Our framework, namely CAT-Seg, achieves state-of-the-art performance across all benchmarks. We provide extensive ablation studies to validate our choices. Project page: this https URL.
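The cost map itself is straightforward to sketch: a cosine-similarity volume between dense image embeddings and per-class text embeddings, which the proposed aggregation then refines. Obtaining the dense CLIP features and the prompt templates is assumed to be handled elsewhere.

```python
# Sketch: building an image-text cost map from dense visual features and class-name embeddings.
import torch
import torch.nn.functional as F

def build_cost_map(pixel_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (B, D, H, W) dense image embeddings;
    text_embeds: (K, D), one embedding per candidate class name.
    Returns a cost map of shape (B, K, H, W)."""
    pixel = F.normalize(pixel_feats, dim=1)
    text = F.normalize(text_embeds, dim=1)
    return torch.einsum('bdhw,kd->bkhw', pixel, text)  # cosine similarity per pixel and class
```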
https://arxiv.org/abs/2303.11797
Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be freely obtained from a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by an off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, making it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which is combined with practical techniques to create novel high-resolution and class-discriminative pixel-wise masks. These methods significantly reduce data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on the synthetic data of DiffuMask can achieve competitive performance compared to their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art results obtained with real data (within a 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at this https URL.
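A much-simplified sketch of the mask-extraction step is shown below: cross-attention maps collected for the class word are averaged, upsampled, and thresholded into a pixel-wise mask. The hooking of attention maps out of the diffusion model, and the refinement techniques the paper mentions, are omitted; the threshold rule is an assumption.

```python
# Sketch: turning text-to-image cross-attention maps into a binary class mask (illustrative).
import torch
import torch.nn.functional as F

def attention_to_mask(attn_maps, out_size=(512, 512), threshold=0.5):
    """attn_maps: list of (h_i, w_i) tensors, one per layer/head, for the class token."""
    acc = 0.0
    for a in attn_maps:
        a = a / (a.max() + 1e-8)                               # normalise each map
        a = F.interpolate(a[None, None], out_size, mode='bilinear', align_corners=False)
        acc = acc + a[0, 0]
    acc = acc / len(attn_maps)
    return (acc > threshold * acc.max()).float()               # binary pixel-wise mask
```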
https://arxiv.org/abs/2303.11681
Semantic segmentation is still a challenging task for parsing diverse contexts in different scenes, and a fixed classifier may not be able to handle the varying feature distributions encountered during testing. Different from the mainstream literature, where the efficacy of strong backbones and effective decoder heads has been well studied, in this paper additional contextual hints are instead exploited by learning a context-aware classifier whose content is data-conditioned, adapting decently to different latent distributions. Since only the classifier is dynamically altered, our method is model-agnostic and can be easily applied to generic segmentation models. Notably, with only negligible additional parameters and +2% inference time, decent performance gains have been achieved on both small and large models on challenging benchmarks, manifesting the substantial practical merits brought by our simple yet effective method. The implementation is available at \url{this https URL}.
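One possible reading of a data-conditioned classifier is sketched below: static class weights are adapted with class centres pooled from the current feature map before classifying the pixels. The fusion rule and layer names are assumptions, not the paper's exact design.

```python
# Sketch: a context-aware (data-conditioned) classifier head (illustrative interpretation).
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.static_cls = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W)
        coarse = self.static_cls(feat)                              # (B, K, H, W) coarse logits
        prob = coarse.softmax(dim=1)
        # Class centres: probability-weighted average of features for each class.
        centres = torch.einsum('bkhw,bchw->bkc', prob, feat)
        centres = centres / (prob.sum(dim=(2, 3)).unsqueeze(-1) + 1e-6)
        static_w = self.static_cls.weight.squeeze(-1).squeeze(-1)   # (K, C) static class weights
        dyn_w = self.fuse(torch.cat([centres, static_w.expand_as(centres)], dim=-1))
        # Per-image dynamic classification with the adapted weights.
        return torch.einsum('bkc,bchw->bkhw', dyn_w, feat)
```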
https://arxiv.org/abs/2303.11633
Novel class discovery (NCD) for semantic segmentation is the task of learning a model that can segment unlabelled (novel) classes using only the supervision from labelled (base) classes. This problem has recently been pioneered for 2D image data, but no work exists for 3D point cloud data. In fact, the assumptions made for 2D are loosely applicable to 3D in this case. This paper is presented to advance the state of the art on point cloud data analysis in four directions. Firstly, we address the new problem of NCD for point cloud semantic segmentation. Secondly, we show that the transposition of the only existing NCD method for 2D semantic segmentation to 3D data is suboptimal. Thirdly, we present a new method for NCD based on online clustering that exploits uncertainty quantification to produce prototypes for pseudo-labelling the points of the novel classes. Lastly, we introduce a new evaluation protocol to assess the performance of NCD for point cloud semantic segmentation. We thoroughly evaluate our method on SemanticKITTI and SemanticPOSS datasets, showing that it can significantly outperform the baseline. Project page at this link: this https URL.
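A simplified sketch of uncertainty-aware online clustering for pseudo-labelling novel points might look like this; the temperature, confidence threshold, and EMA prototype update are illustrative choices rather than the paper's settings.

```python
# Sketch: online clustering with uncertainty-filtered pseudo-labels for novel classes.
import torch
import torch.nn.functional as F

@torch.no_grad()
def cluster_pseudo_labels(feats, prototypes, tau=0.1, min_conf=0.9, momentum=0.99,
                          ignore_index=-1):
    """feats: (N, D) features of candidate novel-class points; prototypes: (K, D)."""
    f = F.normalize(feats, dim=1)
    p = F.normalize(prototypes, dim=1)
    sim = f @ p.t() / tau                               # (N, K) scaled cosine similarities
    prob = sim.softmax(dim=1)
    conf, assign = prob.max(dim=1)
    # Low-confidence (high-uncertainty) assignments are ignored during pseudo-labelling.
    pseudo = torch.where(conf > min_conf, assign, torch.full_like(assign, ignore_index))
    # Online prototype update from confidently assigned points only.
    new_protos = prototypes.clone()
    for k in range(prototypes.shape[0]):
        m = pseudo == k
        if m.any():
            new_protos[k] = momentum * prototypes[k] + (1 - momentum) * feats[m].mean(0)
    return pseudo, new_protos
```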
https://arxiv.org/abs/2303.11610
Semi-supervised semantic segmentation learns a model for classifying pixels into specific classes using a few labeled samples and numerous unlabeled images. The recent leading approach is consistency regularization by self-training with pseudo-labeled pixels that have high confidence on unlabeled images. However, using only high-confidence pixels for self-training may result in losing much of the information in the unlabeled datasets due to the poor confidence calibration of modern deep learning networks. In this paper, we propose a class-adaptive semi-supervision framework for semi-supervised semantic segmentation (CAFS) to cope with the loss of most information that occurs in existing high-confidence-based pseudo-labeling methods. Unlike existing semi-supervised semantic segmentation frameworks, CAFS constructs a validation set on a labeled dataset to leverage the calibration performance for each class. On this basis, we propose calibration-aware class-wise adaptive thresholding and class-wise adaptive oversampling using the analysis results from the validation set. Our proposed CAFS achieves state-of-the-art performance on the full data partition of the base PASCAL VOC 2012 dataset and on the 1/4 data partition of the Cityscapes dataset, with significant margins of 83.0% and 80.4%, respectively. The code is available at this https URL.
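The calibration-aware class-wise thresholding can be sketched roughly as below: per-class thresholds are derived from a labelled validation split and then applied when pseudo-labelling. The statistic used here (mean confidence of correctly predicted pixels) is an illustrative stand-in for the paper's analysis, and the model is assumed to return plain logits.

```python
# Sketch: class-wise adaptive pseudo-label thresholds from a validation split (illustrative).
import torch

@torch.no_grad()
def classwise_thresholds(model, val_loader, num_classes, base=0.95, device='cuda'):
    conf_sum = torch.zeros(num_classes, device=device)
    conf_cnt = torch.zeros(num_classes, device=device)
    for images, labels in val_loader:
        prob = model(images.to(device)).softmax(dim=1)          # (B, K, H, W), assumes raw logits
        conf, pred = prob.max(dim=1)
        labels = labels.to(device)
        correct = pred == labels
        for k in range(num_classes):
            m = correct & (labels == k)
            conf_sum[k] += conf[m].sum()
            conf_cnt[k] += m.sum()
    per_class = conf_sum / conf_cnt.clamp(min=1)
    # Poorly calibrated (low-confidence) classes get a lower threshold so they are not discarded.
    return (base * per_class / per_class.max()).clamp(0.5, base)

@torch.no_grad()
def pseudo_label(logits, thresholds, ignore_index=255):
    prob = logits.softmax(dim=1)
    conf, pred = prob.max(dim=1)
    pred[conf < thresholds[pred]] = ignore_index                # drop pixels below their class threshold
    return pred
```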
https://arxiv.org/abs/2303.11606
Deep Neural Networks (DNNs)-based semantic segmentation models trained on a source domain often struggle to generalize to unseen target domains, i.e., a domain gap problem. Texture often contributes to the domain gap, making DNNs vulnerable to domain shift because they are prone to be texture-biased. Existing Domain Generalized Semantic Segmentation (DGSS) methods have alleviated the domain gap problem by guiding models to prioritize shape over texture. On the other hand, shape and texture are two prominent and complementary cues in semantic segmentation. This paper argues that leveraging texture is crucial for improving performance in DGSS. Specifically, we propose a novel framework, coined Texture Learning Domain Randomization (TLDR). TLDR includes two novel losses to effectively enhance texture learning in DGSS: (1) a texture regularization loss to prevent overfitting to source domain textures by using texture features from an ImageNet pre-trained model and (2) a texture generalization loss that utilizes random style images to learn diverse texture representations in a self-supervised manner. Extensive experimental results demonstrate the superiority of the proposed TLDR; e.g., TLDR achieves 46.5 mIoU on GTA-to-Cityscapes using ResNet-50, which improves the prior state-of-the-art method by 1.9 mIoU.
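A rough sketch of a Gram-matrix texture statistic and a regularisation term that keeps the segmentation encoder close to the texture features of a frozen ImageNet-pretrained encoder is given below; this is an interpretation of the described loss, not the authors' code.

```python
# Sketch: Gram-matrix texture features and a texture regularisation loss (illustrative).
import torch
import torch.nn.functional as F

def gram(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> (B, C, C) normalised Gram matrix (texture statistics)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_regularization_loss(student_feat: torch.Tensor, frozen_feat: torch.Tensor) -> torch.Tensor:
    """Penalise divergence from the frozen ImageNet encoder's texture statistics on the same image,
    discouraging overfitting to source-domain textures."""
    return F.mse_loss(gram(student_feat), gram(frozen_feat).detach())
```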
https://arxiv.org/abs/2303.11546
We present Generative Semantic Segmentation (GSS), a generative learning approach for semantic segmentation. Uniquely, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. To that end, the segmentation mask is expressed with a special type of image (dubbed as maskige). This posterior distribution allows to generate segmentation masks unconditionally. To achieve semantic segmentation on a given image, we further introduce a conditioning network. It is optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input training images. Extensive experiments on standard benchmarks show that our GSS can perform competitively to prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting.
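The "maskige" idea, expressing a segmentation mask as an ordinary image, can be illustrated with a fixed colour palette as below; the paper learns this transformation, so the palette-based encoding and nearest-colour decoding here are only a simplified placeholder.

```python
# Sketch: encoding a segmentation mask as an image ("maskige") and decoding it back (illustrative).
import torch

def make_palette(num_classes: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, 256, (num_classes, 3), generator=g).float()  # one colour per class

def mask_to_maskige(mask: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """mask: (H, W) int class ids -> (3, H, W) float image in [0, 255]."""
    return palette[mask].permute(2, 0, 1)

def maskige_to_mask(maskige: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """Nearest-palette-colour decoding of a generated maskige back to class ids."""
    pix = maskige.permute(1, 2, 0).reshape(-1, 3)            # (H*W, 3)
    d = torch.cdist(pix, palette)                            # distance to each class colour
    return d.argmin(dim=1).reshape(maskige.shape[1], maskige.shape[2])
```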
https://arxiv.org/abs/2303.11316
Motivated by the increasing popularity of transformers in computer vision, in recent times there has been a rapid development of novel architectures. While in-domain performance follows a constant, upward trend, properties like robustness or uncertainty estimation are less explored, leaving doubts about advances in model reliability. Studies along these axes exist, but they are mainly limited to classification models. In contrast, we carry out a study on semantic segmentation, a relevant task for many real-world applications where model reliability is paramount. We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers, and assess their reliability based on four metrics: robustness, calibration, misclassification detection and out-of-distribution (OOD) detection. We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. We further explore methods that can come to the rescue and show that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study on modern segmentation models focused on both robustness and uncertainty estimation and we hope it will help practitioners and researchers interested in this fundamental vision task. Code available at this https URL.
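Of the four reliability metrics, calibration is the easiest to make concrete; a small sketch of per-pixel expected calibration error (ECE) follows, with the binning scheme and ignore-label handling chosen for illustration.

```python
# Sketch: per-pixel expected calibration error for a segmentation model (illustrative).
import torch

@torch.no_grad()
def pixel_ece(logits, labels, n_bins=15, ignore_index=255):
    """logits: (B, K, H, W), labels: (B, H, W)."""
    prob = logits.softmax(dim=1)
    conf, pred = prob.max(dim=1)
    valid = labels != ignore_index
    conf, pred, labels = conf[valid], pred[valid], labels[valid]
    acc = (pred == labels).float()
    ece = torch.zeros((), device=logits.device)
    edges = torch.linspace(0, 1, n_bins + 1, device=logits.device)
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            # Weighted gap between mean accuracy and mean confidence in this bin.
            ece += m.float().mean() * (acc[m].mean() - conf[m].mean()).abs()
    return ece
```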
https://arxiv.org/abs/2303.11298
Whilst the availability of 3D LiDAR point cloud data has significantly grown in recent years, annotation remains expensive and time-consuming, leading to a demand for semi-supervised semantic segmentation methods with application domains such as autonomous driving. Existing work very often employs relatively large segmentation backbone networks to improve segmentation accuracy, at the expense of computational costs. In addition, many use uniform sampling to reduce the amount of ground truth data required for learning, often resulting in sub-optimal performance. To address these issues, we propose a new pipeline that employs a smaller architecture, requiring fewer ground-truth annotations to achieve superior segmentation accuracy compared to contemporary approaches. This is facilitated via a novel Sparse Depthwise Separable Convolution module that significantly reduces the network parameter count while retaining overall task performance. To effectively sub-sample our training data, we propose a new Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method that leverages knowledge of sensor motion within the environment to extract a more diverse subset of training data frame samples. To leverage the use of limited annotated data samples, we further propose a soft pseudo-label method informed by LiDAR reflectivity. Our method outperforms contemporary semi-supervised work in terms of mIoU, using less labeled data, on the SemanticKITTI (59.5@5%) and ScribbleKITTI (58.1@5%) benchmark datasets, based on a 2.3x reduction in model parameters and 641x fewer multiply-add operations, whilst also demonstrating significant performance improvement on limited training data (i.e., Less is More).
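The sparse module depends on a sparse-convolution backend, but the depthwise-separable idea behind the parameter reduction can be sketched densely: a per-channel (depthwise) convolution followed by a 1x1 pointwise convolution. This is a generic dense stand-in, not the authors' sparse module.

```python
# Sketch: dense depthwise-separable 3D convolution, the building block the sparse module adapts.
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                                   groups=in_ch, bias=False)   # one spatial filter per channel
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)  # channel mixing
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```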
https://arxiv.org/abs/2303.11203
Despite the success of vision transformers (ViTs), they still suffer from significant drops in accuracy in the presence of common corruptions, such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on a few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, often leading to highly diverging attention patterns. In this paper, we intend to alleviate this overfocusing issue and make attention more stable through two general techniques: First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns average pooling schemes for each token such that the information of potentially important tokens in the neighborhood can adaptively be taken into account. Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few by using our Attention Diversification Loss (ADL). We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, we improve corruption robustness on ImageNet-C by 2.4% while simultaneously improving accuracy by 0.4% based on state-of-the-art robust architecture FAN. Also, when finetuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4% and ACDC by 3.1%.
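The diversification loss is simple to sketch: penalise high cosine similarity between the attention distributions of different query tokens. How the loss is reduced over heads and layers is an assumption here.

```python
# Sketch: an attention-diversification penalty over pairwise attention-vector similarity.
import torch
import torch.nn.functional as F

def attention_diversification_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, heads, N, N) attention weights (each row sums to 1 over keys)."""
    b, h, n, _ = attn.shape
    a = F.normalize(attn.reshape(b * h, n, -1), dim=-1)
    sim = a @ a.transpose(1, 2)                      # (B*h, N, N) pairwise cosine similarity
    off_diag = sim - torch.eye(n, device=attn.device)  # remove each token's self-similarity
    return off_diag.clamp(min=0).mean()              # penalise only positive similarity
```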
https://arxiv.org/abs/2303.11126
Robust semantic segmentation of intraoperative image data could pave the way for automatic surgical scene understanding and autonomous robotic surgery. Geometric domain shifts, however, although common in real-world open surgeries due to variations in surgical procedures or situs occlusions, remain a topic largely unaddressed in the field. To address this gap in the literature, we (1) present the first analysis of state-of-the-art (SOA) semantic segmentation networks in the presence of geometric out-of-distribution (OOD) data, and (2) address generalizability with a dedicated augmentation technique termed "Organ Transplantation" that we adapted from the general computer vision community. According to a comprehensive validation on six different OOD data sets comprising 600 RGB and hyperspectral imaging (HSI) cubes from 33 pigs semantically annotated with 19 classes, we demonstrate a large performance drop of SOA organ segmentation networks applied to geometric OOD data. Surprisingly, this holds true not only for conventional RGB data (drop of Dice similarity coefficient (DSC) by 46 %) but also for HSI data (drop by 45 %), despite the latter's rich information content per pixel. Using our augmentation scheme improves on the SOA DSC by up to 67 % (RGB) and 90 % (HSI) and renders performance on par with in-distribution performance on real OOD test data. The simplicity and effectiveness of our augmentation scheme makes it a valuable network-independent tool for addressing geometric domain shifts in semantic scene segmentation of intraoperative data. Our code and pre-trained models will be made publicly available.
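The augmentation itself can be sketched as a copy-paste operation on organ pixels; blending and placement are kept trivial in this illustration and are not the paper's exact recipe.

```python
# Sketch: "organ transplantation" style copy-paste augmentation between two annotated images.
import numpy as np

def transplant_organ(recipient_img, recipient_lbl, donor_img, donor_lbl, organ_cls):
    """Images: (H, W, C) arrays; labels: (H, W) int arrays of identical size."""
    mask = donor_lbl == organ_cls
    if not mask.any():
        return recipient_img, recipient_lbl
    out_img = recipient_img.copy()
    out_lbl = recipient_lbl.copy()
    out_img[mask] = donor_img[mask]       # paste donor organ pixels at their original location
    out_lbl[mask] = organ_cls             # update the recipient labels accordingly
    return out_img, out_lbl
```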
https://arxiv.org/abs/2303.10972