Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged that generates a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two striking experimental observations about the object localization learning process: for a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero while the foreground mask covers only part of the object region, and 2) the activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve better localization, we argue for using the activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of the generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and an area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.
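To make the training signal concrete, here is a minimal sketch of a BAS-style objective under stated assumptions: a generator outputs a foreground map, a classifier yields a class activation map and image-level logits, and the function name and weighting terms are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bas_style_loss(fg_mask, cls_activation, cls_logits, labels,
                   bas_weight=1.0, area_weight=1.0):
    """fg_mask: (B, 1, H, W) foreground prediction map in [0, 1] from the generator.
    cls_activation: (B, 1, H, W) activation map of the ground-truth class.
    cls_logits: (B, C) image-level logits for the foreground-region-guidance term."""
    bg_activation = (cls_activation * (1.0 - fg_mask)).sum(dim=(1, 2, 3))
    total_activation = cls_activation.sum(dim=(1, 2, 3)).clamp(min=1e-6)
    loss_bas = (bg_activation / total_activation).mean()  # suppress background activation
    loss_area = fg_mask.mean()                            # area constraint keeps the mask compact
    loss_guide = F.cross_entropy(cls_logits, labels)      # foreground region guidance
    return loss_guide + bas_weight * loss_bas + area_weight * loss_area
```

Minimizing the ratio of background to total activation drives the mask outward over the object, while the area term prevents it from covering the whole image.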
https://arxiv.org/abs/2309.12943
Coral reefs are among the most diverse ecosystems on our planet, and hundreds of millions of people depend on them. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying the deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying a 3D mapping system that uses machine learning to adapt to challenging underwater conditions with a modern approach for semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing costs. This can help to inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast, low-cost mapping of underwater environments other than coral reefs.
https://arxiv.org/abs/2309.12804
To alleviate expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundance of unlabeled images to predict a pixel-level label map of the same size. Previous methods often adopt co-training with two convolutional networks that share the same architecture but differ in initialization, which fails to capture sufficiently diverse features. This motivates us to use tri-training and develop a triple-view encoder that utilizes encoders with different architectures to derive diverse features, and to exploit knowledge distillation to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that require a large memory cost. This inspires us to devise a dual-frequency decoder that selects the important features by projecting the features from the spatial domain to the frequency domain, where a dual-frequency channel attention mechanism is introduced to model feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, \ie, Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.
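As an illustration of the frequency-domain channel-selection idea, the following is a hedged sketch of a dual-frequency channel attention block; the two descriptors (DC component and mean spectral magnitude) and the layer sizes are assumptions, not the TriKD implementation.

```python
import torch
import torch.nn as nn

class DualFrequencyChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")            # project features into the frequency domain
        mag = freq.abs()
        dc = mag[..., 0, 0]                                # lowest-frequency (DC) descriptor per channel
        ac = mag.mean(dim=(2, 3))                          # mean spectral magnitude per channel
        weights = self.mlp(torch.cat([dc, ac], dim=1))     # channel importance from both descriptors
        return x * weights[:, :, None, None]               # keep informative channels, damp redundant ones
```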
https://arxiv.org/abs/2309.12557
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various ViT structures, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to effectively learn visual features. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of both CNNs and ViTs. DualToken-ViT effectively fuses tokens carrying local information obtained by a convolution-based structure with tokens carrying global information obtained by a self-attention-based structure to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better suited for vision tasks. We conducted extensive experiments on image classification, object detection, and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which uses global tokens, by 0.7%.
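Below is a hedged sketch of fusing a convolutional (local) branch with a self-attention (global) branch in the spirit of DualToken-ViT; the layer choices and module names are illustrative assumptions, and the position-aware global tokens are omitted.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Sequential(                          # local information via depthwise + pointwise conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global information via self-attention
        self.fuse = nn.Conv2d(2 * dim, dim, 1)               # fuse the local and global token streams

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        local_tok = self.local(x)
        seq = x.flatten(2).transpose(1, 2)                   # (B, H*W, C) tokens for attention
        global_tok, _ = self.attn(seq, seq, seq)
        global_tok = global_tok.transpose(1, 2).reshape(B, C, H, W)
        return self.fuse(torch.cat([local_tok, global_tok], dim=1))
```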
https://arxiv.org/abs/2309.12424
Traditionally, training neural networks to perform semantic segmentation required expensive human-made annotations. More recently, however, advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation, spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene, and (2) implementing farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on the depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
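For reference, a minimal farthest-point sampling routine over back-projected depth pixels might look like the sketch below; it illustrates the standard FPS algorithm only, not this paper's exact sampling pipeline.

```python
import torch

def farthest_point_sampling(points, k):
    """points: (N, 3) 3D coordinates, e.g. pixels back-projected with the depth map; returns k indices."""
    n = points.shape[0]
    selected = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    selected[0] = torch.randint(n, (1,))                     # arbitrary seed point
    for i in range(1, k):
        d = torch.norm(points - points[selected[i - 1]], dim=1)
        dist = torch.minimum(dist, d)                        # distance to the nearest already-selected point
        selected[i] = torch.argmax(dist)                     # pick the point farthest from the selection
    return selected
```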
https://arxiv.org/abs/2309.12378
Multi-modal unsupervised domain adaptation (MM-UDA) for 3D semantic segmentation is a practical solution for embedding semantic understanding in autonomous systems without expensive point-wise annotations. While previous MM-UDA methods can achieve overall improvement, they suffer from significantly class-imbalanced performance, restricting their adoption in real applications. This imbalanced performance is mainly caused by: 1) self-training with imbalanced data and 2) the lack of pixel-wise 2D supervision signals. In this work, we propose Multi-modal Prior Aided (MoPA) domain adaptation to improve the performance on rare objects. Specifically, we develop Valid Ground-based Insertion (VGI) to rectify the imbalanced supervision signals by inserting rare objects collected from the wild while avoiding the introduction of artificial artifacts that lead to trivial solutions. Meanwhile, our SAM consistency loss leverages the 2D prior semantic masks from SAM as pixel-wise supervision signals to encourage consistent predictions for each object in the semantic mask. The knowledge learned from modality-specific priors is then shared across modalities to achieve better rare-object segmentation. Extensive experiments show that our method achieves state-of-the-art performance on the challenging MM-UDA benchmark. Code will be available at this https URL.
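One plausible reading of "consistent predictions within each SAM mask" is sketched below; this is an illustrative interpretation, not the MoPA loss, and the mask-averaging formulation is an assumption.

```python
import torch

def sam_consistency_loss(logits, sam_masks):
    """logits: (B, C, H, W) 2D segmentation logits; sam_masks: (B, M, H, W) binary SAM masks.
    Pulls predictions inside each mask towards the mask-averaged prediction (illustrative reading)."""
    prob = logits.softmax(dim=1)
    loss = 0.0
    for m in range(sam_masks.shape[1]):
        mask = sam_masks[:, m:m + 1].float()                             # (B, 1, H, W)
        area = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean_prob = (prob * mask).sum(dim=(2, 3), keepdim=True) / area   # per-mask mean prediction
        loss = loss + ((prob - mean_prob).abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return loss / sam_masks.shape[1]
```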
https://arxiv.org/abs/2309.11839
Recently, multi-modality models have been introduced because of the complementary information from different sensors such as LiDAR and cameras. They require paired data along with precise calibration for all modalities; the complicated calibration among modalities hugely increases the cost of collecting such high-quality datasets and hinders them from being applied to practical scenarios. Building on previous works, we not only fuse multi-modality information without the above issues but also fully exploit the information in the RGB modality. We introduce 2D Detection Annotations Transmittable Aggregation (\textbf{2DDATA}), which designs a data-specific branch, called the \textbf{Local Object Branch}, that handles points inside a given bounding box, exploiting the ease of acquiring 2D bounding box annotations. We demonstrate that our simple design can transmit bounding box prior information to the 3D encoder model, proving the feasibility of large multi-modality models fused with modality-specific data.
https://arxiv.org/abs/2309.11755
Unsupervised domain adaptation (UDA) is an effective approach for handling the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting in which the target domain contains sequential frames of unlabeled videos that are easy to collect in practice. A recent study suggests self-supervised learning of object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA) that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment of the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training, which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA over existing approaches in domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.
https://arxiv.org/abs/2309.11711
Few-shot point cloud semantic segmentation aims to train a model that quickly adapts to new unseen classes with only a handful of support-set samples. However, the noise-free assumption on the support set can easily be violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets at testing time. To this end, we first propose Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separate the clean samples of the target classes from the noisy samples. Leveraging the well-separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments under various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves the performance. Our code is available at this https URL.
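As a hedged sketch of how a degree-based criterion could be instantiated (an assumption about the idea, not the MDNS implementation), one can build a similarity graph over the support-shot embeddings of a class and flag low-degree shots as likely noise:

```python
import torch
import torch.nn.functional as F

def degree_noise_scores(shot_embeddings, sim_threshold=0.7):
    """shot_embeddings: (K, D), one embedding per support shot of a class.
    Low-degree shots disagree with most of their peers and are flagged as candidate noise."""
    feats = F.normalize(shot_embeddings, dim=1)
    sim = feats @ feats.t()                                  # cosine similarity between shots
    adj = (sim > sim_threshold).float()
    adj = adj - torch.eye(len(feats), device=feats.device)   # drop self-loops
    degree = adj.clamp(min=0).sum(dim=1)                     # number of agreeing neighbours
    return degree
```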
https://arxiv.org/abs/2309.11228
Quantization of deep neural networks (DNNs) has become a key element in the effort to embed such networks on end-user devices. However, current quantization methods usually suffer from costly accuracy degradation. In this paper, we propose a new method for Enhanced Post-Training Quantization named EPTQ. The method is based on knowledge distillation with an adaptive weighting of layers. In addition, we introduce a new label-free technique for approximating the Hessian trace of the task loss, named Label-Free Hessian. This technique removes the requirement of a labeled dataset for computing the Hessian. The adaptive knowledge distillation uses the Label-Free Hessian technique to give greater attention to the sensitive parts of the model while performing the optimization. Empirically, by employing EPTQ we achieve state-of-the-art results on a wide variety of models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC semantic segmentation. We demonstrate the performance and compatibility of EPTQ on an extended set of architectures, including CNNs, Transformers, hybrid, and MLP-only models.
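The Label-Free Hessian derivation itself is specific to EPTQ; as background, the sketch below shows the standard Hutchinson trace estimator on which such sensitivity measures are commonly built. The label-free aspect would come from choosing `loss` so that no ground truth is needed (e.g. a distillation loss between quantized and float outputs); that usage is an assumption, not the paper's exact method.

```python
import torch

def hutchinson_hessian_trace(loss, params, n_samples=10):
    """Estimate tr(H) of `loss` w.r.t. `params` with Rademacher probes (Hutchinson's estimator)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    for _ in range(n_samples):
        vs = [torch.randint(0, 2, g.shape, device=g.device).float() * 2 - 1 for g in grads]
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)  # Hessian-vector product
        trace = trace + sum((v * h).sum() for v, h in zip(vs, hv))
    return trace / n_samples
```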
https://arxiv.org/abs/2309.11531
Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with this ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies across multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
https://arxiv.org/abs/2309.11081
In this paper, we present CaveSeg - the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g. caveline, arrows), obstacles (e.g. ground plane and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in the USA, Mexico, and Spain, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.
https://arxiv.org/abs/2309.11038
The Transformer first appeared in the field of natural language processing and was later migrated to the computer vision domain, where it demonstrates excellent performance in vision tasks. Recently, however, Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's idea to vision can also bring outstanding performance to vision tasks. To address this, we combine RetNet and Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, bringing prior knowledge related to spatial distances to the vision model. This distance-related spatial prior allows explicit control of the range of tokens that each token can attend to. Additionally, to reduce the computational cost of global modeling, we decompose this modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that RMT exhibits exceptional performance across various computer vision tasks. For example, RMT achieves 84.1% Top1-acc on ImageNet-1k using merely 4.5G FLOPs. To the best of our knowledge, among all models, RMT achieves the highest Top1-acc when models are of similar size and trained with the same strategy. Moreover, RMT significantly outperforms existing vision backbones in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
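To illustrate the idea of a distance-related spatial prior, here is a hedged sketch of a Manhattan-distance decay bias added to attention logits; the exact decay form and its axis-wise decomposition in RMT may differ from this illustration.

```python
import torch

def spatial_decay_bias(h, w, gamma=0.9):
    """Additive attention bias: log(gamma) * Manhattan distance between every pair of token positions."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()   # (H*W, 2) token coordinates
    manhattan = torch.cdist(coords, coords, p=1)                        # pairwise spatial distance
    return manhattan * torch.log(torch.tensor(gamma))                   # farther tokens get lower logits

# usage sketch: attn_logits = q @ k.transpose(-2, -1) / d ** 0.5 + spatial_decay_bias(h, w)
```

Adding log(gamma) times the distance before the softmax is equivalent to scaling each unnormalized attention weight by gamma raised to that distance, so far-away tokens are exponentially down-weighted.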
https://arxiv.org/abs/2309.11523
This paper presents a fully unsupervised deep change detection approach for mobile robots with 3D LiDAR. In unstructured environments, it is infeasible to define a closed set of semantic classes. Instead, semantic segmentation is reformulated as binary change detection. We develop a neural network, RangeNetCD, that uses an existing point-cloud map and a live LiDAR scan to detect scene changes with respect to the map. Using a novel loss function, existing point-cloud semantic segmentation networks can be trained to perform change detection without any labels or assumptions about local semantics. We demonstrate the performance of this approach on data from challenging terrains; mean intersection over union (mIoU) scores range between 67.4% and 82.2%, depending on the amount of environmental structure. This outperforms the geometric baseline used in all experiments. The neural network runs at more than 10 Hz and is integrated into a robot's autonomy stack to allow safe navigation around obstacles that intersect the planned path. In addition, a novel method for the rapid automated acquisition of per-point ground-truth labels is described: covering changed parts of the scene with retroreflective materials and applying a threshold filter to the intensity channel of the LiDAR allows for quantitative evaluation of the change detector.
https://arxiv.org/abs/2309.10924
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks by leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at this http URL.
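A minimal sketch of pairing a frozen DINOv2 backbone with a lightweight semantic head is shown below; the torch.hub entry point and the `forward_features` output key reflect the public DINOv2 repository as I recall it and should be verified, and the head design, label-set size, and upsampling-free output are assumptions rather than the SPINO architecture.

```python
import torch
import torch.nn as nn

num_classes = 19                                   # illustrative label-set size (assumption)
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # frozen ViT-S/14 features
for p in backbone.parameters():
    p.requires_grad_(False)

semantic_head = nn.Sequential(                     # lightweight per-patch head (assumption)
    nn.Conv2d(384, 256, 1), nn.ReLU(inplace=True), nn.Conv2d(256, num_classes, 1))

def predict(image):                                # image: (B, 3, H, W) with H, W divisible by 14
    feats = backbone.forward_features(image)["x_norm_patchtokens"]     # (B, N, 384) patch tokens
    B, N, C = feats.shape
    h = w = int(N ** 0.5)                          # assumes a square input image
    feats = feats.transpose(1, 2).reshape(B, C, h, w)
    return semantic_head(feats)                    # coarse per-patch semantic logits
```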
https://arxiv.org/abs/2309.10726
Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotation. A natural option is to explore unsupervised methodologies for 3D perception tasks. However, such methods often suffer from substantial performance drops. Fortunately, we found that there exist large amounts of image-based datasets, so an alternative can be proposed, i.e., transferring the knowledge in 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.
https://arxiv.org/abs/2309.10649
Machine-learning models can be fooled by adversarial examples, i.e., carefully-crafted input perturbations that force models to output wrong predictions. While uncertainty quantification has recently been proposed to detect adversarial inputs, under the assumption that such attacks exhibit a higher prediction uncertainty than pristine data, it has been shown that adaptive attacks specifically aimed at also reducing the uncertainty estimate can easily bypass this defense mechanism. In this work, we focus on a different adversarial scenario in which the attacker is still interested in manipulating the uncertainty estimate, but regardless of the correctness of the prediction; in particular, the goal is to undermine the use of machine-learning models when their outputs are consumed by a downstream module or by a human operator. Following this direction, we: \textit{(i)} design a threat model for attacks targeting uncertainty quantification; \textit{(ii)} devise different attack strategies on conceptually different UQ techniques spanning both classification and semantic segmentation problems; \textit{(iii)} conduct a first complete and extensive analysis to compare the differences between some of the most widely employed UQ approaches under attack. Our extensive experimental analysis shows that our attacks are more effective in manipulating uncertainty quantification measures than attacks aimed at also inducing misclassifications.
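A generic illustration of this threat model is a PGD-style attack on the uncertainty proxy alone, sketched below; it targets predictive entropy and is an assumption about one possible instantiation, not the paper's exact attack strategies.

```python
import torch

def entropy_attack(model, x, eps=8 / 255, alpha=2 / 255, steps=10, minimize=True):
    """PGD on predictive entropy only: minimize it to fake confidence, or maximize it to inflate doubt."""
    x_adv = x.clone().detach()
    direction = -1.0 if minimize else 1.0
    for _ in range(steps):
        x_adv.requires_grad_(True)
        prob = model(x_adv).softmax(dim=1)
        entropy = -(prob * prob.clamp(min=1e-12).log()).sum(dim=1).mean()
        grad = torch.autograd.grad(entropy, x_adv)[0]
        x_adv = x_adv.detach() + direction * alpha * grad.sign()       # step on the uncertainty proxy only
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)     # project into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```

Because the loss never references the true label, the prediction may stay correct while its reported uncertainty is driven wherever the attacker wants.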
https://arxiv.org/abs/2309.10586
Annotating 3D LiDAR point clouds for perception tasks such as 3D object detection and LiDAR semantic segmentation is notoriously time- and energy-consuming. To alleviate the labeling burden, it is promising to perform large-scale pre-training and fine-tune the pre-trained backbone on different downstream datasets and tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, and demonstrate its effectiveness on various public datasets with different downstream tasks under the label-efficiency setting. Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on a variety of datasets and tasks. (2) SPOT uses a beam re-sampling technique for point cloud augmentation and applies class-balancing strategies to overcome the domain gaps brought by the various LiDAR sensors and annotation strategies in different datasets. (3) Scalable pre-training is observed, that is, downstream performance across all experiments improves with more pre-training data. We believe that our findings can facilitate the understanding of LiDAR point clouds and pave the way for future exploration in LiDAR pre-training. Code and models will be released.
https://arxiv.org/abs/2309.10527
Semantic segmentation is an essential technology for self-driving cars to comprehend their surroundings. Currently, real-time semantic segmentation networks commonly employ either an encoder-decoder architecture or a two-pathway architecture. Generally speaking, encoder-decoder models tend to be quicker, whereas two-pathway models exhibit higher accuracy. To leverage both strengths, we present the Spatial-Assistant Encoder-Decoder Network (SANet) to fuse the two architectures. In the overall architecture, we uphold the encoder-decoder design while maintaining the feature maps in the middle section of the encoder and utilizing atrous convolution branches for same-resolution feature extraction. Toward the end of the encoder, we integrate the asymmetric pooling pyramid pooling module (APPPM) to optimize the semantic extraction of the feature maps. This module incorporates asymmetric pooling layers that extract features at multiple resolutions. In the decoder, we present a hybrid attention module, SAD, that integrates horizontal and vertical attention to facilitate the combination of the various branches. To ascertain the effectiveness of our approach, our SANet model achieved competitive results on the real-time CamVid and Cityscapes benchmarks. Using a single 2080Ti GPU, SANet achieved 78.4% mIOU at 65.1 FPS on the Cityscapes test set and 78.8% mIOU at 147 FPS on the CamVid test set. The training code and model for SANet are available at this https URL.
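To illustrate the asymmetric-pooling idea, the sketch below pools features with non-square windows at several resolutions and fuses them; the window sizes, 1x1 convolutions, and fusion scheme are assumptions, not the APPPM specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricPoolingPyramid(nn.Module):
    """Pools features with non-square windows at several resolutions and fuses the results."""
    def __init__(self, channels, sizes=((1, 2), (2, 4), (4, 8))):
        super().__init__()
        self.sizes = sizes
        self.convs = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in sizes])
        self.project = nn.Conv2d(channels * (len(sizes) + 1), channels, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        h, w = x.shape[-2:]
        branches = [x]
        for (ph, pw), conv in zip(self.sizes, self.convs):
            pooled = F.adaptive_avg_pool2d(x, (ph, pw))      # asymmetric (non-square) pooling window
            branches.append(F.interpolate(conv(pooled), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return self.project(torch.cat(branches, dim=1))      # fuse multi-resolution context
```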
https://arxiv.org/abs/2309.10519
Catastrophic forgetting of previous knowledge is a critical issue in continual learning, typically handled through various regularization strategies. However, existing methods struggle especially when several incremental steps are performed. In this paper, we extend our previous approach (RECALL) and tackle forgetting by exploiting unsupervised web-crawled data to retrieve examples of old classes from online databases. Differently from the original approach, which did not perform any evaluation of the web data, here we introduce two novel approaches based on adversarial learning and adaptive thresholding to select from the web data only samples strongly resembling the statistics of the no longer available training ones. Furthermore, we improve the pseudo-labeling scheme to achieve more accurate labeling of the web data that also considers the classes being learned in the current step. Experimental results show that this enhanced approach achieves remarkable results, especially when multiple incremental learning steps are performed.
https://arxiv.org/abs/2309.10479