Task-specific data-fusion networks have achieved considerable success in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance significantly deteriorates when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where ``X'' represents additional types/modalities of data, such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65\% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.
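As a rough illustration of the dual-branch fusion idea, the PyTorch sketch below runs a global branch (multi-head self-attention over spatial tokens) in parallel with a local convolutional branch and sums the results. The module name, layer sizes, and single-scale setting are illustrative assumptions, not the released RoadFormer+ implementation.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative fusion of two heterogeneous feature maps (e.g. RGB and X).

    A global branch (multi-head self-attention over spatial tokens) runs in
    parallel with a local branch (depth-wise convolution); their outputs are
    summed. Hyper-parameters are placeholders, not the published design.
    """
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)  # merge RGB + X
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        f = self.proj(torch.cat([rgb, x], dim=1))           # B, C, H, W
        b, c, h, w = f.shape
        tokens = self.norm(f.flatten(2).transpose(1, 2))    # B, H*W, C
        global_feat, _ = self.attn(tokens, tokens, tokens)  # global branch
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        return global_feat + self.local(f)                  # fuse global + local context

if __name__ == "__main__":
    fuse = DualBranchFusion(channels=64)
    rgb = torch.randn(1, 64, 32, 32)
    normal = torch.randn(1, 64, 32, 32)   # surface-normal features play the role of "X"
    print(fuse(rgb, normal).shape)        # torch.Size([1, 64, 32, 32])
```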
https://arxiv.org/abs/2407.21631
Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level, or scene-level representations, thus inevitably neglecting the transferability of representations across other granularity levels. In this paper, we aim to learn multi-grained representations that can effectively describe an image at various granularity levels, thereby improving generalization on a wide range of downstream tasks. To this end, we propose a novel Multi-Grained Contrast (MGC) method for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast over these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation, and keypoint detection. Moreover, experimental results support the data efficiency and excellent representation transferability of our method. The source code and trained weights are available at \url{this https URL}.
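The multi-grained contrast can be pictured with the toy PyTorch loss below: two augmented views are pooled to several grid granularities and an InfoNCE loss is applied per granularity. Treating the co-located grid cell as the positive is a simplification of the paper's delicately constructed correspondences, and all sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_grained_contrast(feat_a, feat_b, grids=(1, 2, 4), tau=0.2):
    """Toy multi-grained InfoNCE between two augmented views.

    feat_a, feat_b: (B, C, H, W) feature maps of the same images under two views.
    For each granularity, the maps are average-pooled to a g x g grid and the
    cell at the same location in the other view is taken as the positive
    (a simplification of the learned correspondences). Returns the summed loss.
    """
    loss = 0.0
    for g in grids:
        za = F.adaptive_avg_pool2d(feat_a, g).flatten(2).transpose(1, 2)  # B, g*g, C
        zb = F.adaptive_avg_pool2d(feat_b, g).flatten(2).transpose(1, 2)
        za = F.normalize(za.reshape(-1, za.shape[-1]), dim=1)             # B*g*g, C
        zb = F.normalize(zb.reshape(-1, zb.shape[-1]), dim=1)
        logits = za @ zb.t() / tau                                        # cosine similarities
        targets = torch.arange(logits.shape[0], device=logits.device)
        loss = loss + F.cross_entropy(logits, targets)                    # matching cell = positive
    return loss

if __name__ == "__main__":
    a, b = torch.randn(2, 128, 16, 16), torch.randn(2, 128, 16, 16)
    print(multi_grained_contrast(a, b).item())
```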
https://arxiv.org/abs/2407.02014
The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding by benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on the challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details the research work that won 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results on all metrics, including Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge as ranked by the mIoU (mean intersection over union) metric and 1st place as ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of the giant foundational vision transformer model (DINOv2 ViT-g) and the proven multi-stage Decoupled Video Instance Segmentation (DVIS) framework for video understanding.
https://arxiv.org/abs/2406.05352
Radar sensors are low-cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks, only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.
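A minimal sketch of the embedding-space matching step, under the assumption that each radar spectrum is paired with a camera frame whose embedding comes from a frozen VLM image tower; the encoder architecture, loss, and temperature below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectraEncoder(nn.Module):
    """Tiny CNN mapping a radar spectrum to a D-dim embedding.
    The architecture is an illustrative placeholder, not the paper's model."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, spectrum):
        return F.normalize(self.net(spectrum), dim=-1)

def alignment_loss(spectra_emb, vlm_image_emb, tau=0.07):
    """Contrastive alignment: paired (spectrum, camera image) embeddings should
    match, so the frozen VLM's text queries transfer to spectra."""
    vlm_image_emb = F.normalize(vlm_image_emb, dim=-1)
    logits = spectra_emb @ vlm_image_emb.t() / tau
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    enc = SpectraEncoder()
    spectra = torch.randn(8, 1, 128, 128)   # batch of radar spectra
    img_emb = torch.randn(8, 512)           # embeddings from a frozen VLM image tower
    loss = alignment_loss(enc(spectra), img_emb)
    loss.backward()
    print(float(loss))
```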
https://arxiv.org/abs/2406.02158
Pixel-level scene understanding is one of the fundamental problems in computer vision, aiming to recognize object classes, masks, and semantics for each pixel in a given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of predictions, because the real world is actually video-based rather than static. In this paper, we adopt a semi-supervised video semantic segmentation method based on unreliable pseudo labels. We then ensemble the teacher network model with the student network model to generate pseudo labels and retrain the student network. Our method achieves mIoU scores of 63.71% and 67.83% on the development test and final test, respectively. Finally, we obtain 1st place in the Video Scene Parsing in the Wild Challenge at CVPR 2024.
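A minimal sketch of the teacher-student pseudo-labelling loop, assuming an EMA teacher and simple confidence filtering; the paper's specific handling of unreliable pseudo labels and its exact ensembling are not reproduced.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher, images, conf_thresh=0.9, ignore_index=255):
    """Confidence-filtered pseudo labels from the teacher network; pixels below
    the threshold are marked `ignore_index`. The paper additionally exploits
    those unreliable pixels, which is omitted in this sketch."""
    probs = torch.softmax(teacher(images), dim=1)        # B, K, H, W
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = ignore_index
    return labels

def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average teacher, one common way to realize the
    teacher/student ensemble used for pseudo-labelling."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data.mul_(momentum).add_(ps.data, alpha=1.0 - momentum)

if __name__ == "__main__":
    student = torch.nn.Conv2d(3, 19, 1)                  # stand-in segmentation network
    teacher = copy.deepcopy(student)
    frames = torch.randn(2, 3, 64, 64)                   # unlabeled video frames
    pseudo = make_pseudo_labels(teacher, frames, conf_thresh=0.05)  # low threshold only for this random demo
    loss = F.cross_entropy(student(frames), pseudo, ignore_index=255)
    loss.backward()
    ema_update(teacher, student)
    print(float(loss))
```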
https://arxiv.org/abs/2406.00587
Advancements in machine learning, computer vision, and robotics have paved the way for transformative solutions in various domains, particularly in agriculture. For example, the accurate identification and segmentation of fruits from field images play a crucial role in automating tasks such as harvesting, disease detection, and yield estimation. However, achieving robust and precise in-field fruit segmentation remains challenging, since large amounts of labeled data are required to handle variations in fruit size, shape, color, and occlusion. In this paper, we develop a few-shot semantic segmentation framework for in-field fruits using transfer learning. Concretely, our work is aimed at agricultural domains that lack publicly available labeled data. Motivated by similar successes in urban scene parsing, we propose specialized pre-training on a public benchmark dataset for fruit transfer learning. By leveraging pre-trained neural networks, accurate semantic segmentation of fruit in the field is achieved with only a few labeled images. Furthermore, we show that models with such pre-training learn to distinguish between fruit still on the trees and fruit that has fallen to the ground, and that they can effectively transfer this knowledge to the target fruit dataset.
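The transfer-learning recipe can be sketched as follows: load a segmentation network pre-trained on a public benchmark, freeze its backbone, and fine-tune a new two-class head on a handful of labeled fruit images. The backbone (torchvision's DeepLabv3-ResNet50) and class count are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

# Assumed setup: a segmentation model pre-trained on a public benchmark is
# adapted to a small in-field fruit set by freezing the backbone and training
# only a new 2-class head (background vs. fruit).
model = deeplabv3_resnet50(weights="DEFAULT")
model.classifier[4] = nn.Conv2d(256, 2, kernel_size=1)   # new fruit/background head

for p in model.backbone.parameters():                     # keep the pre-trained features
    p.requires_grad = False

optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss(ignore_index=255)

# Stand-in for a loader over the handful of labeled (image, mask) pairs.
few_shot_loader = [(torch.randn(2, 3, 256, 256), torch.randint(0, 2, (2, 256, 256)))]

model.train()
for epoch in range(5):
    for images, masks in few_shot_loader:
        out = model(images)["out"]                        # B, 2, H, W
        loss = criterion(out, masks)
        optim.zero_grad()
        loss.backward()
        optim.step()
```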
https://arxiv.org/abs/2405.02556
We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.
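A minimal NumPy sketch of the resonator factorization loop and a simple codebook-similarity confidence measure; the convolutional sparse-coding front end and the paper's exact confidence metric are omitted, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 2048, 20                                  # vector dimension, codewords per factor
sgn = lambda v: np.where(v >= 0, 1, -1)          # sign without zeros

# Three random bipolar codebooks (e.g. for shape, color and position factors).
codebooks = [rng.choice([-1, 1], size=(K, D)) for _ in range(3)]
true_idx = [3, 7, 11]
s = np.prod([cb[i] for cb, i in zip(codebooks, true_idx)], axis=0)   # composite vector

# Resonator iteration: each factor estimate is refined by unbinding the other
# current estimates from s (the element-wise product is its own inverse for
# bipolar vectors) and cleaning up through its codebook.
estimates = [sgn(cb.sum(axis=0)) for cb in codebooks]                # superposition init
for step in range(100):
    for f, cb in enumerate(codebooks):
        others = np.prod([estimates[g] for g in range(3) if g != f], axis=0)
        estimates[f] = sgn(cb.T @ (cb @ (s * others)))
    # Confidence: cosine similarity of each estimate to its best codeword,
    # in the spirit of the paper's convergence-tracking metric.
    conf = [np.max(cb @ est) / D for cb, est in zip(codebooks, estimates)]
    if min(conf) > 0.99:
        break

decoded = [int(np.argmax(cb @ est)) for cb, est in zip(codebooks, estimates)]
print(f"converged after {step + 1} sweeps, decoded {decoded}, true {true_idx}")
```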
https://arxiv.org/abs/2404.19126
Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.
https://arxiv.org/abs/2404.03527
The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relation between scene elements while succeeding in identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computation cost. Spatial attention is subsequently concatenated into a final feature set before applying feature boosting. Low-resolution spatial attention features are trained using an auxiliary task that helps learning a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
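An illustrative stand-in for the level-wise channel attention: multi-stage features are projected to a common width, a gating branch predicts per-level, per-channel weights from globally pooled descriptors, and the re-weighted levels are summed. All widths and the gate design are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelAttentionFusion(nn.Module):
    """Features from several extraction stages are projected to a common width;
    a gating branch predicts per-level, per-channel attention weights so that
    relevant stages are boosted and others attenuated before summation."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        hidden = width * len(in_channels) // 4
        self.gate = nn.Sequential(
            nn.Linear(width * len(in_channels), hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, width * len(in_channels)),
        )
        self.width = width

    def forward(self, feats):
        # Resize every level to the finest resolution and project to `width` channels.
        size = feats[0].shape[-2:]
        xs = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
              for p, f in zip(self.proj, feats)]
        pooled = torch.cat([x.mean(dim=(2, 3)) for x in xs], dim=1)   # B, L*width
        weights = torch.sigmoid(self.gate(pooled))                    # attention weights
        weights = weights.view(-1, len(xs), self.width, 1, 1)
        return sum(w * x for w, x in zip(weights.unbind(1), xs))      # boosted sum

if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
    print(LevelAttentionFusion()(feats).shape)   # torch.Size([1, 256, 64, 64])
```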
https://arxiv.org/abs/2402.19250
Two challenges arise when parsing road scenes in UAV images. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotation to train robust and accurate models. In this paper, an unsupervised road parsing framework that leverages recent advances in vision language models and foundational computer vision models is introduced. Initially, a vision language model is employed to efficiently process ultra-high-resolution UAV images and quickly detect road regions of interest. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. Following that, a self-supervised representation learning network extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm is applied to cluster these feature representations and assign an ID to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the extraordinary flexibility of the proposed method, which even goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.
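The pseudo-label generation step (mask descriptors, clustering, ID assignment) can be sketched as below; the mask source, feature extractor, pooling, and cluster count are illustrative placeholders rather than the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def masks_to_pseudo_labels(masks, feature_map, n_clusters=8, ignore_id=255):
    """Turn class-agnostic masks plus self-supervised features into pseudo-labels.

    masks:        list of boolean arrays (H, W), e.g. produced by SAM.
    feature_map:  (H, W, C) self-supervised feature map for the same image.
    Each mask is described by its mean feature vector, the descriptors are
    clustered with k-means, and every pixel inherits its mask's cluster ID.
    """
    descriptors = np.stack([feature_map[m].mean(axis=0) for m in masks])
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(descriptors)

    h, w, _ = feature_map.shape
    pseudo = np.full((h, w), ignore_id, dtype=np.int64)   # unmasked pixels stay "ignore"
    for m, cid in zip(masks, cluster_ids):
        pseudo[m] = cid
    return pseudo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 64, 32)).astype(np.float32)
    masks = [np.zeros((64, 64), dtype=bool) for _ in range(16)]
    for i, m in enumerate(masks):                         # toy square "masks"
        m[(i // 4) * 16:(i // 4 + 1) * 16, (i % 4) * 16:(i % 4 + 1) * 16] = True
    print(np.unique(masks_to_pseudo_labels(masks, feats)))
```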
https://arxiv.org/abs/2402.02985
Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same category and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from the Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on the ScanNet and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++.
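A simplified sketch of affinity-driven primitive merging with progressively relaxed ("dynamic") thresholds; here the affinity matrix is taken as given, whereas SAI3D scores it from multi-view SAM mask consistency, so the merging criterion below is only a stand-in.

```python
import numpy as np

def hierarchical_merge(affinity, thresholds=(0.9, 0.8, 0.7)):
    """Greedy hierarchical region growing over geometric primitives.

    affinity:   (N, N) symmetric matrix scoring how consistently two primitives
                fall inside the same 2D masks.
    thresholds: progressively relaxed merge thresholds (coarse-to-fine rounds).
    Returns an instance ID per primitive; union-find keeps merging transitive.
    """
    n = affinity.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for t in thresholds:
        for i in range(n):
            for j in range(i + 1, n):
                if affinity[i, j] >= t:
                    parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([remap[r] for r in roots])

if __name__ == "__main__":
    # Two well-connected groups of primitives plus weak cross-group links.
    A = np.full((6, 6), 0.1)
    A[:3, :3] = 0.95
    A[3:, 3:] = 0.85
    np.fill_diagonal(A, 1.0)
    print(hierarchical_merge(A))              # e.g. [0 0 0 1 1 1]
```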
https://arxiv.org/abs/2312.11557
Existing state-of-the-art 3D point cloud understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks, including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework for tackling point cloud understanding when labels are limited. Our first contribution is an extensive methodological comparison of traditional and learned 3D descriptors for weakly supervised 3D scene understanding, validating that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. Our second contribution is a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and the learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, namely semantic segmentation, instance segmentation, and object detection, even when only a very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), delivers superior performance on the ScanNet data-efficient learning online benchmarks and four other large-scale 3D understanding benchmarks under various experimental settings, outperforming the current state of the art by a margin on various 3D understanding tasks without complicated learning strategies such as active learning.
https://arxiv.org/abs/2312.01262
Existing state-of-the-art 3D point cloud understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework for tackling point cloud understanding when labels are limited. We propose a novel unsupervised, region-expansion-based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on local low-level geometric property similarities and learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo-label merging, taking both geometric and semantic feature correlations into consideration. Finally, self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental results demonstrate that our framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection, even when only limited points are labeled, under the data-efficient settings for large-scale 3D semantic scene parsing. The developed techniques have the potential to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: this https URL.
https://arxiv.org/abs/2312.02208
Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, the major bottleneck of current 3D recognition approaches is that they cannot recognize unseen novel classes beyond the training categories in diverse real-world applications. Meanwhile, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks and only perform well in a fully supervised manner. This work presents a generalized and simple framework for 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that extracts and distills meaningful information from large-scale vision-language models, benefiting open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel energy-based loss with boundary awareness that benefits from region-level boundary predictions. To encourage latent instance discrimination and guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds that uses confident predictions of the neural network to discriminate intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: this https URL.
https://arxiv.org/abs/2312.00663
In this paper, we present CaveSeg - the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g., caveline, arrows), obstacles (e.g., ground plane and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in the USA, Mexico, and Spain, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.
https://arxiv.org/abs/2309.11038
The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI Road, Cityscapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI Road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.
https://arxiv.org/abs/2309.10356
Video scene parsing incorporates temporal information, which can enhance the consistency and accuracy of predictions compared to image scene parsing. The added temporal dimension enables a more comprehensive understanding of the scene, leading to more reliable results. This paper presents the winning solution of the CVPR 2023 workshop challenge on video semantic segmentation, which focuses on enhancing spatial-temporal correlations with a contrastive loss. We also explore the influence of multi-dataset training by utilizing a label-mapping technique, and the final result aggregates the outputs of the two models. Our approach achieves 65.95% mIoU on the VSPW dataset, ranking 1st place in the VSPW challenge at CVPR 2023.
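The label-mapping technique for multi-dataset training can be sketched as a simple lookup-table remap of auxiliary labels into the target taxonomy; the table entries below are hypothetical, not the mapping actually used.

```python
import torch

# Hypothetical label-mapping table from an auxiliary dataset's class IDs to the
# target (VSPW-style) taxonomy; classes without a counterpart become `ignore`.
AUX_TO_TARGET = {0: 2, 1: 2, 2: 5, 3: 11, 4: 7}   # illustrative IDs only
IGNORE = 255

def remap_labels(mask: torch.Tensor, table: dict, ignore_index: int = IGNORE) -> torch.Tensor:
    """Remap a (H, W) label mask into the target label space via a lookup table,
    so batches from several datasets can be mixed in one training run."""
    lut = torch.full((max(int(mask.max().item()), max(table)) + 1,),
                     ignore_index, dtype=torch.long)
    for src, dst in table.items():
        lut[src] = dst
    return lut[mask.long()]

if __name__ == "__main__":
    aux_mask = torch.randint(0, 7, (4, 4))        # auxiliary-dataset labels 0..6
    print(remap_labels(aux_mask, AUX_TO_TARGET))  # IDs 5 and 6 map to 255 (ignored)
```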
https://arxiv.org/abs/2306.03508
Scene parsing is a great challenge for real-time semantic segmentation. Although traditional semantic segmentation networks have made remarkable leaps forward in semantic accuracy, their inference speed remains unsatisfactory. Moreover, this progress has been achieved with fairly large networks and powerful computational resources. However, it is difficult to run such large models on edge computing devices with limited computing power, which poses a huge challenge for real-time semantic segmentation tasks. In this paper, we present Cross-CBAM, a novel lightweight network for real-time semantic segmentation. Specifically, a Squeeze-and-Excitation Atrous Spatial Pyramid Pooling module (SE-ASPP) is proposed to capture a variable field of view and multi-scale information. We also propose a Cross Convolutional Block Attention Module (CCBAM), in which a cross-multiply operation is employed to let high-level semantic information guide low-level detail information. Unlike previous works, which use attention to focus on the desired information within the backbone, CCBAM uses cross-attention for feature fusion in the FPN structure. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of the proposed Cross-CBAM model, achieving a promising trade-off between segmentation accuracy and inference speed. On the Cityscapes test set, we achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX 1080Ti.
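A hedged sketch of the cross-multiply idea: channel attention derived from the high-level feature gates the low-level feature, while CBAM-style spatial attention from the low-level feature gates the high-level one. The actual CCBAM layout may differ; this is only meant to illustrate high-level semantics guiding low-level details in an FPN-style fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGuidedFusion(nn.Module):
    """Sketch of cross-multiply fusion: channel attention from the high-level
    (semantic) feature gates the low-level (detail) feature, and CBAM-style
    spatial attention from the low-level feature gates the high-level one."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_gate = nn.Sequential(              # squeeze-and-excitation style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        low_guided = low * self.channel_gate(high)       # semantics guide details
        spatial = torch.cat([low.mean(1, keepdim=True), low.amax(1, keepdim=True)], dim=1)
        high_guided = high * torch.sigmoid(self.spatial_gate(spatial))
        return low_guided + high_guided                  # fused FPN feature

if __name__ == "__main__":
    low, high = torch.randn(1, 128, 64, 64), torch.randn(1, 128, 32, 32)
    print(CrossGuidedFusion(128)(low, high).shape)       # torch.Size([1, 128, 64, 64])
```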
https://arxiv.org/abs/2306.02306
Deep learning has enabled various Internet of Things (IoT) applications. Still, designing models with high accuracy and computational efficiency remains a significant challenge, especially in real-time video processing applications. Such applications exhibit high inter- and intra-frame redundancy, allowing further improvement. This paper proposes a similarity-aware training methodology that exploits data redundancy in video frames for efficient processing. Our approach introduces a per-layer regularization that enhances computation reuse by increasing the similarity of weights during training. We validate our methodology on two critical real-time applications, lane detection and scene parsing. We observe an average compression ratio of approximately 50% and a speedup of approximately 1.5x for different models while maintaining the same accuracy.
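One plausible instantiation of the per-layer regularization, assuming it penalizes differences between adjacent convolutional filters so that weights become more similar and computations can be reused; the paper's exact regularizer and weighting are not reproduced here.

```python
import torch
import torch.nn as nn

def weight_similarity_penalty(model: nn.Module) -> torch.Tensor:
    """Hedged per-layer regularizer in the spirit of similarity-aware training:
    for every conv layer, penalize the difference between adjacent output
    filters so weights become more similar and computations can be reused."""
    penalty = torch.zeros((), dtype=torch.float32)
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight.flatten(1)                     # (out_channels, k*k*in)
            penalty = penalty + (w[1:] - w[:-1]).pow(2).mean()
    return penalty

if __name__ == "__main__":
    net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 16, 3, padding=1))
    x, y = torch.randn(2, 3, 32, 32), torch.randn(2, 16, 32, 32)
    task_loss = nn.functional.mse_loss(net(x), y)
    loss = task_loss + 1e-3 * weight_similarity_penalty(net)   # the weighting factor is a guess
    loss.backward()
    print(float(loss))
```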
https://arxiv.org/abs/2305.06492
Panoptic segmentation is one of the most challenging scene parsing tasks, combining the tasks of semantic segmentation and instance segmentation. While much progress has been made, few works focus on the real-time application of panoptic segmentation methods. In this paper, we revisit the recently introduced K-Net architecture. We propose vital changes to the architecture, training, and inference procedure, which massively decrease latency and improve performance. Our resulting RT-K-Net sets a new state-of-the-art performance for real-time panoptic segmentation methods on the Cityscapes dataset and shows promising results on the challenging Mapillary Vistas dataset. On Cityscapes, RT-K-Net reaches 60.2% PQ with an average inference time of 32 ms for full-resolution 1024x2048 pixel images on a single Titan RTX GPU. On Mapillary Vistas, RT-K-Net reaches 33.2% PQ with an average inference time of 69 ms. Source code is available at this https URL.
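For reference, the PQ metric reported above is computed per class as the sum of matched IoUs divided by TP + 0.5*FP + 0.5*FN, with matches requiring IoU > 0.5. A minimal single-class sketch (void-region handling and per-class averaging of the full benchmark are omitted):

```python
import numpy as np

def panoptic_quality(pred_segments, gt_segments, iou_thresh=0.5):
    """Minimal single-class PQ: segments are boolean masks; a prediction and a
    ground-truth segment match if IoU > 0.5, which makes the matching unique.
    PQ = sum of matched IoUs / (TP + 0.5*FP + 0.5*FN)."""
    matched_gt, tp_iou = set(), []
    for p in pred_segments:
        for gi, g in enumerate(gt_segments):
            if gi in matched_gt:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            iou = inter / union if union else 0.0
            if iou > iou_thresh:
                matched_gt.add(gi)
                tp_iou.append(iou)
                break
    tp = len(tp_iou)
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_iou) / denom if denom else 0.0

if __name__ == "__main__":
    gt = [np.zeros((8, 8), bool)]
    gt[0][:4] = True
    pred = [np.zeros((8, 8), bool)]
    pred[0][:3] = True                            # IoU = 0.75 -> one true positive
    print(round(panoptic_quality(pred, gt), 3))   # 0.75
```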
https://arxiv.org/abs/2305.01255