The complexity of scene parsing grows with the number of object and scene classes, which is higher in unrestricted open scenes. The biggest challenge is to model the spatial relations between scene elements while still identifying objects at smaller scales. This paper presents a novel feature-boosting network that gathers spatial context from multiple levels of feature extraction and computes the attention weights for each level of representation to generate the final class labels. A novel `channel attention module' is designed to compute the attention weights, ensuring that features from the relevant extraction stages are boosted while the others are attenuated. The model also learns spatial context information at low resolution to preserve the abstract spatial relationships among scene elements and reduce computational cost. Spatial attention is subsequently concatenated into a final feature set before feature boosting is applied. The low-resolution spatial attention features are trained using an auxiliary task that helps the network learn a coarse global scene structure. The proposed model outperforms all state-of-the-art models on both the ADE20K and Cityscapes datasets.
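As a rough illustration of the idea, the following is a minimal PyTorch sketch of per-level channel attention for feature boosting; it assumes all feature maps have already been resized to a common spatial size, and all module and variable names are illustrative, not taken from the paper.

```python
# A minimal sketch of per-level channel attention for feature boosting,
# assuming all feature maps share a common spatial size.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, levels):  # levels: list of (B, C, H, W) tensors
        boosted = []
        for f in levels:
            w = self.mlp(f.mean(dim=(2, 3)))         # global pooling -> (B, C) weights
            boosted.append(f * w[:, :, None, None])  # boost or attenuate channels
        return torch.cat(boosted, dim=1)             # fused multi-level features
```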
Two challenges arise when parsing road scenes in UAV images. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotation to train robust and accurate models. In this paper, an unsupervised road parsing framework that leverages recent advances in vision language models and vision foundation models is introduced. Initially, a vision language model is employed to efficiently process ultra-high-resolution UAV images and quickly detect road regions of interest. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. Following that, a self-supervised representation learning network extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm clusters these feature representations and assigns an ID to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the flexibility of the proposed method, which goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.
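As a hedged illustration of the pseudo-label generation step, the sketch below assumes each SAM mask has already been encoded into a feature vector by the self-supervised network; the choice of k-means and the cluster count are assumptions, not the paper's stated configuration.

```python
# A minimal sketch of turning clustered mask features into pseudo-labels.
# The clustering algorithm and cluster count k are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def masks_to_pseudo_labels(mask_feats: np.ndarray, masks: list, shape, k: int = 8):
    ids = KMeans(n_clusters=k, n_init=10).fit_predict(mask_feats)  # one ID per mask
    label_map = np.full(shape, fill_value=255, dtype=np.uint8)     # 255 = ignore
    for mask, cid in zip(masks, ids):                              # paint each region
        label_map[mask] = cid                                      # with its cluster ID
    return label_map  # initial pseudo-label map for self-training
```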
Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same category and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from the Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++.
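The sketch below illustrates one plausible reading of hierarchical region growing with a dynamic threshold: primitives are merged while the affinity bar is progressively lowered across rounds. The affinity matrix and threshold schedule are illustrative assumptions, not SAI3D's exact mechanism.

```python
# A minimal sketch of hierarchical region growing with a relaxing threshold,
# using union-find over geometric primitives. Schedule values are assumed.
import numpy as np

def grow_regions(affinity: np.ndarray, thresholds=(0.9, 0.8, 0.7)):
    n = affinity.shape[0]
    parent = list(range(n))                       # union-find over primitives
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for t in thresholds:                          # lower the bar each round
        for i in range(n):
            for j in range(i + 1, n):
                if affinity[i, j] >= t:           # merge high-affinity pairs first
                    parent[find(i)] = find(j)
    return [find(i) for i in range(n)]            # instance ID per primitive
```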
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is an extensive methodological comparison of traditional and learned 3D descriptors for weakly supervised 3D scene understanding, validating that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection, even when only a very limited number of points is labeled. Our method, termed Region Merging 3D (RM3D), delivers superior performance on the ScanNet data-efficient learning online benchmarks and four other large-scale 3D understanding benchmarks under various experimental settings, outperforming current state-of-the-art methods by a clear margin on various 3D understanding tasks without complicated learning strategies such as active learning.
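A minimal sketch of one way such an affinity could combine low-level descriptor similarity with high-level semantic similarity before region merging; the fixed weighting alpha is an assumption, whereas the paper learns the merge decision.

```python
# A minimal sketch of a fused merge affinity between two regions.
# desc_*: pooled geometric descriptors (e.g. PFH-style); sem_*: learned features.
import torch.nn.functional as F

def region_affinity(desc_a, desc_b, sem_a, sem_b, alpha: float = 0.5):
    geo = F.cosine_similarity(desc_a, desc_b, dim=-1)  # low-level geometric cue
    sem = F.cosine_similarity(sem_a, sem_b, dim=-1)    # high-level semantic cue
    return alpha * geo + (1 - alpha) * sem             # fused merge score
```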
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. We propose a novel unsupervised region-expansion-based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on local low-level geometric property similarities and learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo-label merging, taking both geometric and semantic feature correlations into consideration. Finally, self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental results demonstrate that our framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection, even when only limited points are labeled, under the data-efficient settings for large-scale 3D semantic scene parsing. The developed techniques have the potential to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: this https URL.
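For intuition, the following is a hedged sketch of region expansion over a point-cloud k-NN graph: seeds absorb neighbors with similar normals, yielding the over-divided clusters that are later merged under weak-label guidance. The normal-only similarity test and its threshold are illustrative assumptions.

```python
# A minimal sketch of unsupervised region expansion on a k-NN graph.
# neighbors: list of neighbor index lists; normals: (n, 3) unit normals.
import numpy as np
from collections import deque

def expand_regions(neighbors, normals, cos_thresh: float = 0.95):
    n = len(normals)
    label = -np.ones(n, dtype=int)
    cur = 0
    for seed in range(n):
        if label[seed] != -1:
            continue
        label[seed] = cur
        q = deque([seed])
        while q:                                   # BFS expansion from the seed
            i = q.popleft()
            for j in neighbors[i]:
                if label[j] == -1 and normals[i] @ normals[j] > cos_thresh:
                    label[j] = cur                 # absorb geometrically similar point
                    q.append(j)
        cur += 1
    return label                                   # over-divided cluster per point
```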
Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, and thus perform well only in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models, benefiting the open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel energy-based loss with boundary awareness, benefiting from region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: this https URL.
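As a hedged illustration of the feature-aligned distillation idea, the sketch below aligns 3D point features with frozen vision-language embeddings via a cosine loss; the single-stage projection head is a simplification, since the paper aligns features hierarchically, and all names are illustrative.

```python
# A minimal sketch of aligning 3D point features to frozen VLM (e.g. CLIP)
# embeddings with a cosine distillation loss. Single-stage for brevity.
import torch.nn as nn
import torch.nn.functional as F

class AlignHead(nn.Module):
    def __init__(self, in_dim: int, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, clip_dim)  # map 3D features to VLM space

    def forward(self, point_feats, clip_feats):
        z = F.normalize(self.proj(point_feats), dim=-1)
        t = F.normalize(clip_feats, dim=-1)      # frozen vision-language targets
        return (1 - (z * t).sum(-1)).mean()      # cosine distillation loss
```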
In this paper, we present CaveSeg, the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g., caveline, arrows), obstacles (e.g., ground plane and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in the USA, Mexico, and Spain, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.
The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset, containing 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, KITTI road, Cityscapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.
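For a rough sense of the duplex-encoder design, the sketch below extracts features from RGB and surface-normal inputs in separate branches and fuses them; the concat-plus-convolution fusion is a simple stand-in for the paper's heterogeneous feature synergy block, and all names are illustrative.

```python
# A minimal sketch of a duplex encoder over RGB and surface-normal inputs.
# The fusion layer is a simplified stand-in for the synergy block.
import torch
import torch.nn as nn

class DuplexEncoder(nn.Module):
    def __init__(self, out_ch: int = 64):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, out_ch, 3, stride=2, padding=1)     # RGB branch
        self.normal_enc = nn.Conv2d(3, out_ch, 3, stride=2, padding=1)  # normal branch
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)                    # recalibrate

    def forward(self, rgb, normals):
        f = torch.cat([self.rgb_enc(rgb), self.normal_enc(normals)], dim=1)
        return self.fuse(f)  # fused heterogeneous features for the pixel decoder
```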
Video scene parsing incorporates temporal information, which can enhance the consistency and accuracy of predictions compared to image scene parsing. The added temporal dimension enables a more comprehensive understanding of the scene, leading to more reliable results. This paper presents the winning solution of the CVPR 2023 workshop challenge on video semantic segmentation, focusing on enhancing spatial-temporal correlations with a contrastive loss. We also explore the influence of multi-dataset training by utilizing a label-mapping technique. The final result aggregates the outputs of the above two models. Our approach achieves 65.95% mIoU on the VSPW dataset, ranking 1st in the VSPW challenge at CVPR 2023.
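As a hedged sketch of the label-mapping technique for multi-dataset training, the snippet below remaps each dataset's label IDs into one unified taxonomy before computing the loss; the mapping table itself is an illustrative assumption.

```python
# A minimal sketch of label mapping for multi-dataset training: each source
# dataset's IDs are remapped into a unified label space; unmapped IDs are ignored.
import torch

def remap_labels(labels: torch.Tensor, mapping: dict, ignore: int = 255):
    out = torch.full_like(labels, ignore)        # unmapped classes are ignored
    for src_id, dst_id in mapping.items():
        out[labels == src_id] = dst_id           # dataset ID -> unified ID
    return out

# e.g. map a hypothetical dataset's {0: road, 1: car} into a unified space
unified = remap_labels(torch.tensor([[0, 1, 5]]), {0: 7, 1: 13})
```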
Scene parsing is a great challenge for real-time semantic segmentation. Although traditional semantic segmentation networks have made remarkable leaps forward in semantic accuracy, their inference speed remains unsatisfactory. Meanwhile, this progress has been achieved with fairly large networks and powerful computational resources. However, it is difficult to run extremely large models on edge computing devices with limited computing power, which poses a huge challenge for real-time semantic segmentation tasks. In this paper, we present the Cross-CBAM network, a novel lightweight network for real-time semantic segmentation. Specifically, a Squeeze-and-Excitation Atrous Spatial Pyramid Pooling module (SE-ASPP) is proposed to obtain a variable field-of-view and multiscale information. We also propose a Cross Convolutional Block Attention Module (CCBAM), in which a cross-multiply operation makes high-level semantic information guide low-level detail information. Unlike previous works, which use attention to focus on the desired information in the backbone, CCBAM uses cross-attention for feature fusion in the FPN structure. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of the proposed Cross-CBAM model, which achieves a promising trade-off between segmentation accuracy and inference speed. On the Cityscapes test set, we achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX 1080Ti.
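The sketch below illustrates the cross-multiply idea as we read it: channel-wise gates derived from high-level features re-weight low-level detail features during FPN fusion. Shapes assume the high-level map is upsampled to match, and the exact attention design in CCBAM may differ.

```python
# A minimal sketch of high-level semantics gating low-level details
# during FPN-style fusion, via an elementwise cross-multiply.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, 1)

    def forward(self, low, high):                       # (B, C, H, W) each
        high = F.interpolate(high, size=low.shape[2:])  # align resolutions
        gate = torch.sigmoid(self.fc(high))             # semantic guidance weights
        return low * gate + high                        # guided detail + context
```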
Deep learning has enabled various Internet of Things (IoT) applications. Still, designing models with high accuracy and computational efficiency remains a significant challenge, especially in real-time video processing applications. Such applications exhibit high inter- and intra-frame redundancy, allowing further improvement. This paper proposes a similarity-aware training methodology that exploits data redundancy in video frames for efficient processing. Our approach introduces a per-layer regularization that enhances computation reuse by increasing the similarity of weights during training. We validate our methodology on two critical real-time applications: lane detection and scene parsing. We observe an average compression ratio of approximately 50% and a speedup of approximately 1.5x for different models while maintaining the same accuracy.
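As a hedged sketch of a per-layer similarity regularizer, the snippet below pulls neighboring convolution filters toward each other so that more computations can be reused across similar inputs; treating similarity as an L1 penalty on neighboring-filter differences is our assumption about the regularizer's form.

```python
# A minimal sketch of a per-layer weight-similarity regularizer, added to the
# task loss during training. The L1 neighboring-filter penalty is an assumption.
import torch

def similarity_reg(model, coeff: float = 1e-4):
    reg = 0.0
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d) and m.weight.shape[0] > 1:
            w = m.weight.flatten(1)                    # (out_channels, rest)
            reg = reg + (w[1:] - w[:-1]).abs().mean()  # neighboring-filter gap
    return coeff * reg                                 # add to the task loss
```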
Panoptic segmentation is one of the most challenging scene parsing tasks, combining the tasks of semantic segmentation and instance segmentation. While much progress has been made, few works focus on the real-time application of panoptic segmentation methods. In this paper, we revisit the recently introduced K-Net architecture. We propose vital changes to the architecture, training, and inference procedure, which massively decrease latency and improve performance. Our resulting RT-K-Net sets a new state-of-the-art performance for real-time panoptic segmentation methods on the Cityscapes dataset and shows promising results on the challenging Mapillary Vistas dataset. On Cityscapes, RT-K-Net reaches 60.2% PQ with an average inference time of 32 ms for full-resolution 1024x2048 images on a single Titan RTX GPU. On Mapillary Vistas, RT-K-Net reaches 33.2% PQ with an average inference time of 69 ms. Source code is available at this https URL.
Autonomous vehicles (AVs) are becoming an indispensable part of future transportation. However, safety challenges and lack of reliability limit their real-world deployment. Towards boosting the presence of AVs on the roads, the interaction of AVs with pedestrians, including prediction of the pedestrian crossing intention, deserves extensive research. This is a highly challenging task, as it involves multiple non-linear parameters. In this direction, we extract and analyse spatio-temporal visual features of both pedestrian and traffic contexts. The pedestrian features include body pose and local context features that represent the pedestrian's behaviour. Additionally, to understand the global context, we utilise location, motion, and environmental information obtained using scene parsing technology, which represents the pedestrian's surroundings and may affect the pedestrian's intention. Finally, these multi-modality features are intelligently fused for effective intention prediction learning. The experimental results of the proposed model on the JAAD dataset show superior results in combined AUC and F1-score compared to the state of the art.
Nowadays, many visual scene understanding problems are addressed by dense prediction networks. But pixel-wise dense annotations are very expensive (e.g., for scene parsing) or impossible (e.g., for intrinsic image decomposition), motivating us to leverage cheap point-level weak supervision. However, existing pointly-supervised methods still use the same architecture designed for full supervision. In stark contrast to them, we propose a new paradigm that makes predictions for point coordinate queries, inspired by the recent success of implicit representations, like distance or radiance fields. As such, the method is named dense prediction fields (DPFs). DPFs generate expressive intermediate features for continuous sub-pixel locations, thus allowing outputs of an arbitrary resolution. DPFs are naturally compatible with point-level supervision. We showcase the effectiveness of DPFs using two substantially different tasks: high-level semantic parsing and low-level intrinsic image decomposition. In these two cases, supervision comes in the form of single-point semantic categories and two-point relative reflectance, respectively. As benchmarked on three large-scale public datasets, PASCAL-Context, ADE20K, and IIW, DPFs set new state-of-the-art performance on all of them by significant margins. Code can be accessed at this https URL.
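For intuition about predicting at continuous point queries, the sketch below bilinearly samples backbone features at sub-pixel coordinates and decodes them with an MLP, so supervision can come from sparse labeled points; the single-scale sampling and all names are illustrative simplifications of the DPF design.

```python
# A minimal sketch of a point-coordinate-query head: sample backbone features
# at continuous coordinates, then decode per-point predictions with an MLP.
import torch.nn as nn
import torch.nn.functional as F

class PointQueryHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, num_classes))

    def forward(self, feats, coords):
        # feats: (B, C, H, W); coords: (B, N, 2) in [-1, 1] as (x, y)
        grid = coords.unsqueeze(2)                                 # (B, N, 1, 2)
        sampled = F.grid_sample(feats, grid, align_corners=True)   # (B, C, N, 1)
        return self.mlp(sampled.squeeze(-1).transpose(1, 2))       # (B, N, classes)
```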
Traffic scene parsing is one of the most important tasks for achieving intelligent cities. So far, little effort has been spent on constructing datasets specifically for traffic scene parsing. To fill this gap, here we introduce the TSP6K dataset, containing 6,000 urban traffic images spanning hundreds of street scenes under various weather conditions. In contrast to most previous traffic scene datasets collected from a driving platform, the images in our dataset are captured from shooting platforms hanging high above the street. Such traffic images can capture more crowded street scenes, with several times more traffic participants than driving scenes. Each image in the TSP6K dataset is provided with high-quality pixel-level and instance-level annotations. We perform a detailed analysis of the dataset and comprehensively evaluate state-of-the-art scene parsing methods. Considering the vast difference in instance sizes, we propose a detail refining decoder, which recovers the details of different semantic regions in traffic scenes. Experiments have shown its effectiveness in parsing high-hanging traffic scenes. Code and dataset will be made publicly available.
LiDAR-based 3D object detectors have achieved impressive performance on many benchmarks; however, multi-sensor fusion-based techniques are promising for further improving the results. PointPainting, a recently proposed framework, can add semantic information from the 2D image to the 3D LiDAR points via a painting operation to boost detection performance. However, due to the limited resolution of 2D feature maps, a severe boundary-blurring effect occurs during the re-projection of 2D semantic segmentation into the 3D point clouds. To handle this limitation, a general multimodal fusion framework, MSF, is proposed to fuse semantic information from both the 2D image and the 3D point cloud scene parsing results. Specifically, MSF includes three main modules. First, SOTA off-the-shelf 2D/3D semantic segmentation approaches are employed to generate parsing results for 2D images and 3D point clouds. The 2D semantic information is then re-projected into the 3D point clouds with calibrated parameters. To handle the misalignment between the 2D and 3D parsing results, an AAF module is proposed to fuse them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the subsequent 3D object detectors. Furthermore, we propose a DFF module to aggregate deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing against different baselines. The experimental results show that the proposed fusion strategies can significantly improve detection performance compared to methods using only point clouds or only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets new SOTA results on the nuScenes testing benchmark.
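In the spirit of the AAF module, the sketch below adaptively fuses per-point semantic scores from the 2D and 3D parsing branches with a learned per-point weight; the gating architecture is an illustrative assumption rather than the paper's exact design.

```python
# A minimal sketch of adaptive fusion of 2D and 3D per-point semantic scores:
# a small gate predicts a per-point fusion score from both inputs.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * num_classes, 1), nn.Sigmoid())

    def forward(self, sem2d, sem3d):                      # (N, num_classes) each
        w = self.gate(torch.cat([sem2d, sem3d], dim=-1))  # per-point fusion score
        return w * sem2d + (1 - w) * sem3d                # fused labels for painting
```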
Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20K, Cityscapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at this https URL
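As a hedged sketch of task conditioning, the snippet below learns an embedding per task (semantic, instance, panoptic) and injects it alongside shared object queries so one model can serve all three tasks; dimensions and the injection point are illustrative assumptions.

```python
# A minimal sketch of task-token conditioning: a learned task embedding is
# prepended to shared object queries before the decoder.
import torch
import torch.nn as nn

class TaskConditionedQueries(nn.Module):
    def __init__(self, num_queries: int = 100, dim: int = 256):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.task_emb = nn.Embedding(3, dim)       # 0: semantic, 1: instance, 2: panoptic

    def forward(self, task_id: torch.Tensor):      # (B,) long tensor of task IDs
        q = self.queries.weight.unsqueeze(0)       # (1, Q, D) shared queries
        t = self.task_emb(task_id).unsqueeze(1)    # (B, 1, D) task token
        return torch.cat([t, q.expand(task_id.numel(), -1, -1)], dim=1)
```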
Recently, parsing 3D scenes with deep learning approaches has been a hot topic. However, current fully-supervised models require manually annotated point-wise supervision, which is extremely user-unfriendly and time-consuming to obtain. As such, training 3D scene parsing models with sparse supervision is an intriguing alternative. We term this task data-efficient 3D scene parsing and propose an effective two-stage framework named VIBUS to resolve it by exploiting the enormous number of unlabeled points. In the first stage, we perform self-supervised representation learning on unlabeled points with the proposed Viewpoint Bottleneck loss function. The loss function is derived from an information bottleneck objective imposed on scenes under different viewpoints, making the process of representation learning free of degradation and sampling. In the second stage, pseudo labels are harvested from the sparse labels based on uncertainty-spectrum modeling. By combining data-driven uncertainty measures and 3D mesh spectrum measures (derived from normal directions and geodesic distances), a robust local affinity metric is obtained. Finite gamma/beta mixture models are used to decompose category-wise distributions of these measures, leading to automatic selection of thresholds. We evaluate VIBUS on the public benchmark ScanNet and achieve state-of-the-art results on both the validation set and the online test server. Ablation studies show that both Viewpoint Bottleneck and uncertainty-spectrum modeling bring significant improvements. Codes and models are publicly available at this https URL.
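For intuition, the sketch below shows a redundancy-reduction objective between embeddings of the same scene under two viewpoints, in the spirit of the Viewpoint Bottleneck loss; the exact formulation in VIBUS may differ.

```python
# A minimal sketch of a cross-correlation (redundancy-reduction) objective
# between two viewpoint embeddings of the same scene.
import torch

def viewpoint_bottleneck(z1, z2, lam: float = 5e-3):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # (N, D) per-dim standardize
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = z1.T @ z2 / z1.shape[0]                   # (D, D) cross-correlation
    on = ((torch.diagonal(c) - 1) ** 2).sum()     # invariance term
    off = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()  # redundancy term
    return on + lam * off
```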
Night-Time Scene Parsing (NTSP) is essential to many vision applications, especially autonomous driving. Most existing methods are proposed for day-time scene parsing. They rely on modeling pixel intensity-based spatial contextual cues under even illumination. Hence, these methods do not perform well in night-time scenes, as such spatial contextual cues are buried in the over-/under-exposed regions of night-time scenes. In this paper, we first conduct an image frequency-based statistical experiment to interpret the day-time and night-time scene discrepancies. We find that image frequency distributions differ significantly between day-time and night-time scenes, and that understanding such frequency distributions is critical to the NTSP problem. Based on this, we propose to exploit the image frequency distributions for night-time scene parsing. First, we propose a Learnable Frequency Encoder (LFE) to model the relationship between different frequency coefficients and measure all frequency components dynamically. Second, we propose a Spatial Frequency Fusion (SFF) module that fuses both spatial and frequency information to guide the extraction of spatial context features. Extensive experiments show that our method performs favorably against state-of-the-art methods on the NightCity, NightCity+ and BDD100K-night datasets. In addition, we demonstrate that our method can be applied to existing day-time scene parsing methods and boost their performance on night-time scenes.
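As a hedged sketch of learnable frequency re-weighting, the snippet below scales an image's frequency components with learned coefficients before returning to the spatial domain; the real LFE/SFF modules model inter-coefficient relations with learned layers and fuse the result with spatial features, which this simplification omits.

```python
# A minimal sketch of learned re-weighting of frequency components via a
# real-valued 2D FFT, as a simplified stand-in for learnable frequency encoding.
import torch
import torch.nn as nn

class FrequencyReweight(nn.Module):
    def __init__(self, h: int, w: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(h, w // 2 + 1))  # rfft2 spectrum size

    def forward(self, x):                    # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.weight            # dynamically scale frequency bands
        return torch.fft.irfft2(spec, s=x.shape[2:], norm="ortho")
```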
A key algorithm for understanding the world is material segmentation, which assigns a label (metal, glass, etc.) to each pixel. We find that a model trained on existing data underperforms in some settings and propose to address this with a large-scale dataset of 3.2 million dense segments on 44,560 indoor and outdoor images, which is 23x more segments than in existing data. Our data covers a more diverse set of scenes, objects, viewpoints, and materials, and contains a fairer distribution of skin types. We show that a model trained on our data outperforms a state-of-the-art model across datasets and viewpoints. We propose a large-scale scene parsing benchmark and baseline of 0.729 per-pixel accuracy, 0.585 mean class accuracy, and 0.420 mean IoU across 46 materials.