Autonomous vehicles rely on LiDAR sensors to perceive the environment. Adverse weather conditions like rain, snow, and fog negatively affect these sensors, reducing their reliability by introducing unwanted noise into the measurements. In this work, we tackle this problem by proposing a novel approach for detecting adverse weather effects in LiDAR data. We reformulate this problem as an outlier detection task and use an energy-based framework to detect outliers in point clouds. More specifically, our method learns to associate low energy scores with inlier points and high energy scores with outliers, allowing for robust detection of adverse weather effects. In extensive experiments, we show that our method performs better in adverse weather detection and has higher robustness to unseen weather effects than previous state-of-the-art methods. Furthermore, we show how our method can be used to perform simultaneous outlier detection and semantic segmentation. Finally, to help expand the research field of LiDAR perception in adverse weather, we release the SemanticSpray dataset, which contains labeled vehicle spray data in highway-like scenarios.
https://arxiv.org/abs/2305.16129
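A minimal sketch of the energy-based scoring idea described above, assuming a per-point classification head that outputs logits for the inlier classes (the head, threshold, and tensor shapes here are hypothetical, not the authors' exact architecture):

    import torch

    def energy_scores(logits: torch.Tensor) -> torch.Tensor:
        """Per-point energy E(x) = -logsumexp(logits); low energy is associated
        with inlier points, high energy with (adverse weather) outliers."""
        return -torch.logsumexp(logits, dim=-1)

    def detect_outliers(logits: torch.Tensor, threshold: float) -> torch.Tensor:
        """Boolean mask of points flagged as outliers."""
        return energy_scores(logits) > threshold

    # Toy usage: N points, C inlier classes, random logits standing in for a network.
    logits = torch.randn(2048, 4)
    mask = detect_outliers(logits, threshold=0.0)
    print(mask.sum().item(), "of", mask.numel(), "points flagged as outliers")

In practice the threshold would be chosen on held-out data, and the same logits can feed a softmax for the simultaneous semantic segmentation mentioned in the abstract.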
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, with the CLIP model achieving impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited by the domain gap between depth maps obtained by 3D projection and the images CLIP was trained on. This paper proposes DiffCLIP, a new pre-training framework that incorporates Stable Diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using Stable Diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
https://arxiv.org/abs/2305.15957
Radars and cameras are among the most frequently used sensors for advanced driver assistance systems and automated driving research. However, there has been surprisingly little research on radar-camera fusion with neural networks. One of the reasons is a lack of large-scale automotive datasets with radar and unmasked camera data, with the exception of the nuScenes dataset. Another reason is the difficulty of effectively fusing the sparse radar point cloud on the bird's eye view (BEV) plane with the dense images on the perspective plane. The recent trend of camera-based 3D object detection using BEV features has enabled a new type of fusion, which is better suited for radars. In this work, we present RC-BEVFusion, a modular radar-camera fusion network on the BEV plane. We propose BEVFeatureNet, a novel radar encoder branch, and show that it can be incorporated into several state-of-the-art camera-based architectures. We show significant performance gains of up to a 28% increase in the nuScenes detection score, which is an important step in radar-camera fusion research. Without tuning our model for the nuScenes benchmark, we achieve the best result among all published methods in the radar-camera fusion category.
https://arxiv.org/abs/2305.15883
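BEVFeatureNet itself is not specified in the abstract; the sketch below only illustrates the basic step of putting a sparse radar point cloud onto the BEV plane by averaging per-point features into grid cells (grid extents, cell size, and the feature layout are assumptions):

    import numpy as np

    def radar_to_bev(points, feats, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
        """Scatter per-point radar features (e.g. RCS, radial velocity) onto a BEV
        grid by averaging all features that fall into the same cell."""
        nx = int((x_range[1] - x_range[0]) / cell)
        ny = int((y_range[1] - y_range[0]) / cell)
        bev = np.zeros((nx, ny, feats.shape[1]), dtype=np.float32)
        count = np.zeros((nx, ny, 1), dtype=np.float32)
        ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
        iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
        valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
        for i, j, f in zip(ix[valid], iy[valid], feats[valid]):
            bev[i, j] += f
            count[i, j] += 1.0
        return bev / np.maximum(count, 1.0)

    # Toy usage: 200 radar points with (x, y) positions and 3 features each.
    pts = np.random.uniform(-50, 50, size=(200, 2)).astype(np.float32)
    ftr = np.random.randn(200, 3).astype(np.float32)
    print(radar_to_bev(pts, ftr).shape)  # (200, 200, 3)

The resulting (nx, ny, C) tensor can then be combined with camera BEV features before the detection head, which is the general fusion pattern the paper builds on.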
Architectures that first convert point clouds to a grid representation and then apply convolutional neural networks achieve good performance for radar-based object detection. However, the transfer from irregular point cloud data to a dense grid structure is often associated with a loss of information, due to the discretization and aggregation of points. In this paper, we propose a novel architecture, multi-scale KPPillarsBEV, that aims to mitigate the negative effects of grid rendering. Specifically, we propose a novel grid rendering method, KPBEV, which leverages the descriptive power of kernel point convolutions to improve the encoding of local point cloud contexts during grid rendering. In addition, we propose a general multi-scale grid rendering formulation to incorporate multi-scale feature maps into convolutional backbones of detection networks with arbitrary grid rendering methods. We perform extensive experiments on the nuScenes dataset and evaluate the methods in terms of detection performance and computational complexity. The proposed multi-scale KPPillarsBEV architecture outperforms the baseline by 5.37% and the previous state of the art by 2.88% in Car AP4.0 (average precision for a matching threshold of 4 meters) on the nuScenes validation set. Moreover, the proposed single-scale KPBEV grid rendering improves the Car AP4.0 by 2.90% over the baseline while maintaining the same inference speed.
https://arxiv.org/abs/2305.15836
This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to the targeted region in LiDAR point clouds. Previous approaches for REC usually focus on the 2D or 3D-indoor domain, which is not suitable for accurately predicting the location of the queried 3D region in an autonomous driving scene. In addition, the upper-bound limitation and the heavy computation cost motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed LiDAR Grounding. Then we devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective token fusion strategy. It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector without any post-processing. Moreover, the image feature can be flexibly integrated into our approach to provide rich texture and color information. The cross-modal learning encourages the detector to concentrate on important regions in the point cloud by considering the informative language expressions, thus leading to much better accuracy and efficiency. Extensive experiments on the Talk2Car dataset demonstrate the effectiveness of the proposed methods. Our work offers a deeper insight into the LiDAR-based grounding task, and we expect it to present a promising direction for the autonomous driving community.
https://arxiv.org/abs/2305.15765
In this paper, we propose a novel language-guided 3D arbitrary neural style transfer method (CLIP3Dstyler). We aim to stylize any 3D scene with an arbitrary style from a text description and to synthesize the novel stylized view, which is more flexible than image-conditioned style transfer. Compared with the previous 2D method CLIPStyler, we are able to stylize a 3D scene and generalize to novel scenes without re-training our model. A straightforward solution is to combine previous image-conditioned 3D style transfer and text-conditioned 2D style transfer methods. However, such a solution cannot achieve our goal due to two main challenges. First, there is no multi-modal model matching point clouds and language at different feature scales (e.g., low-level, high-level). Second, we observe a style mixing issue when we stylize the content with different style conditions from text prompts. To address the first issue, we propose a 3D stylization framework to match the point cloud features with text features in local and global views. For the second issue, we propose an improved directional divergence loss to make arbitrary text styles more distinguishable as a complement to our framework. We conduct extensive experiments to show the effectiveness of our model on text-guided 3D scene style transfer.
https://arxiv.org/abs/2305.15732
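The improved directional divergence loss is not given in the abstract; for orientation, the baseline CLIP directional loss used by CLIPStyler-like methods compares the edit direction in image-embedding space with the edit direction in text-embedding space. The sketch below shows that baseline form with random tensors standing in for CLIP embeddings; it is not the authors' improved loss:

    import torch
    import torch.nn.functional as F

    def directional_loss(e_img_stylized, e_img_content, e_txt_style, e_txt_source):
        """Baseline directional loss: 1 - cos(delta_image, delta_text), aligning the
        change in image embeddings with the change described by the text prompts."""
        d_img = e_img_stylized - e_img_content
        d_txt = e_txt_style - e_txt_source
        return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()

    # Toy usage with random 512-d embeddings in place of CLIP features.
    B, D = 8, 512
    loss = directional_loss(torch.randn(B, D), torch.randn(B, D),
                            torch.randn(B, D), torch.randn(B, D))
    print(loss.item())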
Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning, but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper, we propose OccupancyM3D, a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations. Specifically, by using synchronized raw sparse LiDAR point clouds, we define the space status and generate voxel-based occupancy labels. We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. Resulting occupancy estimates are employed to enhance original frustum/3D features. As a result, experiments on KITTI and Waymo open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin. Codes and pre-trained models will be available at this https URL.
https://arxiv.org/abs/2305.15694
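One simple way to turn synchronized sparse LiDAR points into voxel-based occupancy labels, in the spirit of the abstract, is to mark every voxel that contains at least one point as occupied; the actual label definition in the paper (e.g. the treatment of free versus unknown space) may differ, and the range and voxel size below are assumptions:

    import numpy as np

    def occupancy_labels(points, pc_range, voxel_size):
        """Binary voxel occupancy: 1 where at least one LiDAR point falls, else 0.
        points: (N, 3) xyz; pc_range: (xmin, ymin, zmin, xmax, ymax, zmax)."""
        low = np.asarray(pc_range[:3], dtype=np.float32)
        high = np.asarray(pc_range[3:], dtype=np.float32)
        dims = np.ceil((high - low) / voxel_size).astype(int)
        occ = np.zeros(dims, dtype=np.uint8)
        idx = np.floor((points - low) / voxel_size).astype(int)
        keep = np.all((idx >= 0) & (idx < dims), axis=1)
        occ[tuple(idx[keep].T)] = 1
        return occ

    pts = np.random.uniform(-1, 1, size=(10000, 3)) * [40.0, 40.0, 2.0]
    labels = occupancy_labels(pts, pc_range=(-40, -40, -3, 40, 40, 1), voxel_size=0.4)
    print(labels.shape, int(labels.sum()))

Such labels can then supervise a per-voxel binary classifier (e.g. with a cross-entropy loss), matching the "occupancy prediction as a simple classification problem" formulation.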
Augmenting LiDAR input with multiple previous frames provides richer semantic information and thus boosts performance in 3D object detection. However, crowded point clouds in multi-frame input can degrade the precise position information due to motion blur and inaccurate point projection. In this work, we propose a novel feature fusion strategy, DynStaF (Dynamic-Static Fusion), which enhances the rich semantic information provided by the multi-frame input (dynamic branch) with the accurate location information from the current single frame (static branch). To effectively extract and aggregate complementary features, DynStaF contains two modules, Neighborhood Cross Attention (NCA) and Dynamic-Static Interaction (DSI), operating through a dual pathway architecture. NCA takes the features in the static branch as queries and the features in the dynamic branch as keys and values. When computing the attention, we address the sparsity of point clouds and take only neighborhood positions into consideration. NCA fuses the two features at different feature map scales, followed by DSI providing comprehensive interaction. To analyze our proposed strategy DynStaF, we conduct extensive experiments on the nuScenes dataset. On the test set, DynStaF increases the NDS of PointPillars by a large margin, from 57.7% to 61.6%. When combined with CenterPoint, our framework achieves 61.0% mAP and 67.7% NDS, leading to state-of-the-art performance without bells and whistles.
https://arxiv.org/abs/2305.15219
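The abstract describes NCA only at a high level; the sketch below shows one plausible reading, restricting dot-product cross-attention so that each static-branch point (query) attends only over its k nearest dynamic-branch points (keys, which also serve as values here for brevity). The shapes, k, and the scaling factor are assumptions:

    import torch
    import torch.nn.functional as F

    def neighborhood_cross_attention(q_feat, q_xyz, kv_feat, kv_xyz, k=16):
        """q_feat: (N, C) static-branch features (queries), q_xyz: (N, 3);
        kv_feat: (M, C) dynamic-branch features (keys/values), kv_xyz: (M, 3)."""
        dist = torch.cdist(q_xyz, kv_xyz)                 # (N, M) pairwise distances
        knn = dist.topk(k, largest=False).indices         # (N, k) nearest dynamic points
        keys = kv_feat[knn]                               # (N, k, C)
        attn = torch.einsum('nc,nkc->nk', q_feat, keys)   # dot-product scores
        attn = F.softmax(attn / q_feat.shape[-1] ** 0.5, dim=-1)
        return torch.einsum('nk,nkc->nc', attn, keys)     # attended features, (N, C)

    q, qx = torch.randn(1024, 64), torch.rand(1024, 3) * 50
    kv, kx = torch.randn(4096, 64), torch.rand(4096, 3) * 50
    print(neighborhood_cross_attention(q, qx, kv, kx).shape)  # torch.Size([1024, 64])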
Recently, graph-based and Transformer-based deep learning networks have demonstrated excellent performance on various point cloud tasks. Most existing graph methods are based on static graphs, which take a fixed input to establish graph relations. Moreover, many graph methods aggregate neighboring features with max or average pooling, so that either only a single neighboring point affects the centroid's feature or all neighboring points influence it equally, ignoring the correlations and differences between points. Most Transformer-based methods extract point cloud features based on global attention and lack feature learning on local neighbors. To solve the problems of these two types of models, we propose a new feature extraction block named Graph Transformer and construct a 3D point cloud learning network called GTNet to learn features of point clouds at local and global scales. Graph Transformer integrates the advantages of graph-based and Transformer-based methods and consists of Local Transformer and Global Transformer modules. The Local Transformer uses a dynamic graph to calculate all neighboring point weights by intra-domain cross-attention with dynamically updated graph relations, so that every neighboring point can affect the centroid's features with a different weight; the Global Transformer enlarges the receptive field of the Local Transformer through global self-attention. In addition, to avoid vanishing gradients as the network depth increases, we apply residual connections to centroid features in GTNet; we also use the features of the centroid and its neighbors to generate local geometric descriptors in the Local Transformer to strengthen the model's ability to learn local information. Finally, we use GTNet for shape classification, part segmentation, and semantic segmentation tasks in this paper.
https://arxiv.org/abs/2305.15213
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at this https URL.
https://arxiv.org/abs/2305.14836
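The question templates themselves are not listed in the abstract; a toy illustration of programmatic question-answer generation from detection annotations (the annotation fields and the template below are hypothetical) could look like:

    from collections import Counter

    # Hypothetical per-object annotations derived from 3D detection labels.
    scene = [
        {"category": "car", "status": "moving"},
        {"category": "car", "status": "parked"},
        {"category": "pedestrian", "status": "moving"},
    ]

    def generate_counting_qa(objects, template="How many {status} {category}s are there?"):
        """Fill a counting template for every (status, category) pair in the scene."""
        counts = Counter((o["status"], o["category"]) for o in objects)
        return [(template.format(status=s, category=c), str(n)) for (s, c), n in counts.items()]

    for question, answer in generate_counting_qa(scene):
        print(question, "->", answer)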
We introduce Point2SSM, a novel unsupervised learning approach that can accurately construct correspondence-based statistical shape models (SSMs) of anatomy directly from point clouds. SSMs are crucial in clinical research for analyzing the population-level morphological variation in bones and organs. However, traditional methods for creating SSMs have limitations that hinder their widespread adoption, such as the need for noise-free surface meshes or binary volumes, reliance on assumptions or predefined templates, and simultaneous optimization of the entire cohort leading to lengthy inference times given new data. Point2SSM overcomes these barriers by providing a data-driven solution that infers SSMs directly from raw point clouds, reducing inference burdens and increasing applicability as point clouds are more easily acquired. Deep learning on 3D point clouds has seen recent success in unsupervised representation learning, point-to-point matching, and shape correspondence; however, their application to constructing SSMs of anatomies is largely unexplored. In this work, we benchmark state-of-the-art point cloud deep networks on the task of SSM and demonstrate that they are not robust to the challenges of anatomical SSM, such as noisy, sparse, or incomplete input and significantly limited training data. Point2SSM addresses these challenges via an attention-based module that provides correspondence mappings from learned point features. We demonstrate that the proposed method significantly outperforms existing networks in terms of both accurate surface sampling and correspondence, better capturing population-level statistics.
https://arxiv.org/abs/2305.14486
Reasoning over the interplay between object deformation and force transmission through contact is central to the manipulation of compliant objects. In this paper, we propose Neural Deforming Contact Field (NDCF), a representation that jointly models object deformations and contact patches from visuo-tactile feedback using implicit representations. Representing the object geometry and contact with the environment implicitly allows a single model to predict contact patches of varying complexity. Additionally, learning geometry and contact simultaneously allows us to enforce physical priors, such as ensuring contacts lie on the surface of the object. We propose a neural network architecture to learn a NDCF, and train it using simulated data. We then demonstrate that the learned NDCF transfers directly to the real-world without the need for fine-tuning. We benchmark our proposed approach against a baseline representing geometry and contact patches with point clouds. We find that NDCF performs better on simulated data and in transfer to the real-world.
https://arxiv.org/abs/2305.14470
In this work, we address the challenging task of few-shot and zero-shot 3D point cloud semantic segmentation. The success of few-shot semantic segmentation in 2D computer vision is mainly driven by pre-training on large-scale datasets like ImageNet. A feature extractor pre-trained on large-scale 2D datasets greatly helps 2D few-shot learning. However, the development of 3D deep learning is hindered by the limited volume and instance modality of datasets due to the significant cost of 3D data collection and annotation. This results in less representative features and large intra-class feature variation for few-shot 3D point cloud segmentation. As a consequence, directly extending existing popular prototypical methods of 2D few-shot classification/segmentation to 3D point cloud segmentation does not work as well as in the 2D domain. To address this issue, we propose a Query-Guided Prototype Adaption (QGPA) module to adapt prototypes from the support point cloud feature space to the query point cloud feature space. With such prototype adaption, we greatly alleviate the issue of large intra-class feature variation in point clouds and significantly improve the performance of few-shot 3D segmentation. Besides, to enhance the representation of prototypes, we introduce a Self-Reconstruction (SR) module that enables a prototype to reconstruct the support mask as well as possible. Moreover, we further consider zero-shot 3D point cloud semantic segmentation, where there is no support sample. To this end, we introduce category words as semantic information and propose a semantic-visual projection model to bridge the semantic and visual spaces. Our proposed method surpasses state-of-the-art algorithms by a considerable 7.90% and 14.82% under the 2-way 1-shot setting on the S3DIS and ScanNet benchmarks, respectively. Code is available at this https URL.
https://arxiv.org/abs/2305.14335
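QGPA and SR are not specified beyond the abstract; the prototypical baseline they build on can be sketched as masked average pooling of support features followed by cosine-similarity matching of query points to prototypes. This is the standard formulation, not the paper's full method:

    import torch
    import torch.nn.functional as F

    def masked_average_prototype(support_feat, support_mask):
        """support_feat: (N, C) per-point features; support_mask: (N,) binary mask of
        the target class. Returns a (C,) class prototype."""
        m = support_mask.float().unsqueeze(-1)
        return (support_feat * m).sum(0) / m.sum().clamp(min=1.0)

    def segment_query(query_feat, prototypes):
        """query_feat: (M, C); prototypes: (K, C). Assign each query point to the
        prototype with the highest cosine similarity."""
        sim = F.cosine_similarity(query_feat.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
        return sim.argmax(dim=1)

    feat_s, mask_s = torch.randn(2048, 96), (torch.rand(2048) > 0.7).long()
    feat_q = torch.randn(2048, 96)
    protos = torch.stack([masked_average_prototype(feat_s, 1 - mask_s),   # background
                          masked_average_prototype(feat_s, mask_s)])      # foreground
    print(segment_query(feat_q, protos).shape)  # torch.Size([2048])

QGPA, as described, would adapt the prototypes using the query features before this matching step.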
While point-based neural architectures have demonstrated their efficacy, the time-consuming sampler currently prevents them from performing real-time reasoning on scene-level point clouds. Existing methods attempt to overcome this issue by using a random sampling strategy instead of the commonly adopted farthest point sampling (FPS), but at the expense of lower performance. The effectiveness/efficiency trade-off therefore remains under-explored. In this paper, we reveal that the key to high-quality sampling is ensuring even spacing between points in the subset, which can be naturally obtained through a grid. Based on this insight, we propose a hierarchical adaptive voxel-guided point sampler with linear complexity and high parallelization for real-time applications. Extensive experiments on large-scale point cloud detection and segmentation tasks demonstrate that our method achieves performance competitive with the most powerful FPS while being more than 100 times faster. This breakthrough in efficiency addresses the bottleneck of the sampling step when handling scene-level point clouds. Furthermore, our sampler can be easily integrated into existing models and achieves a 20-80% reduction in runtime with minimal effort. The code will be available at this https URL
https://arxiv.org/abs/2305.14306
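A minimal version of grid-guided sampling, keeping one representative point per occupied voxel to obtain roughly even spacing at linear cost, is sketched below; the hierarchical/adaptive parts and the parallel implementation of the paper are not reproduced, and the voxel size is an assumption:

    import numpy as np

    def voxel_guided_sample(points, voxel_size=0.4):
        """Keep one point per occupied voxel; unlike farthest point sampling this
        runs in roughly linear time and still yields evenly spaced points."""
        idx = np.floor(points / voxel_size).astype(np.int64)
        _, first = np.unique(idx, axis=0, return_index=True)   # one index per voxel
        return points[np.sort(first)]

    pts = np.random.uniform(-50, 50, size=(100000, 3)).astype(np.float32)
    sub = voxel_guided_sample(pts, voxel_size=2.0)
    print(len(pts), "->", len(sub), "points")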
Adverse weather can introduce noise into light detection and ranging (LiDAR) data. This is a problem since LiDAR is used in many outdoor applications, e.g., object detection and mapping. We propose the task of multi-echo denoising, where the goal is to pick the echo that represents the objects of interest and discard other echoes. Thus, the idea is to pick points from alternative echoes that are not available in standard strongest-echo point clouds due to the noise. In an intuitive sense, we are trying to see through the adverse weather. To achieve this goal, we propose a novel self-supervised deep learning method and a characteristics similarity regularization method to boost its performance. Based on extensive experiments on a semi-synthetic dataset, our method achieves superior performance compared to the state of the art in self-supervised adverse weather denoising (23% improvement). Moreover, experiments with a real multi-echo adverse weather dataset prove the efficacy of multi-echo denoising. Our work enables more reliable point cloud acquisition in adverse weather and thus promises safer autonomous driving and driving assistance systems in such conditions. The code is available at this https URL
https://arxiv.org/abs/2305.14008
This paper investigates the advantages of using the Bird's Eye View (BEV) representation in 360-degree visual place recognition (VPR). We propose a novel network architecture that utilizes the BEV representation in feature extraction, feature aggregation, and vision-LiDAR fusion, bridging visual cues and spatial awareness. Our method extracts image features using standard convolutional networks and combines the features according to pre-defined 3D grid spatial points. To alleviate the mechanical and temporal misalignments between cameras, we further introduce deformable attention to learn the compensation. Upon the BEV feature representation, we then employ the polar transform and the discrete Fourier transform for aggregation, which is shown to be rotation-invariant. In addition, the image and point cloud cues can easily be expressed in the same coordinates, which benefits sensor fusion for place recognition. The proposed BEV-based method is evaluated in ablation and comparative studies on two datasets, covering on-the-road and off-the-road scenarios. The experimental results verify the hypothesis that BEV can benefit VPR, as it delivers superior performance compared to baseline methods. To the best of our knowledge, this is the first attempt at employing the BEV representation in this task.
https://arxiv.org/abs/2305.13814
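The rotation-invariance claim can be illustrated in isolation: resampling a BEV feature map onto a polar grid turns a yaw rotation into a circular shift along the angular axis, and the magnitude of the discrete Fourier transform along that axis is invariant to circular shifts. A toy 2D sketch (not the full aggregation module) follows:

    import numpy as np

    def angular_dft_magnitude(polar_feat):
        """polar_feat: (num_angles, num_radii) features on a polar grid. The DFT
        magnitude along the angular axis is invariant to circular shifts, i.e. to
        yaw rotations of the original BEV map."""
        return np.abs(np.fft.fft(polar_feat, axis=0))

    polar = np.random.rand(64, 32)
    rotated = np.roll(polar, shift=9, axis=0)   # yaw rotation = circular shift in angle
    print(np.allclose(angular_dft_magnitude(polar), angular_dft_magnitude(rotated)))  # True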
The emerging topic of cross-source point cloud (CSPC) registration has attracted increasing attention against the background of rapidly developing 3D sensor technologies. Different from conventional same-source point clouds, which come from the same kind of 3D sensor (e.g., Kinect), CSPCs come from different kinds of 3D sensors (e.g., Kinect and LiDAR). CSPC registration generalizes the requirement of data acquisition from a single source to different sources, which leads to broader applications and combines the advantages of multiple sensors. In this paper, we provide a systematic review of CSPC registration. We first present the characteristics of CSPCs and then summarize the key challenges in this research area, followed by the corresponding research progress consisting of the most recent and representative developments on this topic. Finally, we discuss the important research directions in this vibrant area and explain its role in several application fields.
https://arxiv.org/abs/2305.13570
Evaluating simultaneous localization and mapping (SLAM) algorithms necessitates high-precision and dense ground truth (GT) trajectories. However, obtaining desirable GT trajectories is sometimes challenging without GT tracking sensors. As an alternative, in this paper, we propose a novel prior-assisted SLAM system, built on a factor graph framework, to generate a full six-degree-of-freedom (6-DOF) trajectory at around 10 Hz for benchmarking. Our degeneracy-aware map factor utilizes a prior point cloud map and the LiDAR frame for point-to-plane optimization, simultaneously detecting degenerate cases to reduce drift and enhance the consistency of pose estimation. Our system is seamlessly integrated with cutting-edge odometry via a loosely coupled scheme to generate high-rate and precise trajectories. Moreover, we propose a norm-constrained gravity factor for stationary cases, optimizing pose and gravity to boost performance. Extensive evaluations demonstrate our algorithm's superiority over existing SLAM and map-based methods in diverse scenarios in terms of precision, smoothness, and robustness. Our approach substantially advances reliable and accurate SLAM evaluation methods, fostering progress in robotics research.
https://arxiv.org/abs/2305.13147
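The point-to-plane residual at the core of such a map factor has a standard form: for a LiDAR point p, its matched prior-map point q and the local plane normal n, the residual is n . (R p + t - q). A minimal sketch of evaluating these residuals (nearest-neighbor matching, degeneracy detection, and the factor-graph machinery are omitted):

    import numpy as np

    def point_to_plane_residuals(R, t, points, map_points, map_normals):
        """points: (N, 3) LiDAR points; map_points/map_normals: (N, 3) matched
        prior-map points and their local plane normals."""
        transformed = points @ R.T + t
        return np.einsum('ij,ij->i', map_normals, transformed - map_points)

    # Toy usage: identity pose, matches on a flat ground plane at z = 0.
    pts = np.random.rand(5, 3)
    q = pts.copy(); q[:, 2] = 0.0
    n = np.tile([0.0, 0.0, 1.0], (5, 1))
    print(point_to_plane_residuals(np.eye(3), np.zeros(3), pts, q, n))  # the z heights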
We present a new self-supervised paradigm for point cloud sequence understanding. Inspired by discriminative and generative self-supervised methods, we design two tasks, namely point cloud sequence-based Contrastive Prediction and Reconstruction (CPR), to collaboratively learn more comprehensive spatiotemporal representations. Specifically, dense point cloud segments are first input into an encoder to extract embeddings. All but the last ones are then aggregated by a context-aware autoregressor to make predictions for the last target segment. Towards the goal of modeling multi-granularity structures, local and global contrastive learning are performed between predictions and targets. To further improve the generalization of the representations, the predictions are also utilized to reconstruct raw point cloud sequences through a decoder, where point cloud colorization is employed to discriminate between different frames. By combining the classic contrastive and reconstruction paradigms, the learned representations exhibit both global discrimination and local perception. We conduct experiments on four point cloud sequence benchmarks and report results on action recognition and gesture recognition under multiple experimental settings. The performance is comparable with supervised methods and shows powerful transferability.
https://arxiv.org/abs/2305.12959
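The contrastive objective of CPR is described only at a high level; a generic InfoNCE-style loss between predicted and target segment embeddings, a common choice for contrastive prediction and not necessarily the paper's exact objective, looks like:

    import torch
    import torch.nn.functional as F

    def info_nce(pred, target, temperature=0.07):
        """pred, target: (B, D) embeddings; the i-th target is the positive for the
        i-th prediction, all other targets in the batch serve as negatives."""
        pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
        logits = pred @ target.t() / temperature      # (B, B) similarity matrix
        labels = torch.arange(pred.shape[0])
        return F.cross_entropy(logits, labels)

    print(info_nce(torch.randn(16, 128), torch.randn(16, 128)).item())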
We introduce a novel approach for measuring the total curvature at every triangle of a discrete surface. This method takes advantage of the relationship between per-triangle total curvature and the Dirichlet energy of the Gauss map. This new tool can be used on both triangle meshes and point clouds and has numerous applications. In this study, we demonstrate the effectiveness of our technique by using it for feature-aware mesh decimation and show that it outperforms existing curvature-estimation methods from popular libraries such as Meshlab, Trimesh2, and Libigl. When estimating curvature on point clouds, our method outperforms the popular libraries PCL and CGAL.
https://arxiv.org/abs/2305.12653
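Assuming per-vertex unit normals are available, the Dirichlet energy of the piecewise-linear Gauss map over a single triangle can be evaluated with the standard cotangent formula; whether this matches the paper's exact estimator is not stated in the abstract:

    import numpy as np

    def triangle_gauss_map_dirichlet_energy(v, n):
        """v: (3, 3) triangle vertex positions, n: (3, 3) unit vertex normals.
        Dirichlet energy of the linearly interpolated Gauss map over the triangle:
        E = 1/4 * sum_i cot(alpha_i) * ||n_j - n_k||^2, alpha_i the angle at v_i."""
        E = 0.0
        for i in range(3):
            j, k = (i + 1) % 3, (i + 2) % 3
            e1, e2 = v[j] - v[i], v[k] - v[i]
            cot = np.dot(e1, e2) / np.linalg.norm(np.cross(e1, e2))
            E += 0.25 * cot * np.sum((n[j] - n[k]) ** 2)
        return E

    # A flat triangle with identical normals carries zero energy (zero curvature).
    tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    flat = np.tile([0.0, 0.0, 1.0], (3, 1))
    print(triangle_gauss_map_dirichlet_energy(tri, flat))  # 0.0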