DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in sparse point clouds are relatively small with respect to the whole scene and often have similar geometry while lacking distinctive appearance for segmentation, properties that are rare in the image domain. Considering that 3D instances are characterized mainly by their positional information, we emphasize this role during modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% PQ on the SemanticKITTI and nuScenes benchmarks, respectively. The source code and models are available at this https URL.
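As a rough illustration of the position-centric idea, the hedged sketch below combines Cartesian and polar parameterizations of point coordinates into one embedding; the module name, dimensions, and exact mixing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MixedPositionalEmbedding(nn.Module):
    """Illustrative mixed-parameterized positional embedding (not the P3Former code)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.cart_mlp = nn.Sequential(nn.Linear(3, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))
        self.polar_mlp = nn.Sequential(nn.Linear(3, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))

    def forward(self, xyz):                        # xyz: (N, 3) point coordinates
        x, y, z = xyz.unbind(-1)
        rho = torch.sqrt(x ** 2 + y ** 2)          # radial distance in the ground plane
        phi = torch.atan2(y, x)                    # azimuth angle
        polar = torch.stack([rho, phi, z], dim=-1)
        return self.cart_mlp(xyz) + self.polar_mlp(polar)   # mixed embedding, (N, embed_dim)

pe = MixedPositionalEmbedding()
print(pe(torch.randn(1024, 3)).shape)              # torch.Size([1024, 128])
```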
https://arxiv.org/abs/2303.13509
Modern surgeries are performed in complex and dynamic settings, including ever-changing interactions between medical staff, patients, and equipment. The holistic modeling of the operating room (OR) is, therefore, a challenging but essential task, with the potential to optimize the performance of surgical teams and aid in developing new surgical technologies to improve patient outcomes. The holistic representation of surgical scenes as semantic scene graphs (SGG), where entities are represented as nodes and relations between them as edges, is a promising direction for fine-grained semantic OR understanding. We propose, for the first time, the use of temporal information for more accurate and consistent holistic OR modeling. Specifically, we introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction. We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images. We evaluate our method on the 4D-OR dataset and demonstrate that integrating temporality leads to more accurate and consistent results, achieving a +5% increase and a new SOTA of 0.88 in macro F1. This work opens the path for representing the entire surgery history with memory scene graphs and improves the holistic understanding in the OR. Introducing scene graphs as memory representations can offer a valuable tool for many temporal understanding tasks.
https://arxiv.org/abs/2303.13293
Deep point cloud registration methods face challenges with partial overlaps and rely on labeled data. To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for point clouds with partial overlaps. Specifically, we first adopt a network to learn posterior probability distributions of Gaussian mixture models (GMMs) from point clouds. To handle partial point cloud registration, we apply the Sinkhorn algorithm to predict the distribution-level correspondences under the constraint of the mixing weights of GMMs. To enable unsupervised learning, we design three distribution consistency-based losses: self-consistency, cross-consistency, and local contrastive. The self-consistency loss is formulated by encouraging GMMs in Euclidean and feature spaces to share identical posterior distributions. The cross-consistency loss derives from the fact that the points of two partially overlapping point clouds belonging to the same clusters share the cluster centroids. The cross-consistency loss allows the network to flexibly learn a transformation-invariant posterior distribution of two aligned point clouds. The local contrastive loss facilitates the network to extract discriminative local features. Our UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks.
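For readers unfamiliar with the Sinkhorn step, the hedged sketch below shows generic entropy-regularized Sinkhorn iterations that turn a cost matrix between two sets of GMM components into a soft transport plan whose marginals match the mixing weights; it is an illustration of the algorithm, not the UDPReg code.

```python
import torch

def sinkhorn(cost, w_src, w_tgt, eps=0.05, n_iters=50):
    # cost: (M, N) distances between source/target Gaussian components
    # w_src: (M,), w_tgt: (N,) mixing weights (each sums to 1)
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(w_src)
    v = torch.ones_like(w_tgt)
    for _ in range(n_iters):                     # alternate row/column scaling
        u = w_src / (K @ v)
        v = w_tgt / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)     # transport plan ~ soft correspondences

M, N = 16, 16
cost = torch.cdist(torch.randn(M, 3), torch.randn(N, 3))
plan = sinkhorn(cost, torch.full((M,), 1 / M), torch.full((N,), 1 / N))
print(plan.sum(dim=1))                           # row sums ≈ source mixing weights
```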
https://arxiv.org/abs/2303.13290
Point cloud (PCD) anomaly detection steadily emerges as a promising research area. This study aims to improve PCD anomaly detection performance by combining handcrafted PCD descriptions with powerful pre-trained 2D neural networks. To this end, this study proposes the Complementary Pseudo Multimodal Feature (CPMF), which incorporates local geometrical information in the 3D modality using handcrafted PCD descriptors and global semantic information in a generated pseudo-2D modality using pre-trained 2D neural networks. For global semantics extraction, CPMF projects the original PCD into a pseudo-2D modality containing multi-view images. These images are delivered to pre-trained 2D neural networks for informative 2D modality feature extraction. The 3D and 2D modality features are aggregated to obtain the CPMF for PCD anomaly detection. Extensive experiments demonstrate the complementary capacity between 2D and 3D modality features and the effectiveness of CPMF, with 95.15% image-level AU-ROC and 92.93% pixel-level PRO on the MVTec3D benchmark. Code is available on this https URL.
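A hedged, much-simplified sketch of the pseudo-2D pathway: points are rasterized into a few orthographic depth views and passed through a pre-trained 2D backbone whose pooled features serve as a global descriptor. The rendering, the missing normalization, and the backbone choice are placeholder assumptions (a recent torchvision is assumed), not the CPMF implementation.

```python
import math
import torch
from torchvision.models import resnet18, ResNet18_Weights

def depth_views(points, n_views=4, res=224):
    """Rasterize (N, 3) points into n_views orthographic depth maps (very simplified)."""
    views = []
    for k in range(n_views):
        c, s = math.cos(2 * math.pi * k / n_views), math.sin(2 * math.pi * k / n_views)
        rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        p = points @ rot.t()                                      # rotate to the k-th viewpoint
        lo, hi = p[:, :2].min(0).values, p[:, :2].max(0).values
        uv = ((p[:, :2] - lo) / (hi - lo + 1e-8) * (res - 1)).long()
        img = torch.zeros(res, res)
        img[uv[:, 1], uv[:, 0]] = p[:, 2] - p[:, 2].min()         # depth value per occupied pixel
        views.append(img.expand(3, -1, -1))                       # replicate to 3 channels
    return torch.stack(views)                                     # (n_views, 3, res, res)

backbone = resnet18(weights=ResNet18_Weights.DEFAULT).eval()      # downloads ImageNet weights
extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the classification head
with torch.no_grad():
    feats = extractor(depth_views(torch.randn(2048, 3))).flatten(1)  # (n_views, 512)
print(feats.mean(0).shape)                                        # pooled pseudo-2D descriptor
```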
https://arxiv.org/abs/2303.13194
Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a fundamental task for human-robot interaction. Recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work intuitively introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to guide research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our code and dataset will be made publicly available.
https://arxiv.org/abs/2303.13186
In this paper, we propose a novel representation for grasping using contacts between multi-finger robotic hands and objects to be manipulated. This representation significantly reduces the prediction dimensions and accelerates the learning process. We present an effective end-to-end network, CMG-Net, for grasping unknown objects in a cluttered environment by efficiently predicting multi-finger grasp poses and hand configurations from a single-shot point cloud. Moreover, we create a synthetic grasp dataset that consists of five thousand cluttered scenes, 80 object categories, and 20 million annotations. We perform a comprehensive empirical study and demonstrate the effectiveness of our grasping representation and CMG-Net. Our work significantly outperforms the state-of-the-art for three-finger robotic hands. We also demonstrate that the model trained using synthetic data performs very well for real robots.
https://arxiv.org/abs/2303.13182
Self-supervised learning is attracting large attention in point cloud understanding. However, exploring discriminative and transferable features remains challenging due to the irregularity and sparsity of point clouds. We propose a geometrically and adaptively masked auto-encoder for self-supervised learning on point clouds, termed PointGame. PointGame contains two core components: GATE and EAT. GATE stands for the geometrical and adaptive token embedding module; it not only absorbs the conventional wisdom of geometric descriptors, which capture surface shape effectively, but also exploits adaptive saliency to focus on the salient parts of a point cloud. EAT stands for the external attention-based Transformer encoder with linear computational complexity, which increases the efficiency of the whole pipeline. Unlike cutting-edge unsupervised learning models, PointGame leverages geometric descriptors to perceive surface shapes and adaptively mines discriminative features from training data. PointGame showcases clear advantages over its competitors on various downstream tasks under both global and local fine-tuning strategies. The code and pre-trained models will be publicly available.
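The external-attention idea, two small learnable memories replacing pairwise self-attention so the cost is linear in the number of tokens, can be sketched as below; this follows the generic published external-attention formulation and is only an assumed stand-in for PointGame's EAT block.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Generic external attention with double normalization (illustrative)."""
    def __init__(self, dim=256, mem_size=64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)    # external key memory
        self.mv = nn.Linear(mem_size, dim, bias=False)    # external value memory

    def forward(self, x):                     # x: (B, N, dim) point tokens
        attn = self.mk(x)                     # (B, N, mem_size)
        attn = attn.softmax(dim=1)            # normalize over the N tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # double normalization
        return self.mv(attn)                  # (B, N, dim), cost O(N * mem_size)

ea = ExternalAttention()
print(ea(torch.randn(2, 1024, 256)).shape)    # torch.Size([2, 1024, 256])
```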
https://arxiv.org/abs/2303.13100
Channel pruning can effectively reduce both the computational cost and memory footprint of the original network while maintaining comparable accuracy. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend channel pruning methods to 3D point-based neural networks (PNNs). Directly applying 2D CNN channel pruning methods to PNNs undermines their performance because of the different representations of 2D images and 3D point clouds, as well as the disparity in network architectures. In this paper, we propose CP^3, a Channel Pruning Plug-in for Point-based networks. CP^3 is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs. Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the points discarded in the PNN's sampling process, reconsidering their potentially exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP^3 consistently improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S achieves an accuracy of 88.52% on ScanObjectNN with a pruning rate of 57.8%, outperforming the baseline pruning methods with an accuracy gain of 1.94%.
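The abstract does not spell the metric out, so the following is only a hedged guess at what a coordinate-enhanced channel importance score could look like: blend the usual activation-magnitude criterion with how strongly each channel correlates with the point coordinates. The function and weighting are assumptions for illustration, not the CP^3 metric.

```python
import torch

def coord_enhanced_importance(feats, xyz, alpha=0.5):
    # feats: (N, C) per-point channel activations, xyz: (N, 3) point coordinates
    magnitude = feats.abs().mean(dim=0)                            # (C,) classic importance
    f = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)            # standardize channels
    p = (xyz - xyz.mean(0)) / (xyz.std(0) + 1e-6)                  # standardize coordinates
    corr = (f.t() @ p / feats.shape[0]).abs().max(dim=1).values    # (C,) coordinate correlation
    return alpha * magnitude + (1 - alpha) * corr                  # blended channel score

score = coord_enhanced_importance(torch.randn(4096, 64), torch.randn(4096, 3))
keep = score.topk(k=32).indices                                    # channels retained after pruning
print(keep.shape)
```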
https://arxiv.org/abs/2303.13097
LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at this https URL.
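Roughly, radial windows can be obtained by binning only the two angles of spherical coordinates, so each window stays narrow but spans the whole radius, while an exponentially split radial bin provides a fine-grained position index along the ray. The sketch below is an illustrative assumption, not the SphereFormer implementation.

```python
import torch

def radial_windows(xyz, n_theta=64, n_phi=16, n_r=12, r_max=80.0):
    """Return a per-point window id (angular bins, long along the radius) and an
    exponentially split radial bin usable for fine-grained position encoding."""
    x, y, z = xyz.unbind(-1)
    r = xyz.norm(dim=-1).clamp(min=1e-6)
    theta = torch.atan2(y, x)                               # azimuth in (-pi, pi]
    phi = torch.acos((z / r).clamp(-1, 1))                  # inclination in [0, pi]
    t_bin = ((theta + torch.pi) / (2 * torch.pi) * n_theta).long().clamp(max=n_theta - 1)
    p_bin = (phi / torch.pi * n_phi).long().clamp(max=n_phi - 1)
    window_id = t_bin * n_phi + p_bin                       # one narrow, radially long window
    # exponential splitting: fine position bins near the sensor, coarse bins far away
    r_bin = (torch.log1p(r) / torch.log1p(torch.tensor(r_max)) * n_r).long().clamp(max=n_r - 1)
    return window_id, r_bin

win_id, r_bin = radial_windows(torch.randn(10000, 3) * 30)
print(win_id.unique().numel(), "non-empty windows")
```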
https://arxiv.org/abs/2303.12766
We introduce Uni-Fusion, a universal continuous mapping framework for surfaces, surface properties (color, infrared, etc.) and more (latent features in CLIP embedding space, etc.). We propose the first Universal Implicit Encoding model that supports encoding of both geometry and various types of properties (RGB, infrared, features, etc.) without the need for any training. Based on this, our framework divides the point cloud into regular grid voxels and produces a latent feature in each voxel to form a Latent Implicit Map (LIM) for geometry and arbitrary properties. Then, by fusing the local LIM of a new frame into the global LIM, an incremental reconstruction is achieved. Encoded with the corresponding types of data, our Latent Implicit Map can generate continuous surfaces, surface property fields, surface feature fields, and any other possible options. To demonstrate the capabilities of our model, we implement three applications: (1) incremental reconstruction for surfaces and color, (2) 2D-to-3D fabricated property transfer, and (3) open-vocabulary scene understanding by producing a text CLIP feature field on surfaces. We evaluate Uni-Fusion against corresponding baselines in each application, showing that it adapts flexibly to various applications while performing best or competitively. The project page of Uni-Fusion is available at this https URL
https://arxiv.org/abs/2303.12678
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited text-3D data pairs, adapting the success of 2D Vision-Language Models (VLMs) to the 3D space remains an open problem. Existing works that leverage VLMs for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn transferable 3D point cloud representations in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn point cloud representations aligned at both the semantic and instance level. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, boosting state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present an optional ensemble scheme.
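The semantic-level part of such a cross-modal objective amounts to a symmetric InfoNCE loss over matched triplets; a hedged sketch, ignoring the instance-level term and the proxy construction, might look like the following (temperature and embedding sizes are assumptions).

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def triplet_alignment_loss(point_emb, image_emb, text_emb):
    # pull matched point/image/text proxies together, push mismatched ones apart
    return (info_nce(point_emb, image_emb) +
            info_nce(point_emb, text_emb) +
            info_nce(image_emb, text_emb)) / 3.0

loss = triplet_alignment_loss(torch.randn(32, 512), torch.randn(32, 512), torch.randn(32, 512))
print(loss.item())
```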
https://arxiv.org/abs/2303.12417
Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale registration methods are rarely explored. Challenges mainly arise from the huge point number, complex distribution, and outliers of outdoor LiDAR scans. In addition, most existing registration works adopt a two-stage paradigm: they first find correspondences by extracting discriminative local features, and then leverage estimators (e.g., RANSAC) to filter outliers, making them highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment without any further post-processing. Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers by extracting point features globally. Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes. Furthermore, to effectively reduce mismatches, a bijective association transformer is designed for regressing the initial transformation. Extensive experiments on the KITTI and NuScenes datasets demonstrate that our RegFormer achieves state-of-the-art performance in terms of both accuracy and efficiency.
https://arxiv.org/abs/2303.12384
Modern depth sensors such as LiDAR operate by sweeping laser-beams across the scene, resulting in a point cloud with notable 1D curve-like structures. In this work, we introduce a new point cloud processing scheme and backbone, called CurveCloudNet, which takes advantage of the curve-like structure inherent to these sensors. While existing backbones discard the rich 1D traversal patterns and rely on Euclidean operations, CurveCloudNet parameterizes the point cloud as a collection of polylines (dubbed a "curve cloud"), establishing a local surface-aware ordering on the points. Our method applies curve-specific operations to process the curve cloud, including a symmetric 1D convolution, a ball grouping for merging points along curves, and an efficient 1D farthest point sampling algorithm on curves. By combining these curve operations with existing point-based operations, CurveCloudNet is an efficient, scalable, and accurate backbone with low GPU memory requirements. Evaluations on the ShapeNet, Kortx, Audi Driving, and nuScenes datasets demonstrate that CurveCloudNet outperforms both point-based and sparse-voxel backbones in various segmentation settings, notably scaling better to large scenes than point-based alternatives while exhibiting better single object performance than sparse-voxel alternatives.
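The efficiency of curve operations comes from the points being ordered along each polyline. Below is a hedged sketch of curve-restricted farthest point sampling, approximated here by even arc-length spacing along one curve; function names and the approximation are assumptions, not the authors' code.

```python
import torch

def curve_fps(curve_xyz, n_samples):
    """Approximate farthest point sampling along one polyline by even arc-length spacing."""
    # curve_xyz: (N, 3) points ordered along one laser sweep ("curve")
    seg = (curve_xyz[1:] - curve_xyz[:-1]).norm(dim=-1)           # segment lengths
    arc = torch.cat([torch.zeros(1), seg.cumsum(0)])              # cumulative arc length
    targets = torch.linspace(0, float(arc[-1]), n_samples)        # evenly spaced positions
    idx = torch.searchsorted(arc, targets).clamp(max=len(arc) - 1)
    return idx                                                    # indices of sampled points

curve = torch.cumsum(torch.randn(500, 3) * 0.05, dim=0)           # a synthetic wiggly polyline
print(curve_fps(curve, 32))
```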
https://arxiv.org/abs/2303.12050
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker M^2-Track. At the 1st-stage, M^2-Track localizes the target within successive frames via motion transformation. Then it refines the target box through motion-assisted shape completion at the 2nd-stage. Due to the motion-centric nature, our method shows its impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully-supervised setting, extensive experiments confirm that M^2-Track significantly outperforms previous state-of-the-arts on three large-scale datasets while running at 57FPS (~8%, ~17% and ~22% precision gains on KITTI, NuScenes, and Waymo Open Dataset respectively). While under the semi-supervised setting, our method performs on par with or even surpasses its fully-supervised counterpart using fewer than half labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.
https://arxiv.org/abs/2303.12535
In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches suffer from a dilemma: they fail to fully utilize the characteristics of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework that consists of 2D and 3D branches with multiple bidirectional fusion connections between them at specific layers. Different from previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named the bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point error from the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. Besides, our methods have strong generalization performance and the ability to handle non-rigid motion. Code is available at this https URL.
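A hedged sketch of one bidirectional fusion step between a dense image feature map and sparse per-point features, given precomputed pixel coordinates for each point: image features are sampled into the 3D branch, and point features are splatted back into the 2D branch. Layer names, shapes, and the simple concatenate-and-mix design are assumptions; the published Bi-CLFM is more elaborate.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Illustrative camera-LiDAR fusion step (not the published Bi-CLFM)."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_point = nn.Linear(2 * dim, dim)    # image -> point mixing
        self.to_image = nn.Linear(2 * dim, dim)    # point -> image mixing

    def forward(self, img_feat, pt_feat, uv):
        # img_feat: (C, H, W), pt_feat: (N, C), uv: (N, 2) integer pixel coords per point
        C, H, W = img_feat.shape
        sampled = img_feat[:, uv[:, 1], uv[:, 0]].t()                  # (N, C) image feature per point
        pt_out = self.to_point(torch.cat([pt_feat, sampled], dim=-1))  # fuse into the 3D branch
        canvas = img_feat.clone().reshape(C, H * W)
        flat = uv[:, 1] * W + uv[:, 0]
        canvas.index_add_(1, flat, pt_feat.t())                        # splat point features to pixels
        img_out = self.to_image(torch.cat([img_feat, canvas.reshape(C, H, W)], dim=0)
                                .permute(1, 2, 0)).permute(2, 0, 1)    # fuse into the 2D branch
        return img_out, pt_out

fusion = BidirectionalFusion()
img, pts = torch.randn(64, 48, 160), torch.randn(2000, 64)
uv = torch.stack([torch.randint(0, 160, (2000,)), torch.randint(0, 48, (2000,))], dim=-1)
i_out, p_out = fusion(img, pts, uv)
print(i_out.shape, p_out.shape)
```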
https://arxiv.org/abs/2303.12017
We propose a deep learning-based LiDAR odometry estimation method called LoRCoN-LO that utilizes the long-term recurrent convolutional network (LRCN) structure. The LRCN layer processes spatial and temporal information at once by using both CNN and LSTM layers. This makes it suitable for predicting continuous robot movements from point clouds, which contain spatial information. Therefore, we built a LoRCoN-LO model using the LRCN layer and predicted the pose of the robot through this model. For performance verification, we conducted experiments on a public dataset (KITTI). The results show that LoRCoN-LO produces accurate odometry predictions on this dataset. The code is available at this https URL.
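A hedged sketch of an LRCN-style odometry head: a small CNN encodes each frame's projected LiDAR image, an LSTM carries state across frames, and a linear layer regresses a 6-DoF relative pose. Shapes, channel counts, and the pose parameterization are illustrative assumptions, not the LoRCoN-LO configuration.

```python
import torch
import torch.nn as nn

class LRCNOdometry(nn.Module):
    """CNN-per-frame encoder followed by an LSTM over time (illustrative)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                            # per-frame spatial encoder
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(64, hidden, batch_first=True)    # temporal aggregation
        self.head = nn.Linear(hidden, 6)                     # translation + rotation parameters

    def forward(self, frames):                # frames: (B, T, 2, H, W) range/intensity images
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out)                 # (B, T, 6) per-step relative pose

model = LRCNOdometry()
print(model(torch.randn(2, 5, 2, 64, 900)).shape)   # torch.Size([2, 5, 6])
```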
https://arxiv.org/abs/2303.11853
Novel class discovery (NCD) for semantic segmentation is the task of learning a model that can segment unlabelled (novel) classes using only the supervision from labelled (base) classes. This problem has recently been pioneered for 2D image data, but no work exists for 3D point cloud data. In fact, the assumptions made for 2D are loosely applicable to 3D in this case. This paper is presented to advance the state of the art on point cloud data analysis in four directions. Firstly, we address the new problem of NCD for point cloud semantic segmentation. Secondly, we show that the transposition of the only existing NCD method for 2D semantic segmentation to 3D data is suboptimal. Thirdly, we present a new method for NCD based on online clustering that exploits uncertainty quantification to produce prototypes for pseudo-labelling the points of the novel classes. Lastly, we introduce a new evaluation protocol to assess the performance of NCD for point cloud semantic segmentation. We thoroughly evaluate our method on SemanticKITTI and SemanticPOSS datasets, showing that it can significantly outperform the baseline. Project page at this link: this https URL.
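One way such uncertainty-aware online clustering could look, as a hedged sketch rather than the paper's exact procedure: assign novel-class points to prototypes, keep only low-uncertainty assignments (here measured by the softmax margin) as pseudo-labels, and update the prototypes with an exponential moving average.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(feats, prototypes, momentum=0.9, tau=0.1, margin=0.2):
    # feats: (N, D) features of points predicted as "novel"; prototypes: (K, D)
    sim = F.normalize(feats, dim=-1) @ F.normalize(prototypes, dim=-1).t()   # (N, K)
    prob = (sim / tau).softmax(dim=-1)
    top2 = prob.topk(2, dim=-1).values
    confident = (top2[:, 0] - top2[:, 1]) > margin        # keep low-uncertainty points only
    labels = prob.argmax(dim=-1)
    for k in range(prototypes.shape[0]):                  # EMA prototype update
        sel = confident & (labels == k)
        if sel.any():
            prototypes[k] = momentum * prototypes[k] + (1 - momentum) * feats[sel].mean(0)
    return labels, confident, prototypes

labels, keep, protos = pseudo_label_step(torch.randn(5000, 128), torch.randn(4, 128))
print(keep.float().mean().item(), "fraction of points pseudo-labelled")
```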
https://arxiv.org/abs/2303.11610
In this paper, we present Smart-Tree, a supervised method for approximating the medial axes of branch skeletons from a tree's point cloud. A sparse voxel convolutional neural network extracts each input point's radius and direction towards the medial axis. A greedy algorithm performs robust skeletonization using the estimated medial axis. The proposed method provides robustness to complex tree structures and improves fidelity when dealing with self-occlusions, complex geometry, touching branches, and varying point densities. We train and test the method using a multi-species synthetic tree data set and perform qualitative analysis on a real-life tree point cloud. Experimentation with synthetic and real-world datasets demonstrates the robustness of our approach over the current state-of-the-art method. Further research will focus on training the method on a broader range of tree species and improving robustness to point cloud gaps. The details to obtain the dataset are at this https URL.
https://arxiv.org/abs/2303.11560
Robust point cloud classification is crucial for real-world applications, as consumer-grade 3D sensors often yield partial and noisy data, degraded by various artifacts. In this work we propose a general ensemble framework based on partial point cloud sampling. Each ensemble member is exposed to only partial input data. Three sampling strategies are used jointly: two local ones, based on patches and curves, and a global one of random sampling. We demonstrate the robustness of our method to various local and global degradations. We show that our framework significantly improves the robustness of top classification networks. Our experimental setting uses the recently introduced ModelNet-C database by Ren et al. [24], where we reach SOTA both on unaugmented and on augmented data. Our unaugmented mean Corruption Error (mCE) is 0.64 (current SOTA is 0.86) and 0.50 for augmented data (current SOTA is 0.57). We analyze and explain these remarkable results through diversity analysis.
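A hedged sketch of the ensemble idea: run a classifier on several partial views of the same cloud, here a global random subsample and local patch crops (the curve-based sampler is omitted for brevity), and average the predictions. `classifier`, the member count, and the sample size are placeholders, not the paper's configuration.

```python
import torch

def random_subsample(points, m):
    return points[torch.randperm(points.shape[0])[:m]]

def patch_crop(points, m):
    seed = points[torch.randint(points.shape[0], (1,))]
    idx = (points - seed).norm(dim=-1).topk(m, largest=False).indices   # m nearest neighbours
    return points[idx]

def ensemble_predict(classifier, points, n_members=8, m=512):
    logits = []
    for i in range(n_members):
        part = random_subsample(points, m) if i % 2 == 0 else patch_crop(points, m)
        logits.append(classifier(part.unsqueeze(0)))     # each member sees a partial view
    return torch.stack(logits).mean(0).softmax(dim=-1)   # average the member predictions

dummy_classifier = lambda x: torch.randn(1, 40)          # stands in for a trained network
print(ensemble_predict(dummy_classifier, torch.randn(2048, 3)).shape)
```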
https://arxiv.org/abs/2303.11419
Implicit generative models have been widely employed to model 3D data and have recently proven to be successful in encoding and generating high-quality 3D shapes. This work builds upon these models and alleviates current limitations by presenting the first implicit generative model that facilitates the generation of complex 3D shapes with rich internal geometric details. To achieve this, our model uses unsigned distance fields to represent nested 3D surfaces allowing learning from non-watertight mesh data. We propose a transformer-based autoregressive model for 3D shape generation that leverages context-rich tokens from vector quantized shape embeddings. The generated tokens are decoded into an unsigned distance field which is rendered into a novel 3D shape exhibiting a rich internal structure. We demonstrate that our model achieves state-of-the-art point cloud generation results on popular classes of 'Cars', 'Planes', and 'Chairs' of the ShapeNet dataset. Additionally, we curate a dataset that exclusively comprises shapes with realistic internal details from the 'Cars' class of ShapeNet and demonstrate our method's efficacy in generating these shapes with internal geometry.
https://arxiv.org/abs/2303.11235