Manipulating deformable objects remains a challenge in robotics due to the difficulties of state estimation, long-horizon planning, and predicting how an object will deform given an interaction. These challenges are most pronounced for 3D deformable objects. We propose SculptDiff, a goal-conditioned diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge, this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: this https URL
https://arxiv.org/abs/2403.10401
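The SculptDiff abstract does not spell out the sampling procedure, so the following is only a rough, hypothetical sketch of how a goal-conditioned diffusion policy can produce an action sequence: DDPM-style reverse diffusion over a short action horizon, with the learned denoiser replaced by a stand-in function. All names and dimensions here are illustrative, not from the paper.

```python
import numpy as np

def cosine_alphas(T=50):
    # Standard cosine noise schedule (Nichol & Dhariwal style).
    t = np.linspace(0, 1, T + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    alphas = alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(alphas, 1e-4, 0.9999), alpha_bar[1:]

def noise_predictor(x_t, t, obs_emb, goal_emb):
    # Stand-in for the learned denoiser; a real policy would be a trained
    # network taking the noisy action sequence, the timestep, and the
    # point-cloud observation/goal embeddings.
    return 0.1 * (x_t - np.tanh(obs_emb.mean() + goal_emb.mean()))

def sample_action(obs_emb, goal_emb, horizon=8, act_dim=7, T=50, seed=0):
    rng = np.random.default_rng(seed)
    alphas, alpha_bar = cosine_alphas(T)
    x = rng.standard_normal((horizon, act_dim))  # start from pure noise
    for t in reversed(range(T)):
        eps = noise_predictor(x, t, obs_emb, goal_emb)
        # DDPM posterior mean step
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # sigma_t = sqrt(beta_t) variance choice
            x += np.sqrt(1 - alphas[t]) * rng.standard_normal(x.shape)
    return x  # a horizon of end-effector actions

actions = sample_action(obs_emb=np.ones(256), goal_emb=np.zeros(256))
print(actions.shape)  # (8, 7)
```

In practice the noise predictor would consume point-cloud encodings of the current clay state and the goal shape, which is what makes the policy goal-conditioned.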
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as faces, hands, or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and encode spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation for points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point clouds, or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D captures, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
https://arxiv.org/abs/2403.10357
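ANIM's pixel-aligned features reduce to projecting a 3D query point into the image and bilinearly interpolating a 2D feature map at that pixel. A minimal numpy sketch of just that step — camera intrinsics, the feature map, and the query points are all made up for illustration:

```python
import numpy as np

def project(points, K):
    # Pinhole projection of (N, 3) camera-space points with intrinsics K.
    uv = (K @ points.T).T
    return uv[:, :2] / uv[:, 2:3]

def bilinear_sample(feat, uv):
    # feat: (H, W, C) feature map; uv: (N, 2) continuous pixel coords.
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1.001)
    v = np.clip(uv[:, 1], 0, H - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    f00 = feat[v0, u0];     f01 = feat[v0, u0 + 1]
    f10 = feat[v0 + 1, u0]; f11 = feat[v0 + 1, u0 + 1]
    return (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
            + f10 * (1 - du) * dv + f11 * du * dv)

# Toy query: 3D points in front of the camera, a random feature map.
K = np.array([[500., 0, 128], [0, 500., 128], [0, 0, 1]])
pts = np.array([[0.0, 0.0, 2.0], [0.1, -0.05, 1.5]])
feat_map = np.random.rand(256, 256, 32)
pixel_aligned = bilinear_sample(feat_map, project(pts, K))
print(pixel_aligned.shape)  # (2, 32) -> fed to an SDF decoder
```

Voxel-aligned features work analogously, trilinearly interpolating a 3D feature volume at the query point instead.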
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and are thus unable to meet the processing demands of the current explosion of ordinary 3D data. In this paper, we seek to perform UV unwrapping on unstructured 3D point clouds. Technically, we propose ParaPoint, an unsupervised neural learning pipeline that achieves global free-boundary surface parameterization by building point-wise mappings between given 3D points and 2D UV coordinates with adaptively deformed boundaries. We ingeniously construct several geometrically meaningful sub-networks with specific functionalities and assemble them into a bi-directional cycle mapping framework. We also design effective loss functions and auxiliary differential geometric constraints to optimize the neural mapping process. To the best of our knowledge, this work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries. Experiments demonstrate the effectiveness and inspiring potential of our proposed learning paradigm. The code will be publicly available.
https://arxiv.org/abs/2403.10349
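The bi-directional cycle mapping in ParaPoint can be summarized as: map 3D points to UV coordinates, map them back, and penalize the round-trip error. A toy sketch with random stand-in "networks" (the actual sub-networks, boundary deformation, and differential-geometric constraints are far more elaborate):

```python
import numpy as np

def forward_map(points3d, W1, W2):
    # Hypothetical stand-in for the learned 3D -> 2D UV network.
    return np.tanh(points3d @ W1) @ W2          # (N, 2)

def inverse_map(uv, V1, V2):
    # Hypothetical stand-in for the learned 2D -> 3D network.
    return np.tanh(uv @ V1) @ V2                # (N, 3)

def cycle_loss(points3d, W1, W2, V1, V2):
    uv = forward_map(points3d, W1, W2)
    recon = inverse_map(uv, V1, V2)
    # Bi-directional cycle: 3D -> UV -> 3D should return to the input.
    return np.mean(np.sum((recon - points3d) ** 2, axis=1))

rng = np.random.default_rng(0)
pts = rng.standard_normal((1024, 3))
W1, W2 = rng.standard_normal((3, 64)), rng.standard_normal((64, 2))
V1, V2 = rng.standard_normal((2, 64)), rng.standard_normal((64, 3))
print(cycle_loss(pts, W1, W2, V1, V2))
```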
The value of roadside perception, which can extend the boundaries of autonomous driving and traffic management, has gradually become more prominent and acknowledged in recent years. However, existing roadside perception approaches focus only on single-infrastructure sensor systems, which cannot realize a comprehensive understanding of a traffic area because of their limited sensing range and blind spots. Toward high-quality roadside perception, we need Roadside Cooperative Perception (RCooper) to achieve practical area-coverage roadside perception for restricted traffic areas. RCooper poses its own domain-specific challenges, but further exploration is hindered by the lack of datasets. We hence release the first real-world, large-scale RCooper dataset to spur research on practical roadside cooperative perception, including detection and tracking. The manually annotated dataset comprises 50k images and 30k point clouds and covers two representative traffic scenes (i.e., intersection and corridor). The constructed benchmarks prove the effectiveness of roadside cooperative perception and point out directions for further research. Codes and dataset can be accessed at: this https URL.
https://arxiv.org/abs/2403.10145
Autonomous driving demands high-quality LiDAR data, yet the cost of physical LiDAR sensors presents a significant scaling-up challenge. While recent efforts have explored deep generative models to address this issue, they often consume substantial computational resources, suffer from slow generation speeds, and lack realism. To address these limitations, we introduce RangeLDM, a novel approach for rapidly generating high-quality range-view LiDAR point clouds via latent diffusion models. We achieve this by correcting the range-view data distribution for accurate projection from point clouds to range images via Hough voting, which has a critical impact on generative learning. We then compress the range images into a latent space with a variational autoencoder and leverage a diffusion model to enhance expressivity. Additionally, we instruct the model to preserve 3D structural fidelity by devising a range-guided discriminator. Experimental results on the KITTI-360 and nuScenes datasets demonstrate both the robust expressiveness and the fast speed of our LiDAR point cloud generation.
https://arxiv.org/abs/2403.10094
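RangeLDM operates on range images, i.e., spherical projections of LiDAR scans. A standard, uncorrected projection looks like the sketch below; the paper's contribution — Hough-voting-based correction of the vertical distribution before projection — is deliberately omitted, and the field-of-view angles are common 64-beam sensor values, not the paper's settings.

```python
import numpy as np

def to_range_image(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    # Spherical projection of an (N, 3) LiDAR scan into an H x W range image.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((1 - (yaw + np.pi) / (2 * np.pi)) * W).astype(int) % W
    v = ((fu - pitch) / (fu - fd) * H).astype(int)
    valid = (v >= 0) & (v < H)
    img = np.zeros((H, W), dtype=np.float32)
    img[v[valid], u[valid]] = r[valid]           # store range per pixel
    return img

scan = np.random.randn(100000, 3) * np.array([20, 20, 2])
print(to_range_image(scan).shape)  # (64, 1024)
```

The VAE and latent diffusion model then operate on these H x W range images like ordinary single-channel images.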
Addressing the challenges posed by the substantial gap between point cloud data collected from diverse sensors, achieving robust cross-source point cloud registration is a formidable task. In response, we present a novel framework for point cloud registration with broad applicability, suitable for both homologous and cross-source registration scenarios. To tackle the issues arising from the different densities and distributions of cross-source point cloud data, we introduce a feature representation based on spherical voxels. Furthermore, to address the challenge of numerous outliers and mismatches in cross-source registration, we propose a hierarchical correspondence filtering approach. This method progressively filters out mismatches, yielding a set of high-quality correspondences. Our method exhibits versatile applicability and excels in both traditional homologous registration and challenging cross-source registration scenarios. Specifically, in homologous registration on the 3DMatch dataset, we achieve the highest registration recall of 95.1% and an inlier ratio of 87.8%. In cross-source point cloud registration, our method attains the best registration recall on the 3DCSR dataset, a 9.3-percentage-point improvement. The code is available at this https URL.
https://arxiv.org/abs/2403.10085
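The spherical-voxel idea can be illustrated by binning a keypoint's neighbors in spherical coordinates; the sketch below shows only the binning and a raw histogram descriptor, not the learned features or the hierarchical correspondence filtering, and all bin counts are arbitrary choices.

```python
import numpy as np

def spherical_voxel_ids(points, center, n_r=8, n_theta=16, n_phi=32, r_max=50.0):
    # Assign each point to a spherical voxel (radial, polar, azimuthal bin)
    # around a keypoint; sensor-dependent density changes mostly shift the
    # counts within bins rather than the bin layout itself.
    d = points - center
    r = np.linalg.norm(d, axis=1) + 1e-8
    theta = np.arccos(np.clip(d[:, 2] / r, -1, 1))      # polar angle [0, pi]
    phi = np.arctan2(d[:, 1], d[:, 0]) + np.pi          # azimuth [0, 2pi)
    ir = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    it = np.clip((theta / np.pi * n_theta).astype(int), 0, n_theta - 1)
    ip = (phi / (2 * np.pi) * n_phi).astype(int) % n_phi
    return (ir * n_theta + it) * n_phi + ip              # flat voxel index

pts = np.random.randn(5000, 3) * 10
ids = spherical_voxel_ids(pts, center=np.zeros(3))
hist = np.bincount(ids, minlength=8 * 16 * 32)  # a crude local descriptor
print(hist.shape)  # (4096,)
```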
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without an available reference, a task that has achieved tremendous improvements through the use of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually generalize suboptimally. To solve this problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss, following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
https://arxiv.org/abs/2403.10066
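CoPA's anchor-generation step — projecting point clouds under different distortions to images and randomly mixing their local patches — is easy to sketch. The patch size and mix ratio below are arbitrary illustrative choices, not the paper's hyperparameters.

```python
import numpy as np

def mix_patches(img_a, img_b, patch=32, ratio=0.5, seed=0):
    # Build an anchor by randomly swapping local patches between two
    # projections of the same content under different distortions, so the
    # mixed image shares content with both while carrying both distortions.
    rng = np.random.default_rng(seed)
    out = img_a.copy()
    H, W = img_a.shape[:2]
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            if rng.random() < ratio:
                out[i:i+patch, j:j+patch] = img_b[i:i+patch, j:j+patch]
    return out

proj_blur = np.random.rand(224, 224, 3)   # projection of a blurred cloud
proj_noise = np.random.rand(224, 224, 3)  # projection of a noisy cloud
anchor = mix_patches(proj_blur, proj_noise)
print(anchor.shape)
```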
No-reference point cloud quality assessment (NR-PCQA) aims to automatically predict the perceptual quality of point clouds without reference, a task on which deep learning-based models have achieved remarkable performance. However, these data-driven models suffer from the scarcity of labeled data and perform unsatisfactorily in cross-dataset evaluations. To address this problem, we propose a self-supervised pre-training framework using masked autoencoders (PAME) to help the model learn useful representations without labels. Specifically, after projecting point clouds into images, our PAME employs dual-branch autoencoders that reconstruct masked patches of the distorted images into the corresponding original patches of the reference and distorted images. In this manner, the two branches can separately learn content-aware features and distortion-aware features from the projected images. Furthermore, in the model fine-tuning stage, the learned content-aware features serve as a guide to fuse the point cloud quality features extracted from different perspectives. Extensive experiments show that our method outperforms state-of-the-art NR-PCQA methods on popular benchmarks in terms of prediction accuracy and generalizability.
https://arxiv.org/abs/2403.10061
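The masking side of PAME's pre-training is standard MAE-style patch masking of a projected image; a minimal sketch of just that step (patch size and mask ratio are generic MAE defaults, not necessarily the paper's; the dual-branch reconstruction networks are omitted):

```python
import numpy as np

def mask_patches(img, patch=16, mask_ratio=0.75, seed=0):
    # Split a projected image into non-overlapping patches and zero out a
    # random subset; the autoencoder is trained to reconstruct the masked
    # patches (here we just return the masked image and the mask).
    rng = np.random.default_rng(seed)
    H, W = img.shape[:2]
    gh, gw = H // patch, W // patch
    n_mask = int(gh * gw * mask_ratio)
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    masked = img.copy()
    mask = np.zeros((gh, gw), dtype=bool)
    mask.flat[idx] = True
    for k in idx:
        i, j = divmod(k, gw)
        masked[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
    return masked, mask

img = np.random.rand(224, 224, 3)
masked, mask = mask_patches(img)
print(mask.mean())  # ~0.75 of patches hidden
```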
Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods that utilize images to enhance the performance of cross-domain 3D segmentation have recently emerged. However, the pseudo labels generated by models trained on the source domain, which provide additional supervised signals for the unseen domain, are inherently noisy and therefore inadequate for 3D segmentation, restricting the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant knowledge priors, we propose a novel pipeline, VFMSeg, that further enhances the cross-modal unsupervised domain adaptation framework by leveraging these models. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks is adopted to guide the generation of semantically augmented images and point clouds, mixing data from the source and target domains in the form of view frustums (FrustumMixing) to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement on the 3D segmentation task.
https://arxiv.org/abs/2403.10001
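FrustumMixing, as described, mixes source and target scans by view frustums. A guessed minimal version swaps an azimuthal sector between two scans together with their (ground-truth vs. pseudo) labels; in the paper the mixing is additionally guided by a VFM's fine-grained masks, which this sketch does not model.

```python
import numpy as np

def frustum_mix(src_pts, src_lbl, tgt_pts, tgt_pseudo, yaw_range=(0.0, np.pi / 2)):
    # Swap an azimuthal view frustum: target-scan points falling in the
    # sector replace the corresponding sector of the source scan, and the
    # per-point labels are mixed the same way.
    def in_sector(pts):
        yaw = np.arctan2(pts[:, 1], pts[:, 0]) + np.pi   # [0, 2pi)
        return (yaw >= yaw_range[0]) & (yaw < yaw_range[1])
    keep_src = ~in_sector(src_pts)
    take_tgt = in_sector(tgt_pts)
    mixed_pts = np.vstack([src_pts[keep_src], tgt_pts[take_tgt]])
    mixed_lbl = np.concatenate([src_lbl[keep_src], tgt_pseudo[take_tgt]])
    return mixed_pts, mixed_lbl

src = np.random.randn(8000, 3) * 15
tgt = np.random.randn(8000, 3) * 15
mixed, labels = frustum_mix(src, np.zeros(8000, int), tgt, np.ones(8000, int))
print(mixed.shape, labels.mean())
```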
This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model with the potential to be widely applied and extended in the fields of robotics and mobile devices. By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements of neural network models for point cloud processing tasks compared to full-precision point cloud networks. However, achieving a fully binary point cloud Transformer network, where all parts except the task-specific modules are binary, poses challenges and bottlenecks in quantizing the activations of Q, K, V and self-attention in the attention module, as they do not adhere to simple probability distributions and can vary with the input data. Furthermore, in our network, self-attention degrades in the binary attention module due to the uniform distribution that occurs after the softmax operation. The primary focus of this paper is addressing the performance degradation caused by the use of binary point cloud Transformer modules. We propose a novel binarization mechanism called dynamic-static hybridization. Specifically, our approach combines static binarization of the overall network model with fine-grained dynamic binarization of data-sensitive components. Furthermore, we make use of a novel hierarchical training scheme to obtain the optimal model and binarization parameters. These improvements allow the proposed binarization method to outperform binarization methods designed for convolutional neural networks when used in point cloud Transformer structures. To demonstrate the superiority of our algorithm, we conducted experiments on two different tasks: point cloud classification and place recognition.
https://arxiv.org/abs/2403.09998
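For context, standard static weight binarization with a straight-through estimator looks like the sketch below (XNOR-Net-style per-channel scaling). FBPT's dynamic-static hybridization would additionally recompute binarization parameters for data-sensitive activations at run time, which is not shown here.

```python
import numpy as np

def binarize(w):
    # Sign binarization with a per-channel scale: the 1-bit weight keeps
    # the mean magnitude of the full-precision weight row.
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    return alpha * np.sign(w)

def ste_grad(grad_out, w, clip=1.0):
    # Straight-through estimator: pass gradients through sign() unchanged
    # but zero them where |w| exceeds the clip range, as in standard BNN
    # training recipes.
    return grad_out * (np.abs(w) <= clip)

w = np.random.randn(64, 128) * 0.5
wb = binarize(w)
print(np.unique(np.sign(wb)).tolist())  # [-1.0, 1.0] up to scaling
g = ste_grad(np.random.randn(64, 128), w)
```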
Due to the complex spatial structures and diverse geometric features of complex die castings, achieving high-precision and robust point cloud registration for them has been a significant challenge in the die-casting industry. Existing point cloud registration methods primarily optimize network models using well-established high-quality datasets, often neglecting practical application in real scenarios. To address this gap, this paper proposes a high-precision adaptive registration method called Multiscale Efficient Deep Closest Point (MEDPNet) and introduces a die-casting point cloud dataset, DieCastCloud, specifically designed to tackle the challenges of point cloud registration in the die-casting industry. The MEDPNet method performs coarse registration of die-casting point cloud data using the Efficient-DCP method, followed by precise registration using the Multiscale feature fusion dual-channel registration (MDR) method. We enhance the modeling capability and computational efficiency of the model by replacing the attention mechanism of the Transformer in DCP with Efficient Attention and implementing a collaborative scale mechanism through a combination of serial and parallel blocks. Additionally, we propose the MDR method, which utilizes multilayer perceptrons (MLP), the Normal Distributions Transform (NDT), and Iterative Closest Point (ICP) to achieve learnable adaptive fusion, enabling high-precision, scalable, and noise-resistant global point cloud registration. Our proposed method demonstrates excellent performance compared to state-of-the-art geometric and learning-based registration methods when applied to complex die-casting point cloud data.
https://arxiv.org/abs/2403.09996
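The classical refinement building block used inside MDR, ICP, alternates nearest-neighbour matching with closed-form SVD (Kabsch) alignment. A compact sketch of that block alone — Efficient-DCP, NDT, and the learned adaptive fusion are not represented:

```python
import numpy as np

def icp_step(src, tgt):
    # One ICP iteration: nearest-neighbour correspondences, then the
    # closed-form SVD alignment of the matched pairs.
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    nn = tgt[d2.argmin(axis=1)]
    mu_s, mu_t = src.mean(0), nn.mean(0)
    H = (src - mu_s).T @ (nn - mu_t)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return src @ R.T + t, R, t

rng = np.random.default_rng(0)
tgt = rng.standard_normal((500, 3))
src = tgt + 0.05                       # slightly offset copy
for _ in range(5):
    src, R, t = icp_step(src, tgt)
print(np.abs(src - tgt).max())         # should shrink toward 0
```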
We investigate uncertainty quantification of 6D pose estimation from keypoint measurements. Assuming unknown-but-bounded measurement noise, a pose uncertainty set (PURSE) is a subset of SE(3) that contains all possible 6D poses compatible with the measurements. Despite its simple formulation and its ability to embed uncertainty, the PURSE is difficult to manipulate and interpret due to its many abstract nonconvex polynomial constraints. An appealing simplification of the PURSE is to find its minimum enclosing geodesic ball (MEGB), i.e., a point pose estimate with a minimum worst-case error bound. We contribute (i) a dynamical-system perspective and (ii) a fast algorithm to inner-approximate the MEGB. In particular, we show that the PURSE corresponds to the feasible set of a constrained dynamical system, and this perspective allows us to design an algorithm that densely samples the boundary of the PURSE through strategic random walks. We then use the miniball algorithm to compute the MEGB of the PURSE samples, leading to an inner approximation. Our algorithm is named CLOSURE (enClosing baLl frOm purSe boUndaRy samplEs), and it enables computing a certificate of approximation tightness by calculating the relative size ratio between the inner approximation and the outer approximation. Running on a single RTX 3090 GPU, CLOSURE achieves a relative ratio of 92.8% on the LM-O object pose estimation dataset and 91.4% on the 3DMatch point cloud registration dataset, with an average runtime of less than 0.2 seconds. Obtaining comparable worst-case error bounds while running 398x and 833x faster than the outer approximation GRCC, CLOSURE enables uncertainty quantification of 6D pose estimation to be implemented in real-time robot perception applications.
https://arxiv.org/abs/2403.09990
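The final step — fitting a minimum enclosing ball to boundary samples and comparing it against an outer bound — can be approximated in Euclidean space with Ritter's two-pass algorithm. Note the paper works with geodesic balls on SE(3) and an exact miniball, so this is only a schematic analogue; the "boundary samples" and outer radius below are placeholders.

```python
import numpy as np

def ritter_ball(samples):
    # Ritter's approximate minimum enclosing ball: pick a far point pair to
    # seed the ball, then grow it to cover every remaining sample.
    p = samples[0]
    q = samples[np.argmax(np.linalg.norm(samples - p, axis=1))]
    p = samples[np.argmax(np.linalg.norm(samples - q, axis=1))]
    c, r = (p + q) / 2, np.linalg.norm(p - q) / 2
    for x in samples:
        d = np.linalg.norm(x - c)
        if d > r:                    # grow the ball to include x
            r = (r + d) / 2
            c += (1 - r / d) * (x - c)
    return c, r

boundary = np.random.randn(2000, 3)     # stand-in for PURSE boundary samples
center, r_inner = ritter_ball(boundary)
r_outer = 1.2 * r_inner                 # stand-in for an outer bound (e.g. GRCC)
print(f"tightness certificate: {r_inner / r_outer:.1%}")
```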
Human-robot collaborative applications require scene representations that are kept up-to-date and facilitate safe motions in dynamic scenes. In this letter, we present an interactive distance field mapping and planning (IDMP) framework that handles dynamic objects and collision avoidance through an efficient representation. We define interactive mapping and planning as the process of creating and updating the representation of the scene online while simultaneously planning and adapting the robot's actions based on that representation. Given depth sensor data, our framework builds a continuous field that allows querying the distance and gradient to the closest obstacle at any required position in 3D space. The key aspect of this work is an efficient Gaussian Process field that performs incremental updates and implicitly handles dynamic objects with a simple and elegant formulation based on a temporary latent model. In terms of mapping, IDMP is able to fuse point cloud data from single and multiple sensors, query the free space at any spatial resolution, and deal with moving objects without semantics. In terms of planning, IDMP allows seamless integration with gradient-based motion planners, facilitating fast re-planning for collision-free navigation. The framework is evaluated on both real and synthetic datasets. A comparison with similar state-of-the-art frameworks shows superior performance when handling dynamic objects and comparable or better accuracy of the computed distance and gradient fields. Finally, we show how the framework can be used for fast motion planning in the presence of moving objects. An accompanying video, code, and datasets are made publicly available at this https URL.
https://arxiv.org/abs/2403.09988
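A Gaussian Process distance field in its simplest batch form regresses an occupancy-style kernel over surface points and recovers distance through a log transform (the Log-GPIS construction). IDMP's incremental updates and temporary latent model for dynamic objects go well beyond this sketch, and the lengthscale and noise values here are arbitrary.

```python
import numpy as np

def gp_distance_field(surface_pts, query_pts, lengthscale=0.2, noise=1e-3):
    # Regress a field that equals 1 on the surface under an RBF kernel,
    # then invert the kernel's distance-decay to read off a distance.
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(surface_pts, surface_pts) + noise * np.eye(len(surface_pts))
    w = np.linalg.solve(K, np.ones(len(surface_pts)))   # targets: 1 on surface
    occ = np.clip(k(query_pts, surface_pts) @ w, 1e-9, 1.0)
    return lengthscale * np.sqrt(-2.0 * np.log(occ))    # inferred distance

surface = np.random.randn(200, 3)
surface /= np.linalg.norm(surface, axis=1, keepdims=True)  # unit sphere
q = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
# Both queries are at true distance 1 from the sphere; accuracy degrades
# away from the surface, as is typical for this construction.
print(gp_distance_field(surface, q))
```

The gradient of such a field, needed by gradient-based planners, follows by differentiating the same closed-form expression.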
In image-guided liver surgery, 3D-3D non-rigid registration methods play a crucial role in estimating the mapping between the preoperative model and the intraoperative surface represented as point clouds, addressing the challenge of tissue deformation. Typically, these methods incorporate a biomechanical model, represented as a finite element model (FEM), used to regularize a surface matching term. This paper introduces a novel 3D-3D non-rigid registration method. In contrast to the preceding techniques, our method uniquely incorporates the FEM within the surface matching term itself, ensuring that the estimated deformation maintains geometric consistency throughout the registration process. Additionally, we eliminate the need to determine zero-boundary conditions and applied force locations in the FEM. We achieve this by integrating soft springs into the stiffness matrix and allowing forces to be distributed across the entire liver surface. To further improve robustness, we introduce a regularization technique focused on the gradient of the force magnitudes. This regularization imposes spatial smoothness and helps prevent the overfitting of irregular noise in intraoperative data. Optimization is achieved through an accelerated proximal gradient algorithm, further enhanced by our proposed method for determining the optimal step size. Our method is evaluated and compared to both a learning-based method and a traditional method that features FEM regularization using data collected on our custom-developed phantom, as well as two publicly available datasets. Our method consistently outperforms or is comparable to the baseline techniques. Both the code and dataset will be made publicly available.
https://arxiv.org/abs/2403.09964
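The optimizer described — an accelerated proximal gradient method — has a standard form (FISTA): a gradient step on the smooth data term, a proximal step on the non-smooth regularizer, plus Nesterov extrapolation. The sketch below substitutes a toy least-squares-plus-L1 problem for the FEM-regularized matching term and uses a fixed step instead of the paper's step-size selection.

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, iters=100):
    # Accelerated proximal gradient: x_{k+1} = prox(y_k - step * grad f(y_k)),
    # with momentum extrapolation on y.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = prox_g(y - step * grad_f(y), step)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)
        x, t = x_new, t_new
    return x

# Toy instance: least squares + L1 (stand-ins for the matching term and the
# force-gradient smoothness regularizer).
A = np.random.randn(30, 10); b = np.random.randn(30)
grad = lambda x: A.T @ (A @ x - b)
prox_l1 = lambda z, s: np.sign(z) * np.maximum(np.abs(z) - 0.1 * s, 0)
x = fista(grad, prox_l1, np.zeros(10), step=1.0 / np.linalg.norm(A, 2) ** 2)
print(np.round(x, 3))
```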
Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task, which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a "semantic conflict" problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions, which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of "semantic conflict". We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance.
https://arxiv.org/abs/2403.09639
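The semantic-aware contrastive idea can be sketched as an InfoNCE variant in which every point sharing a segment ID counts as a positive, which directly removes the false negatives behind the "semantic conflict". This is a schematic loss only; the paper's segment grouping and loss details differ.

```python
import numpy as np

def segment_infonce(z1, z2, seg, tau=0.1):
    # Contrastive loss where, for each point in view 1, all view-2 points
    # in the same segment are treated as positives instead of a single
    # matched point.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    pos = seg[:, None] == seg[None, :]            # same-segment mask
    # Negative log of the probability mass assigned to the positive set.
    return -np.log((p * pos).sum(axis=1) + 1e-12).mean()

rng = np.random.default_rng(0)
feats1, feats2 = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))
segments = rng.integers(0, 8, size=64)            # 8 grouped regions
print(segment_infonce(feats1, feats2, segments))
```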
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.
https://arxiv.org/abs/2403.09631
Autonomous mobile robots are an increasingly integral part of modern factory and warehouse operations. Obstacle detection, avoidance, and path planning are critical safety-relevant tasks, which are often solved using expensive LiDAR sensors and depth cameras. We propose to use cost-effective, low-resolution ranging sensors, such as ultrasonic and infrared time-of-flight sensors, by developing VIRUS-NeRF - Vision, InfraRed, and UltraSonic based Neural Radiance Fields. Building upon Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Instant-NGP), VIRUS-NeRF incorporates depth measurements from ultrasonic and infrared sensors and utilizes them to update the occupancy grid used for ray marching. Experimental evaluation in 2D demonstrates that VIRUS-NeRF achieves mapping coverage comparable to LiDAR point clouds. Notably, in small environments its accuracy aligns with that of LiDAR measurements, while in larger ones it is bounded by the utilized ultrasonic sensors. An in-depth ablation study reveals that adding ultrasonic and infrared sensors is highly effective when dealing with sparse data and low view variation. Further, the proposed occupancy grid of VIRUS-NeRF improves mapping capabilities and increases training speed by 46% compared to Instant-NGP. Overall, VIRUS-NeRF presents a promising approach for cost-effective local mapping in mobile robotics, with potential applications in safety and navigation tasks. The code can be found at this https URL nerf.
https://arxiv.org/abs/2403.09477
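The occupancy-grid update from a low-resolution range measurement is essentially a log-odds ray update: cells between the sensor and the hit become more free, the hit cell more occupied, and a ray-marching sampler can then skip low-occupancy cells. A minimal sketch with arbitrary resolution and log-odds increments (not the paper's values):

```python
import numpy as np

def update_occupancy(grid, origin, hit, res=0.1, l_free=-0.4, l_occ=0.85):
    # Log-odds update along one ultrasonic/infrared ray.
    direction = hit - origin
    n = int(np.linalg.norm(direction) / res)
    for i in range(n):
        c = tuple(((origin + direction * i / n) / res).astype(int))
        if all(0 <= c[k] < grid.shape[k] for k in range(3)):
            grid[c] += l_free          # traversed cells become more free
    c = tuple((hit / res).astype(int))
    if all(0 <= c[k] < grid.shape[k] for k in range(3)):
        grid[c] += l_occ               # measured hit becomes more occupied
    return grid

grid = np.zeros((100, 100, 100))       # 10 m cube at 0.1 m cells
grid = update_occupancy(grid, np.array([5.0, 5.0, 1.0]), np.array([8.0, 6.0, 1.0]))
print((grid > 0).sum(), (grid < 0).sum())  # 1 occupied cell, many freed
```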
3D Gaussian splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis and 3D reconstruction. However, 3DGS heavily depends on the accurate initialization derived from Structure-from-Motion (SfM) methods. When trained with randomly initialized point clouds, 3DGS fails to maintain its ability to produce high-quality images, undergoing large performance drops of 4-5 dB in PSNR. Through extensive analysis of SfM initialization in the frequency domain and analysis of a 1D regression task with multiple 1D Gaussians, we propose a novel optimization strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting), that successfully trains 3D Gaussians from random point clouds. We show the effectiveness of our strategy through quantitative and qualitative comparisons on multiple datasets, largely improving the performance in all settings. Our project page and code can be found at this https URL.
https://arxiv.org/abs/2403.09413
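The 1D analysis the abstract mentions can be reproduced in miniature: fit a signal with a sum of 1D Gaussians starting from random means and deliberately broad initial scales. This toy (gradients derived by hand; all hyperparameters arbitrary) mirrors only the setting of that analysis, not the paper's actual optimization strategy.

```python
import numpy as np

def fit_1d_gaussians(x, y, n=8, iters=2000, lr=0.05, seed=0):
    # Fit y(x) with sum_j a_j * exp(-0.5 ((x - mu_j) / s_j)^2) by plain
    # gradient descent on the squared error, from random initialization.
    rng = np.random.default_rng(seed)
    mu = rng.uniform(x.min(), x.max(), n)
    log_s = np.full(n, np.log(1.0))          # start broad
    a = np.zeros(n)
    for _ in range(iters):
        s = np.exp(log_s)
        basis = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2)   # (M, n)
        r = basis @ a - y                                     # residual
        g_a = basis.T @ r / len(x)
        g_mu = (r[:, None] * basis * a * (x[:, None] - mu) / s**2).mean(0)
        g_ls = (r[:, None] * basis * a * ((x[:, None] - mu)**2 / s**2)).mean(0)
        a, mu, log_s = a - lr * g_a, mu - lr * g_mu, log_s - lr * g_ls
    return mu, np.exp(log_s), a

x = np.linspace(-3, 3, 200)
y = np.sin(2 * x) * np.exp(-0.2 * x**2)
mu, s, a = fit_1d_gaussians(x, y)
basis = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2)
print(np.abs(basis @ a - y).mean())  # reconstruction error after fitting
```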
Environment maps endowed with sophisticated semantics are pivotal for facilitating seamless interaction between robots and humans, enabling them to effectively carry out various tasks. Open-vocabulary maps, powered by Visual-Language models (VLMs), possess inherent advantages, including multimodal retrieval and open-set classes. However, existing open-vocabulary maps are constrained to closed indoor scenarios and VLM features, thereby diminishing their usability and inference capabilities. Moreover, the absence of topological relationships further complicates the accurate querying of specific instances. In this work, we propose OpenGraph, a representation of open-vocabulary hierarchical graph structure designed for large-scale outdoor environments. OpenGraph initially extracts instances and their captions from visual images using 2D foundation models, encoding the captions with features to enhance textual reasoning. Subsequently, 3D incremental panoramic mapping with feature embedding is achieved by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results from real public dataset SemanticKITTI demonstrate that, even without fine-tuning the models, OpenGraph exhibits the ability to generalize to novel semantic classes and achieve the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at this https URL.
https://arxiv.org/abs/2403.09412
In this work, we present PoIFusion, a simple yet effective multi-modal 3D object detection framework that fuses the information of RGB images and LiDAR point clouds at points of interest (abbreviated as PoIs). Technically, our PoIFusion follows the paradigm of query-based object detection, formulating object queries as dynamic 3D boxes. The PoIs are adaptively generated from each query box on the fly, serving as keypoints that represent a 3D object and playing the role of basic units in multi-modal fusion. Specifically, we project PoIs into the view of each modality to sample the corresponding features and integrate the multi-modal features at each PoI through a dynamic fusion block. Furthermore, the features of PoIs derived from the same query box are aggregated together to update the query feature. Our approach prevents information loss caused by view transformation and eliminates computation-intensive global attention, making the multi-modal 3D object detector more applicable. We conducted extensive experiments on the nuScenes dataset to evaluate our approach. Remarkably, our PoIFusion achieves 74.9% NDS and 73.4% mAP, setting a state-of-the-art record on the multi-modal 3D object detection benchmark. Codes will be made available at this https URL.
https://arxiv.org/abs/2403.09212
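The core mechanics of PoIFusion — deriving PoIs from a query box and sampling per-modality features at their projections — can be sketched as follows. The fixed PoI layout, nearest-pixel sampling, and concatenation-as-fusion are simplifications of the paper's adaptive PoI generation and learned dynamic fusion block.

```python
import numpy as np

def box_pois(center, size, yaw):
    # Points of interest derived from a query box: the center plus the six
    # face centers, rotated into the box frame (a simple fixed choice).
    offsets = np.array([[0, 0, 0], [1, 0, 0], [-1, 0, 0], [0, 1, 0],
                        [0, -1, 0], [0, 0, 1], [0, 0, -1]], float) * size / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return center + offsets @ R.T

def fuse_at_pois(pois, img_feat, K, bev_feat, bev_extent=50.0):
    # Sample the image feature at each PoI's projection and the BEV LiDAR
    # feature at its ground-plane cell, then fuse by concatenation.
    uv = (K @ pois.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    H, W, _ = img_feat.shape
    uv = np.clip(uv, 0, [W - 1, H - 1])
    f_img = img_feat[uv[:, 1], uv[:, 0]]
    ij = np.clip(((pois[:, :2] + bev_extent) / (2 * bev_extent)
                  * bev_feat.shape[0]).astype(int), 0, bev_feat.shape[0] - 1)
    f_bev = bev_feat[ij[:, 0], ij[:, 1]]
    return np.concatenate([f_img, f_bev], axis=1)

K = np.array([[500., 0, 320], [0, 500., 180], [0, 0, 1]])
pois = box_pois(np.array([0.0, 0.0, 10.0]), np.array([4.0, 2.0, 1.5]), 0.3)
fused = fuse_at_pois(pois, np.random.rand(360, 640, 64), K, np.random.rand(128, 128, 64))
print(fused.shape)  # (7, 128) per-PoI fused features
```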