Deep neural networks have made significant advances in accurately estimating scene flow from point clouds, which is vital for many applications such as video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To bridge this gap, this work introduces adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples cause up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on the average end-point error. Analyzing the success and failure of these attacks on scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.
https://arxiv.org/abs/2404.13621
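For illustration, here is a minimal PGD-style sketch of such a white-box attack: ascend the gradient of the average end-point error with respect to a perturbation of the source point cloud, clipped to an L-infinity budget. The `model(pc1, pc2)` interface, the budget `eps`, and the step schedule are assumptions for the sketch, not the paper's actual attack parameters.

```python
import torch

def pgd_scene_flow_attack(model, pc1, pc2, gt_flow, eps=0.05, alpha=0.01, steps=10):
    """Perturb the source point cloud pc1 (N, 3) inside an L-inf ball of radius
    eps so that the predicted flow's average end-point error (EPE) is maximized."""
    delta = torch.zeros_like(pc1, requires_grad=True)
    for _ in range(steps):
        pred_flow = model(pc1 + delta, pc2)                   # forward pass
        epe = torch.norm(pred_flow - gt_flow, dim=-1).mean()  # average EPE
        epe.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                # gradient ascent step
            # Zeroing all but one column of delta here would give the
            # single-dimension attack variant discussed in the abstract.
            delta.clamp_(-eps, eps)                           # respect the budget
            delta.grad.zero_()
    return (pc1 + delta).detach()
```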
Despite considerable progress in point cloud geometry compression, effectively compressing large-scale scenes with sparse surfaces remains a challenge. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world applications. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high performance and extremely low decoding latency simultaneously. Inspired by the conventional Trisoup codec, a point-model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates at fairly high speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90$\sim$160$\times$ faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.
https://arxiv.org/abs/2404.13550
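A minimal sketch of the local-window idea: sample window centers, group k nearest neighbors around each, and attention-pool every window into a single feature vector. All module names, sizes, and the random center selection (standing in for a proper sampler) are illustrative assumptions, not Pointsoup's implementation.

```python
import torch

def group_local_windows(points, num_windows=256, k=32):
    """Group a point cloud (N, 3) into local windows: pick window centers,
    then take the k nearest neighbors of each center (center-normalized)."""
    idx = torch.randperm(points.shape[0])[:num_windows]  # random stand-in for FPS
    centers = points[idx]                                # (W, 3)
    d = torch.cdist(centers, points)                     # (W, N) pairwise distances
    knn = d.topk(k, largest=False).indices               # (W, k) nearest neighbors
    windows = points[knn] - centers[:, None, :]          # local coordinates
    return centers, windows

class WindowAttentionPool(torch.nn.Module):
    """Attention-pool each window of k local points into one 'skin' feature."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(3, dim)
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = torch.nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, windows):                          # (W, k, 3)
        x = self.proj(windows)
        q = self.query.expand(windows.shape[0], -1, -1)  # one learned query per window
        feat, _ = self.attn(q, x, x)                     # (W, 1, dim)
        return feat.squeeze(1)                           # (W, dim) window features
```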
Learning from Interactive Demonstrations has revolutionized the way non-expert humans teach robots. It is enough to kinesthetically move the robot around to teach pick-and-place, dressing, or cleaning policies. However, the main challenge is correctly generalizing to novel situations, e.g., different surfaces to clean or different arm postures to dress. This article proposes a novel task parameterization and generalization method to transport the original robot policy, i.e., position, velocity, orientation, and stiffness. Unlike the state of the art, only a set of points is tracked during the demonstration and the execution, e.g., a point cloud of the surface to clean. We then propose to fit a non-linear transformation that deforms the space, and with it the original policy, using the paired source and target point sets. The use of function approximators like Gaussian Processes allows us to generalize, or transport, the policy from every space location while estimating the uncertainty of the resulting policy due to the limited points in the task parameterization point set and the reduced number of demonstrations. We compare the algorithm's performance with state-of-the-art task parameterization alternatives and analyze the effect of different function approximators. We also validate the algorithm on robot manipulation tasks, i.e., dressing arms in different postures, reshelving products in different locations, and cleaning surfaces of different shapes.
https://arxiv.org/abs/2404.13458
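A minimal sketch of the transport step using a Gaussian Process, assuming scikit-learn: fit the space deformation from the paired source and target point sets, then move the demonstrated positions through it while reading out the predictive uncertainty. The kernel choice and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def transport_policy(source_pts, target_pts, policy_positions):
    """Fit a non-linear space deformation from paired source/target point sets
    (M, 3 each) and transport demonstrated positions (P, 3) through it,
    returning per-position predictive uncertainty as well."""
    displacement = target_pts - source_pts              # (M, 3) regression targets
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1) + WhiteKernel(1e-4))
    gp.fit(source_pts, displacement)                    # learn the deformation field
    delta, std = gp.predict(policy_positions, return_std=True)
    return policy_positions + delta, std                # transported path + uncertainty
```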
We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first obtain a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and formulate the probability of each face existing on the desired mesh in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topologies in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point clouds and multi-view images, using gradient-based optimization. The source code and full paper are available at: this https URL.
https://arxiv.org/abs/2404.13445
Despite recent advances in reconstructing an organic model with the neural signed distance function (SDF), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each approximately developable even around the feature line. Our approach, named NeurCADRecon, is self-supervised, and its loss includes a developability term to encourage the Gaussian curvature toward 0 while ensuring fidelity to the input points. Noticing that the Gaussian curvature is non-zero at tip points, we introduce a double-trough curve to tolerate the existence of these tip points. Furthermore, we develop a dynamic sampling strategy to deal with situations where the given points are incomplete or too sparse. Since our resulting neural SDFs can clearly manifest sharp feature points/lines, one can easily extract the feature-aligned triangle mesh from the SDF and then decompose it into smooth surface patches, greatly reducing the difficulty of recovering the parametric CAD design. A comprehensive comparison with existing state-of-the-art methods shows the significant advantage of our approach in reconstructing faithful CAD shapes.
https://arxiv.org/abs/2404.13420
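To illustrate what a double-trough penalty on Gaussian curvature could look like, here is a hypothetical form: the deep trough at K = 0 encourages developability, while a shallower trough near a tip-curvature value avoids over-penalizing genuine tip points. The functional form and constants are assumptions, not the paper's exact curve.

```python
import torch

def double_trough_loss(K, k_tip=50.0, w=0.1):
    """Hypothetical double-trough penalty on per-point Gaussian curvature K.
    The deep trough at K = 0 drives patches toward developability; a shallower
    trough near |K| = k_tip stops genuine tip points from being over-penalized.
    k_tip and w are illustrative values, not the paper's."""
    trough_flat = K ** 2                        # developability term, minimum at K = 0
    trough_tip = w * (K.abs() - k_tip) ** 2     # tolerance well for tip points
    return torch.minimum(trough_flat, trough_tip).mean()
```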
Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image-aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections, leading to a deficiency of spatial structure information. Concurrently, the absence of integration and unification between the geometric and semantic representations of the scene culminates in a diminished level of 3D scene understanding. In this paper, we demonstrate the importance of having a unified scene representation and reconstruction framework, which is essential for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that our Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8%). When applied to LLMs, our Uni3DR^2-LLM exhibits superior performance over the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
https://arxiv.org/abs/2404.13044
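The 2D-to-3D lifting step that such pipelines rely on can be sketched as projecting each 3D point into the image plane and sampling pixel-aligned features, e.g., from a CLIP or SAM backbone. The shapes and camera conventions below are illustrative assumptions.

```python
import torch

def lift_2d_features_to_points(points, feat_2d, K, T):
    """Sample pixel-aligned 2D features for 3D points.
    points: (N, 3) world coordinates; feat_2d: (C, H, W) backbone features;
    K: (3, 3) camera intrinsics; T: (4, 4) world-to-camera extrinsics."""
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)  # homogeneous
    cam = (T @ pts_h.T).T[:, :3]                            # camera-frame coordinates
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)           # pixel coordinates
    C, H, W = feat_2d.shape
    grid = uv.clone()
    grid[:, 0] = 2 * uv[:, 0] / (W - 1) - 1                 # normalize x to [-1, 1]
    grid[:, 1] = 2 * uv[:, 1] / (H - 1) - 1                 # normalize y to [-1, 1]
    sampled = torch.nn.functional.grid_sample(
        feat_2d[None], grid.view(1, 1, -1, 2), align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                               # (N, C) per-point features
```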
Part segmentation is a crucial task for 3D curvilinear structures like neuron dendrites and blood vessels, enabling the analysis of dendritic spines and aneurysms with scientific and clinical significance. However, their diversely winding morphology poses a generalization challenge to existing deep learning methods, which leads to labor-intensive manual correction. In this work, we propose FreSeg, a framework for part segmentation of 3D curvilinear structures. With a Frenet-frame-based point cloud transformation, it enables models to learn more generalizable features and achieve significant performance improvements on tasks involving elongated and curvy geometries. We evaluate FreSeg on 2 datasets: 1) DenSpineEM, an in-house dataset for dendritic spine segmentation, and 2) IntrA, a public 3D dataset for intracranial aneurysm segmentation. Further, we will release the DenSpineEM dataset, which includes roughly 6,000 spines from 69 dendrites from 3 public electron microscopy (EM) datasets, to foster the development of effective dendritic spine instance extraction methods and, consequently, large-scale connectivity analysis to better understand mammalian brains.
https://arxiv.org/abs/2404.14435
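A minimal sketch of a Frenet-frame-based transformation, assuming a centerline is available: estimate tangent/normal/binormal frames along the curve, then express each point as (arc length, normal offset, binormal offset), which effectively straightens a winding structure. The finite-difference frame estimation is an illustrative simplification.

```python
import numpy as np

def frenet_frames(centerline):
    """Approximate Frenet frames (T, N, B) along a (M, 3) polyline centerline."""
    T = np.gradient(centerline, axis=0)
    T /= np.linalg.norm(T, axis=1, keepdims=True) + 1e-8
    dT = np.gradient(T, axis=0)
    N = dT - (dT * T).sum(1, keepdims=True) * T          # remove tangential part
    N /= np.linalg.norm(N, axis=1, keepdims=True) + 1e-8
    B = np.cross(T, N)
    return T, N, B

def straighten(points, centerline):
    """Express each point (N, 3) in the local frame of its nearest centerline
    sample, 'unwinding' a curvy structure into a canonical straightened space."""
    T, N, B = frenet_frames(centerline)
    idx = np.linalg.norm(points[:, None] - centerline[None], axis=2).argmin(1)
    local = points - centerline[idx]
    # Arc length along the curve plus in-plane (normal, binormal) offsets.
    s = np.concatenate([[0.0], np.cumsum(
        np.linalg.norm(np.diff(centerline, axis=0), axis=1))])
    return np.stack([s[idx],
                     (local * N[idx]).sum(1),
                     (local * B[idx]).sum(1)], axis=1)
```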
We present a novel method to generate human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are either specialized to a single setting, require vast amounts of high-quality and diverse training data, or are unconditional models that do not integrate scene or other contextual information. As a consequence, they have limited applicability and rely on costly training data. To address these limitations, we propose a new method, dubbed Purposer, based on neural discrete representation learning. Our model is capable of exploiting, in a flexible manner, different types of information already present in open-access large-scale datasets such as AMASS. First, we encode unconditional human motion into a discrete latent space. Second, an autoregressive generative model, conditioned with key contextual information, either with prompting or additive tokens, and trained for next-step prediction in this space, synthesizes sequences of latent indices. We further design a novel conditioning block to handle future conditioning information in such a causal model by using a network with two branches to compute separate stacks of features. In this manner, Purposer can generate realistic motion sequences in diverse test scenes. Through exhaustive evaluation, we demonstrate that our multi-contextual solution outperforms existing specialized approaches for specific contextual information, both in terms of quality and diversity. Our model is trained with short sequences, but a byproduct of being able to use various conditioning signals is that at test time different combinations can be used to chain short sequences together and generate long motions within a context scene.
https://arxiv.org/abs/2404.12942
As point clouds provide a natural and flexible representation usable in myriad applications (e.g., robotics and self-driving cars), the ability to synthesize point clouds for analysis becomes crucial. Recently, Xie et al. proposed a generative model for unordered point sets in the form of an energy-based model (EBM). Despite the model achieving an impressive performance for point cloud generation, one separate model needs to be trained for each category to capture the complex point set distributions. Besides, their method is unable to classify point clouds directly and requires additional fine-tuning for classification. One interesting question is: Can we train a single network for a hybrid generative and discriminative model of point clouds? A similar question has recently been answered in the affirmative for images, introducing the framework of Joint Energy-based Model (JEM), which achieves high performance in image classification and generation simultaneously. This paper proposes GDPNet, the first hybrid Generative and Discriminative PointNet that extends JEM for point cloud classification and generation. Our GDPNet retains the strong discriminative power of modern PointNet classifiers, while generating point cloud samples rivaling state-of-the-art generative approaches.
https://arxiv.org/abs/2404.12925
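The JEM idea can be summarized in a few lines: the same classifier logits yield both the discriminative loss and an energy for the marginal density, and samples are drawn by Langevin dynamics on that energy. The sketch below shows the two readings and a basic SGLD sampler; step sizes are illustrative, and GDPNet's point cloud specifics are not reproduced here.

```python
import torch

def jem_terms(logits, labels):
    """A classifier's logits, read two ways (the JEM trick): softmax over logits
    gives p(y|x) for the discriminative loss, while E(x) = -logsumexp(logits)
    defines an energy for the marginal p(x)."""
    clf_loss = torch.nn.functional.cross_entropy(logits, labels)
    energy = -torch.logsumexp(logits, dim=1)      # lower energy = more likely x
    return clf_loss, energy

def sgld_sample(model, x_init, steps=20, step_size=1.0, noise_std=0.01):
    """Draw approximate samples from p(x) via Stochastic Gradient Langevin
    Dynamics on the energy; in GDPNet's setting x would be a point cloud tensor."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        energy = -torch.logsumexp(model(x), dim=1).sum()
        grad, = torch.autograd.grad(energy, x)
        x = (x - step_size * grad + noise_std * torch.randn_like(x)
             ).detach().requires_grad_(True)      # descend energy, add noise
    return x.detach()
```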
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and release Scatter-KITTI and Scatter-nuScenes, the first works to utilize image-segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of the annotated data, and on the nuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% of labeled points.
https://arxiv.org/abs/2404.12861
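The label-transfer step can be sketched as a standard camera projection: transform LiDAR points into the camera frame with the extrinsics, project with the intrinsics, and read the semantic label at the landing pixel. The matrix conventions below are assumptions for illustration.

```python
import numpy as np

def image_labels_to_lidar(points, label_img, K, T_cam_from_lidar):
    """Map per-pixel semantic labels onto LiDAR points. points: (N, 3) in the
    LiDAR frame; label_img: (H, W) integer labels; K: (3, 3) intrinsics;
    T_cam_from_lidar: (4, 4) extrinsics. Points behind the camera or outside
    the image get label -1 (unlabeled)."""
    N = points.shape[0]
    pts_h = np.hstack([points, np.ones((N, 1))])
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]          # camera-frame coordinates
    z = cam[:, 2]
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-6)      # perspective division
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = label_img.shape
    labels = np.full(N, -1, dtype=np.int64)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[valid] = label_img[v[valid], u[valid]]        # read label at pixel
    return labels
```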
The sim-to-real gap poses a significant challenge in RL-based multi-agent exploration due to scene quantization and action discretization. Existing platforms suffer from inefficient sampling and from a lack of diversity in Multi-Agent Reinforcement Learning (MARL) algorithms across different scenarios, restraining their widespread application. To fill these gaps, we propose MAexp, a generic platform for multi-agent exploration that integrates a broad range of state-of-the-art MARL algorithms and representative scenarios. Moreover, we employ point clouds to represent our exploration scenarios, leading to high-fidelity environment mapping and a sampling speed approximately 40 times faster than existing platforms. Furthermore, equipped with an attention-based Multi-Agent Target Generator and a Single-Agent Motion Planner, MAexp can work with arbitrary numbers of agents and accommodate various types of robots. Extensive experiments are conducted to establish the first benchmark featuring several high-performance MARL algorithms across typical scenarios for robots with continuous actions, which highlights the distinct strengths of each algorithm in different scenarios.
https://arxiv.org/abs/2404.12824
Multi-task networks can potentially improve performance and computational efficiency compared to single-task networks, facilitating online deployment. However, current multi-task architectures in point cloud perception combine multiple task-specific point cloud representations, each requiring a separate feature encoder and making the network structures bulky and slow. We propose PAttFormer, an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds that only relies on a point-based representation. The network builds on transformer-based feature encoders using neighborhood attention and grid-pooling and a query-based detection decoder using a novel 3D deformable-attention detection head design. Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for multiple task-specific point cloud representations, resulting in a network that is 3x smaller and 1.4x faster while achieving competitive performance on the nuScenes and KITTI benchmarks for autonomous driving perception. Our extensive evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP on the nuScenes benchmark compared to the single-task models.
https://arxiv.org/abs/2404.12798
LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D moving object segmentation method with a motion-aware state space model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps, and utilizes an improved state space model to represent these motion differences, effectively modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code of this work will be made publicly available at this https URL.
https://arxiv.org/abs/2404.12794
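For orientation, here is the bare discrete state space recurrence that such models build on; selective variants like Mamba (and, per the abstract, the improved MSSM) make the parameters input-dependent, which this fixed-parameter sketch deliberately omits.

```python
import torch

def ssm_scan(u, A, B, C):
    """Minimal discrete state space recurrence: x_t = A * x_{t-1} + B * u_t,
    y_t = C * x_t, with diagonal parameters. u: (T, D) input sequence;
    A, B, C: (D,) per-channel parameters. The hidden state x carries temporal
    information across scans, which is the property MOS methods exploit."""
    T, D = u.shape
    x = torch.zeros(D)
    ys = []
    for t in range(T):
        x = A * x + B * u[t]        # state update
        ys.append(C * x)            # readout
    return torch.stack(ys)          # (T, D) output sequence
```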
Facial biometrics are an essential component of smartphones to ensure reliable and trustworthy authentication. However, face biometric systems are vulnerable to Presentation Attacks (PAs), and the availability of more sophisticated presentation attack instruments, such as 3D silicone face masks, allows attackers to deceive face recognition systems easily. In this work, we propose a novel Presentation Attack Detection (PAD) algorithm based on 3D point clouds captured using the frontal camera of a smartphone. The proposed PAD algorithm, VoxAtnNet, voxelizes the 3D point clouds to preserve their spatial structure. The voxelized 3D samples are then used to train a novel convolutional attention network to detect PAs on the smartphone. Extensive experiments were carried out on a newly constructed 3D face point cloud dataset comprising bona fide samples and two different 3D PAIs (3D silicone face mask and wrap photo mask), resulting in 3,480 samples. The performance of the proposed method was compared with existing methods to benchmark the detection performance using three different evaluation protocols. The experimental results demonstrate the improved performance of the proposed method in detecting both known and unknown face presentation attacks.
https://arxiv.org/abs/2404.12680
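The voxelization step can be sketched in a few lines: normalize the point cloud into a unit cube and mark occupied cells, yielding a grid a 3D convolutional attention network can consume. The grid resolution is an illustrative choice.

```python
import numpy as np

def voxelize(points, grid=32):
    """Convert a 3D face point cloud (N, 3) into a binary occupancy voxel grid
    that preserves spatial structure for a 3D convolutional network."""
    mins, maxs = points.min(0), points.max(0)
    scale = (points - mins) / np.maximum(maxs - mins, 1e-6)   # normalize to [0, 1]
    idx = np.clip((scale * grid).astype(int), 0, grid - 1)    # cell indices
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0                # mark occupied cells
    return vox
```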
3D Gaussian Splatting has recently been embraced as a versatile and effective method for scene reconstruction and novel view synthesis, owing to its high-quality results and compatibility with hardware rasterization. Despite its advantages, Gaussian Splatting's reliance on high-quality point cloud initialization by Structure-from-Motion (SFM) algorithms is a significant limitation to be overcome. To this end, we investigate various initialization strategies for Gaussian Splatting and delve into how volumetric reconstructions from Neural Radiance Fields (NeRF) can be utilized to bypass the dependency on SFM data. Our findings demonstrate that random initialization can perform much better if carefully designed and that by employing a combination of improved initialization strategies and structure distillation from low-cost NeRF models, it is possible to achieve equivalent results, or at times even superior, to those obtained from SFM initialization.
https://arxiv.org/abs/2404.12547
In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects, and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening, reporting a 51% and 82% success rate respectively. Code of our framework as well as videos are available on: this https URL.
https://arxiv.org/abs/2404.12440
With the rise of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application to 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning each category a label point with XYZ coordinates; the final prediction is then chosen based on the label point closest to the predictions. To break the limitation of the fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), aimed at improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework, so other tasks or datasets can be seamlessly introduced through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multiple datasets. Our PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts.
https://arxiv.org/abs/2404.12352
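The nearest-label-point readout described above is simple to state in code: given one fixed XYZ label point per category, each predicted point takes the class of its closest label point. This is a sketch of the readout only, with assumed shapes.

```python
import torch

def segment_by_label_points(pred_points, label_points):
    """PIC-style segmentation readout. label_points: (K, 3), one XYZ label
    point per category; pred_points: (N, 3) model predictions. Returns the
    per-point class index of the nearest label point."""
    d = torch.cdist(pred_points, label_points)   # (N, K) distances to label points
    return d.argmin(dim=1)                       # class of the closest label point
```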
Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field of view cameras under 180 degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.
https://arxiv.org/abs/2404.12339
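A minimal sketch of the double (similar and opposing) distance-matrix idea: build a descriptor distance matrix between query and reference sequences, then score both forward diagonals (same direction) and anti-diagonals (opposing direction), keeping the better hypothesis. The scoring and window length are illustrative simplifications of the paper's sequence matcher.

```python
import numpy as np

def double_distance_match(query_desc, ref_desc, seq_len=5):
    """Match the latest query sub-sequence against a reference traverse in both
    directions. query_desc: (Q, D) and ref_desc: (R, D) place descriptors,
    with Q >= seq_len. Returns (score, reference index, inferred direction)."""
    D = np.linalg.norm(query_desc[:, None] - ref_desc[None], axis=2)  # (Q, R)
    Q, R = D.shape
    q0 = Q - seq_len                              # most recent query window
    best = (np.inf, None, None)
    for r in range(R - seq_len + 1):
        fwd = np.mean([D[q0 + i, r + i] for i in range(seq_len)])               # similar
        rev = np.mean([D[q0 + i, r + seq_len - 1 - i] for i in range(seq_len)]) # opposing
        score, direction = min((fwd, "similar"), (rev, "opposing"))
        if score < best[0]:
            best = (score, r, direction)
    return best
```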
Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses representative current 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: this http URL.
https://arxiv.org/abs/2404.12281
Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping, and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes, and spans 339 km$^2$ across 449 distinct monospecific forests, and is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, the recommended evaluation methodology, and establish a baseline performance from both 3D and 2D modalities.
https://arxiv.org/abs/2404.12064