This paper presents a novel scheme to efficiently compress Light Detection and Ranging~(LiDAR) point clouds, enabling high-precision 3D scene archives that pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images~(RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation~(INR)--based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates, as well as in decoding latency.
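As a concrete (and heavily simplified) illustration of the pixel-wise INR idea, the sketch below overfits a small coordinate MLP to a single depth image so that the network weights act as the compressed representation; the architecture, training schedule, and the subsequent pruning/quantization step are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a pixel-wise implicit neural representation (INR):
# overfit a small MLP to one range image so the weights become the code.
import torch
import torch.nn as nn

class PixelINR(nn.Module):
    def __init__(self, hidden=64, layers=3):
        super().__init__()
        mods, in_dim = [], 2                      # input: normalized (row, col)
        for _ in range(layers):
            mods += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        mods.append(nn.Linear(hidden, 1))          # output: depth at that pixel
        self.net = nn.Sequential(*mods)

    def forward(self, coords):                     # coords: (N, 2) in [-1, 1]
        return self.net(coords)

def fit_range_image(depth, steps=2000, lr=1e-3):
    """Overfit the INR to one HxW depth image (float32 tensor)."""
    h, w = depth.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
    target = depth.reshape(-1, 1)
    model = PixelINR()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(coords), target)
        loss.backward()
        opt.step()
    return model   # prune/quantize the weights to obtain the final bitstream
```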
https://arxiv.org/abs/2504.17229
We present an open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping. The system uses a structure-from-motion approach to reconstruct 3D representations of the plants via point clouds. Using wheat as an example, we demonstrate how various phenotypic traits can be computed easily from the point clouds. These include standard measurements such as plant height and radius, as well as features that would be more cumbersome to measure by hand, such as leaf angles and convex hull. We further demonstrate the utility of the system through the investigation of specific metrics that may yield objective classifications of erectophile versus planophile wheat canopy architectures.
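To illustrate how such traits can be read off a reconstructed point cloud, a minimal sketch is shown below; the trait definitions are simplified assumptions and may differ from the paper's exact formulas.

```python
# Simple phenotypic traits from a plant point cloud (Nx3 numpy array, z up).
import numpy as np
from scipy.spatial import ConvexHull

def plant_traits(points: np.ndarray) -> dict:
    z = points[:, 2]
    height = z.max() - z.min()                      # plant height along z
    xy = points[:, :2] - points[:, :2].mean(axis=0) # center in the ground plane
    radius = np.linalg.norm(xy, axis=1).max()       # max horizontal extent
    hull = ConvexHull(points)                       # 3D convex hull of the canopy
    return {"height": height, "radius": radius,
            "hull_volume": hull.volume, "hull_area": hull.area}

# Example with a synthetic cloud:
# traits = plant_traits(np.random.rand(5000, 3) * [0.3, 0.3, 1.0])
```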
https://arxiv.org/abs/2504.16840
The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown significant potential in enabling robots to grasp target objects. However, most existing methods are based on the point clouds (2.5D points) generated from single-view depth images. These point clouds capture only one surface side of the object and thus provide incomplete geometric information, which misleads the grasping algorithm's judgment of the target object's shape and results in low grasping accuracy. Humans can accurately grasp objects from a single view by leveraging their geometric experience to estimate object shapes. Inspired by this, we propose a novel 6-DoF grasping framework that converts point completion results into object shape features to train the 6-DoF grasp network. Here, point completion generates approximately complete points from the 2.5D points, analogous to human geometric experience, and converting them into shape features is how we exploit this information to improve grasp efficiency. Furthermore, due to the gap between network generation and actual execution, we integrate a score filter into our framework to select more executable grasp proposals for the real robot. This enables our method to maintain high grasp quality from any camera viewpoint. Extensive experiments demonstrate that utilizing complete point features enables the generation of significantly more accurate grasp proposals, and that the inclusion of a score filter greatly enhances the credibility of real-world robot grasping. Our method achieves a success rate 17.8\% higher than the state-of-the-art method in real-world experiments.
https://arxiv.org/abs/2504.16320
Shape completion networks have recently been used in real-world robotic experiments to complete missing/hidden information in environments where objects are observed from only one or a few viewpoints and self-occlusions are bound to occur. Nowadays, most approaches rely on deep neural networks that handle rich 3D point cloud data, leading to more precise and realistic object geometries. However, these models still suffer from inaccuracies due to their nondeterministic/stochastic inference, which can lead to poor performance in grasping scenarios where such errors compound into unsuccessful grasps. We present an approach to calculate the uncertainty of a 3D shape completion model during inference on single-view point clouds of an object on a tabletop. In addition, we propose an update to the quality score of grasp pose algorithms that incorporates the uncertainty of the completed point cloud within each grasp candidate. To test our full pipeline, we perform real-world grasping with a 7-DoF robotic arm and a 2-finger gripper on a large set of household objects and compare against previous approaches that do not measure uncertainty. Our approach ranks grasp quality better, leading to a higher grasp success rate for the five top-ranked grasp candidates compared to the state of the art.
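A hedged sketch of the general idea follows: per-point uncertainty is estimated from repeated stochastic forward passes of a completion network, and a grasp candidate's quality score is penalized by the uncertainty of completed points near the grasp. The Monte Carlo estimator and the penalty form are illustrative assumptions, not necessarily the authors' exact procedure.

```python
# Estimate per-point uncertainty of a stochastic shape-completion network and
# down-weight grasp candidates that rely on uncertain regions.
import numpy as np

def completion_uncertainty(complete_fn, partial_points, n_samples=10):
    """complete_fn: partial (N,3) -> completed (M,3); stochastic per call."""
    samples = np.stack([complete_fn(partial_points) for _ in range(n_samples)])
    mean_cloud = samples.mean(axis=0)                    # (M, 3)
    per_point_var = samples.var(axis=0).sum(axis=-1)     # (M,) total variance
    return mean_cloud, per_point_var

def adjusted_grasp_score(base_score, grasp_center, cloud, per_point_var,
                         radius=0.05, weight=1.0):
    near = np.linalg.norm(cloud - grasp_center, axis=1) < radius
    local_unc = per_point_var[near].mean() if near.any() else per_point_var.mean()
    return base_score - weight * local_unc               # uncertain regions rank lower
```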
https://arxiv.org/abs/2504.16183
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: this https URL.
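The low-rank adaptation component can be illustrated with a minimal sketch: a frozen pretrained linear projection is augmented with a trainable low-rank update, as is typically done in the most parameter-heavy layers of a transformer. The rank, scaling, and placement below are assumptions, and the multi-scale token selection is omitted.

```python
# Minimal LoRA sketch: frozen base projection + trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)                # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: attn.qkv = LoRALinear(attn.qkv, rank=8)
# Only lora_a / lora_b parameters are trained during fine-tuning.
```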
https://arxiv.org/abs/2504.16023
Object classification models utilizing point cloud data are fundamental for 3D media understanding, yet they often struggle with unseen or out-of-distribution (OOD) scenarios. Existing point cloud unsupervised domain adaptation (UDA) methods typically employ a multi-task learning (MTL) framework that combines primary classification tasks with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions. However, our further experiments demonstrate that not all gradients from self-supervision tasks are beneficial and some may negatively impact the classification performance. In this paper, we propose a novel solution, termed Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts. Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels. Leveraging this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients are not beneficial for the classification. Our approach is scalable, introducing modest computational overhead, and can be integrated into all the point cloud UDA MTL frameworks. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches. In addition, we provide a new perspective on understanding the UDA problem through back-propagation analysis.
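A minimal sketch of the skewness-based scoring idea follows; the per-sample saliency values, the threshold rule, and the filtering policy are illustrative assumptions rather than the paper's exact mechanism.

```python
# Score each sample by the skewness of its 3D saliency map and keep only
# samples below a threshold for the self-supervision gradient update.
import numpy as np
from scipy.stats import skew

def saliency_skewness(saliency: np.ndarray) -> float:
    """saliency: (N,) per-point saliency values for one sample."""
    return float(skew(saliency))

def select_samples(saliency_maps, threshold=1.0):
    """Return indices of samples whose self-supervision gradients are kept."""
    scores = np.array([saliency_skewness(s) for s in saliency_maps])
    return np.where(scores < threshold)[0]
```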
https://arxiv.org/abs/2504.15796
This work introduces a novel method for surface normal estimation from rectified stereo image pairs, leveraging affine transformations derived from disparity values to achieve fast and accurate results. We demonstrate how the rectification of stereo image pairs simplifies the process of surface normal estimation by reducing computational complexity. To address noise reduction, we develop a custom algorithm inspired by convolutional operations, tailored to process disparity data efficiently. We also introduce adaptive heuristic techniques for efficiently detecting connected surface components within the images, further improving the robustness of the method. By integrating these methods, we construct a surface normal estimator that is both fast and accurate, producing a dense, oriented point cloud as the final output. Our method is validated using both simulated environments and real-world stereo images from the Middlebury and Cityscapes datasets, demonstrating significant improvements in real-time performance and accuracy when implemented on a GPU. Upon acceptance, the shader source code will be made publicly available to facilitate further research and reproducibility.
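For reference, a standard gradient-based baseline (not the paper's affine formulation) is sketched below: disparity is unprojected to a point map and per-pixel normals are taken from the cross product of neighboring point differences. The intrinsics f, cx, cy and the stereo baseline are assumed to be known.

```python
# Baseline normal estimation from a disparity map via neighbor differences.
import numpy as np

def normals_from_disparity(disp, f, cx, cy, baseline):
    h, w = disp.shape
    z = np.where(disp > 0, f * baseline / np.maximum(disp, 1e-6), np.nan)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([(u - cx) * z / f, (v - cy) * z / f, z], axis=-1)   # (H, W, 3)
    dx = pts[:, 2:, :] - pts[:, :-2, :]            # horizontal neighbor difference
    dy = pts[2:, :, :] - pts[:-2, :, :]            # vertical neighbor difference
    n = np.cross(dx[1:-1, :, :], dy[:, 1:-1, :])   # (H-2, W-2, 3) unnormalized normals
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    return n
```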
https://arxiv.org/abs/2504.15121
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
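The unprojection step that PCDController builds on can be sketched in a few lines under a standard pinhole camera model; the variable names and validity filtering below are assumptions, not code from the paper.

```python
# Lift a monocular depth map to a 3D point cloud with pinhole intrinsics.
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth -> (M, 3) points in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[np.isfinite(pts).all(axis=1) & (pts[:, 2] > 0)]  # drop invalid pixels
```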
https://arxiv.org/abs/2504.14899
LiDAR place recognition (LPR) plays a vital role in autonomous navigation. However, existing LPR methods struggle to maintain robustness under adverse weather conditions such as rain, snow, and fog, where weather-induced noise and point cloud degradation impair LiDAR reliability and perception accuracy. To tackle these challenges, we propose an Iterative Task-Driven Framework (ITDNet), which integrates a LiDAR Data Restoration (LDR) module and a LiDAR Place Recognition (LPR) module through an iterative learning strategy. These modules are jointly trained end-to-end, with alternating optimization to enhance performance. The core rationale of ITDNet is to leverage the LDR module to recover the corrupted point clouds while preserving structural consistency with clean data, thereby improving LPR accuracy in adverse weather. Simultaneously, the LPR task provides feature pseudo-labels to guide the LDR module's training, aligning it more effectively with the LPR task. To achieve this, we first design a task-driven LPR loss and a reconstruction loss to jointly supervise the optimization of the LDR module. Furthermore, for the LDR module, we propose a Dual-Domain Mixer (DDM) block for frequency-spatial feature fusion and a Semantic-Aware Generator (SAG) block for semantic-guided restoration. In addition, for the LPR module, we introduce a Multi-Frequency Transformer (MFT) block and a Wavelet Pyramid NetVLAD (WPN) block to aggregate multi-scale, robust global descriptors. Finally, extensive experiments on the Weather-KITTI, Boreas, and our proposed Weather-Apollo datasets demonstrate that ITDNet outperforms existing LPR methods, achieving state-of-the-art performance in adverse weather. The datasets and code will be made publicly available at this https URL.
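A high-level sketch of what such alternating optimization might look like is given below; the loss composition and update schedule are assumed for illustration and do not reproduce the authors' training script.

```python
# Alternating updates: even steps train the restoration module (LDR) under a
# reconstruction loss plus a task-driven LPR loss; odd steps train the place
# recognition module (LPR) on restored point clouds.
import torch

def train_step(step, ldr, lpr, opt_ldr, opt_lpr, noisy_pc, clean_pc, labels,
               recon_loss, lpr_loss):
    if step % 2 == 0:                              # update LDR, keep LPR fixed
        restored = ldr(noisy_pc)
        loss = recon_loss(restored, clean_pc) + lpr_loss(lpr(restored), labels)
        opt_ldr.zero_grad()
        loss.backward()
        opt_ldr.step()
    else:                                          # update LPR on restored clouds
        with torch.no_grad():
            restored = ldr(noisy_pc)
        loss = lpr_loss(lpr(restored), labels)
        opt_lpr.zero_grad()
        loss.backward()
        opt_lpr.step()
    return loss.item()
```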
https://arxiv.org/abs/2504.14806
Implicit Neural Representations (INRs), also known as neural fields, have emerged as a powerful paradigm in deep learning, parameterizing continuous spatial fields using coordinate-based neural networks. In this paper, we propose \textbf{PICO}, an INR-based framework for static point cloud compression. Unlike prevailing encoder-decoder paradigms, we decompose the point cloud compression task into two separate stages: geometry compression and attribute compression, each with distinct INR optimization objectives. Inspired by Kolmogorov-Arnold Networks (KANs), we introduce a novel network architecture, \textbf{LeAFNet}, which leverages learnable activation functions in the latent space to better approximate the target signal's implicit function. By reformulating point cloud compression as neural parameter compression, we further improve compression efficiency through quantization and entropy coding. Experimental results demonstrate that \textbf{LeAFNet} outperforms conventional MLPs in INR-based point cloud compression. Furthermore, \textbf{PICO} achieves superior geometry compression performance compared to the current MPEG point cloud compression standard, yielding an average improvement of $4.92$ dB in D1 PSNR. In joint geometry and attribute compression, our approach exhibits highly competitive results, with an average PCQM gain of $2.7 \times 10^{-3}$.
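The "neural parameter compression" step can be illustrated with a simple uniform weight quantizer (entropy coding omitted); the bit width and quantization scheme below are assumptions rather than PICO's exact pipeline.

```python
# Uniformly quantize INR weights to n bits and keep the scale for dequantization.
import numpy as np

def quantize_weights(w: np.ndarray, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    dtype = np.int8 if bits <= 8 else np.int16
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(dtype)
    return q, scale          # q would then be entropy coded into the bitstream

def dequantize_weights(q, scale):
    return q.astype(np.float32) * scale
```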
https://arxiv.org/abs/2504.14471
Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Due to the poor performance of simply transferring Mamba to 3D SNNs, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIOU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.
https://arxiv.org/abs/2504.14371
Climate-smart and biodiversity-preserving forestry demands precise information on forest resources, extending to the individual tree level. Multispectral airborne laser scanning (ALS) has shown promise in automated point cloud processing and tree segmentation, but challenges remain in identifying rare tree species and leveraging deep learning techniques. This study addresses these gaps by conducting a comprehensive benchmark of machine learning and deep learning methods for tree species classification. For the study, we collected high-density multispectral ALS data (>1000 pts/m$^2$) at three wavelengths using the FGI-developed HeliALS system, complemented by existing Optech Titan data (35 pts/m$^2$), to evaluate the species classification accuracy of various algorithms in a test site located in Southern Finland. Based on 5261 test segments, our findings demonstrate that point-based deep learning methods, particularly a point transformer model, outperformed traditional machine learning and image-based deep learning approaches on high-density multispectral point clouds. For the high-density ALS dataset, a point transformer model provided the best performance reaching an overall (macro-average) accuracy of 87.9% (74.5%) with a training set of 1065 segments and 92.0% (85.1%) with 5000 training segments. The best image-based deep learning method, DetailView, reached an overall (macro-average) accuracy of 84.3% (63.9%), whereas a random forest (RF) classifier achieved an overall (macro-average) accuracy of 83.2% (61.3%). Importantly, the overall classification accuracy of the point transformer model on the HeliALS data increased from 73.0% with no spectral information to 84.7% with single-channel reflectance, and to 87.9% with spectral information of all the three channels.
https://arxiv.org/abs/2504.14337
We introduce a novel representation for learning and generating Computer-Aided Design (CAD) models in the form of $\textit{boundary representations}$ (B-Reps). Our representation unifies the continuous geometric properties of B-Rep primitives in different orders (e.g., surfaces and curves) and their discrete topological relations in a $\textit{holistic latent}$ (HoLa) space. This is based on the simple observation that the topological connection between two surfaces is intrinsically tied to the geometry of their intersecting curve. Such a prior allows us to reformulate topology learning in B-Reps as a geometric reconstruction problem in Euclidean space. Specifically, we eliminate the presence of curves, vertices, and all the topological connections in the latent space by learning to distinguish and derive curve geometries from a pair of surface primitives via a neural intersection network. To this end, our holistic latent space is only defined on surfaces but encodes a full B-Rep model, including the geometry of surfaces, curves, vertices, and their topological relations. Our compact and holistic latent space facilitates the design of a first diffusion-based generator to take on a large variety of inputs including point clouds, single/multi-view images, 2D sketches, and text prompts. Our method significantly reduces ambiguities, redundancies, and incoherences among the generated B-Rep primitives, as well as training complexities inherent in prior multi-step B-Rep learning pipelines, while achieving a greatly improved validity rate over the current state of the art: 82% vs. $\approx$50%.
https://arxiv.org/abs/2504.14257
Point cloud data is pivotal in applications like autonomous driving, virtual reality, and robotics. However, its substantial volume poses significant challenges in storage and transmission. In pursuit of a high compression ratio, crucial semantic details are usually severely damaged, making it difficult to guarantee the accuracy of downstream tasks. To tackle this problem, we are the first to introduce a novel Region of Interest (ROI)-guided Point Cloud Geometry Compression (RPCGC) method for human and machine vision. Our framework employs a dual-branch parallel structure, where the base layer encodes and decodes a simplified version of the point cloud, and the enhancement layer refines this by focusing on geometry details. Furthermore, the residual information of the enhancement layer is refined through an ROI prediction network. This network generates mask information, which is then incorporated into the residuals, serving as a strong supervision signal. We further incorporate these mask details into the Rate-Distortion (RD) optimization process, weighting each point in the distortion calculation. Our loss function includes an RD loss and a detection loss to better guide point cloud encoding for the machine. Experimental results demonstrate that RPCGC achieves exceptional compression performance and better detection accuracy (a 10% gain) than some learning-based compression methods at high bitrates on the ScanNet and SUN RGB-D datasets.
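A simplified sketch of an ROI-weighted rate-distortion objective is shown below; it assumes point-wise correspondence between the reconstruction and the target (a real pipeline would use a Chamfer-style distortion), and the weights and trade-off factor are placeholders.

```python
# ROI-weighted rate-distortion loss: per-point squared error is scaled by a
# mask-derived weight before being combined with the rate term.
import torch

def roi_weighted_rd_loss(recon, target, roi_mask, rate_bits, lam=0.01,
                         w_roi=4.0, w_bg=1.0):
    """recon, target: (N, 3); roi_mask: (N,) bool from the ROI prediction net."""
    per_point = ((recon - target) ** 2).sum(dim=-1)        # squared error per point
    weights = torch.where(roi_mask, torch.full_like(per_point, w_roi),
                          torch.full_like(per_point, w_bg))
    distortion = (weights * per_point).mean()
    return distortion + lam * rate_bits                    # rate-distortion trade-off
```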
https://arxiv.org/abs/2504.14240
Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains; for example, an average improvement of 6.5 mIoU (over all tasks) compared with the previous state of the art.
https://arxiv.org/abs/2504.14231
The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at this https URL D3
https://arxiv.org/abs/2504.14221
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce the LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
https://arxiv.org/abs/2504.14151
Self-supervised learning (SSL) has demonstrated remarkable success in 3D point cloud analysis, particularly through masked autoencoders (MAEs). However, existing MAE-based methods lack rotation invariance, leading to significant performance degradation when processing arbitrarily rotated point clouds in real-world scenarios. To address this limitation, we introduce Handcrafted Feature-Based Rotation-Invariant Masked Autoencoder (HFBRI-MAE), a novel framework that refines the MAE design with rotation-invariant handcrafted features to ensure stable feature learning across different orientations. By leveraging both rotation-invariant local and global features for token embedding and position embedding, HFBRI-MAE effectively eliminates rotational dependencies while preserving rich geometric structures. Additionally, we redefine the reconstruction target to a canonically aligned version of the input, mitigating rotational ambiguities. Extensive experiments on ModelNet40, ScanObjectNN, and ShapeNetPart demonstrate that HFBRI-MAE consistently outperforms existing methods in object classification, segmentation, and few-shot learning, highlighting its robustness and strong generalization ability in real-world 3D applications.
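Two classic rotation-invariant handcrafted descriptors, the point-to-centroid distance and the angle to the centroid direction, can be computed as sketched below; the actual features used by HFBRI-MAE may differ.

```python
# Rotation-invariant handcrafted features for one local patch of points.
import numpy as np

def rotation_invariant_features(patch: np.ndarray) -> np.ndarray:
    """patch: (K, 3) points of one local group -> (K, 2) invariant features."""
    centroid = patch.mean(axis=0)
    offsets = patch - centroid
    dists = np.linalg.norm(offsets, axis=1)              # unchanged by any rotation
    ref = centroid / (np.linalg.norm(centroid) + 1e-9)   # direction toward the patch center
    cosang = offsets @ ref / (dists + 1e-9)              # angle w.r.t. that axis, also invariant
    return np.stack([dists, cosang], axis=1)
```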
https://arxiv.org/abs/2504.14132
The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.
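The retrieval of partial-complete reference pairs can be illustrated with a simple nearest-template search under a one-directional Chamfer distance; this metric is an assumption for illustration, not necessarily the retrieval criterion used by RefComp.

```python
# Retrieve the partial-complete pair whose partial member best matches the query.
import numpy as np

def chamfer_one_way(a: np.ndarray, b: np.ndarray) -> float:
    """Mean distance from each point in a (N,3) to its nearest neighbor in b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def retrieve_reference(query_partial, bank):
    """bank: list of (partial, complete) numpy pairs -> best-matching pair."""
    scores = [chamfer_one_way(query_partial, p) for p, _ in bank]
    return bank[int(np.argmin(scores))]
```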
https://arxiv.org/abs/2504.13788
Multi-Layer Perceptrons (MLPs) have become a fundamental architectural component in point cloud analysis due to their effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs' fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model's robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in the few-shot learning task. Additionally, PointKAN achieves substantial reductions in parameter count and computational complexity (FLOPs). This work highlights the potential of KAN-based architectures in 3D vision and opens new avenues for research in point cloud understanding.
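A geometric affine transform on grouped local features, in the spirit of the GAM described above (and of PointMLP's affine module), can be sketched as follows; the exact normalization used by PointKAN may differ.

```python
# Normalize each local group by its own statistics, then rescale with learnable
# affine parameters, making the features robust to local geometric variations.
import torch
import torch.nn as nn

class GeometricAffine(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, 1, 1, channels))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, channels))
        self.eps = eps

    def forward(self, grouped):                 # (B, G, K, C): G groups of K neighbors
        center = grouped.mean(dim=2, keepdim=True)
        var = ((grouped - center) ** 2).mean(dim=(2, 3), keepdim=True)  # per-group scale
        return self.alpha * (grouped - center) / (var.sqrt() + self.eps) + self.beta
```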
https://arxiv.org/abs/2504.13593