Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and to generalize across diverse camera viewpoints. These limitations hinder the policy's effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations, while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, our 3D visual representation pre-training pipeline for robotic tasks unifies coordinate systems across datasets and introduces random fusion of multi-view point clouds, mitigating camera-view ambiguity and improving generalization so that perception remains robust from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.
https://arxiv.org/abs/2507.08262
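As a rough illustration of the contrastive 2D-to-3D knowledge transfer described above, the snippet below sketches a symmetric InfoNCE objective between point-patch embeddings and frozen 2D foundation-model features; the pairing, dimensions, and temperature are illustrative assumptions rather than CL3R's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_2d3d_loss(point_feats, image_feats, temperature=0.07):
    """point_feats: (N, D) embeddings of 3D point patches (e.g., MAE tokens).
    image_feats: (N, D) embeddings of the corresponding 2D regions from a
    frozen foundation model; row i of each tensor is assumed to correspond."""
    p = F.normalize(point_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    logits = p @ v.t() / temperature                  # (N, N) similarity logits
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric InfoNCE: pull matched 3D/2D pairs together in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_2d3d_loss(torch.randn(128, 256), torch.randn(128, 256))
```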
We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds, yet the resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations, and then train a 3D network on these refined labels. This simple method, called LOSC, outperforms the state of the art in zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, by significant margins.
https://arxiv.org/abs/2507.07605
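The back-projection step that LOSC starts from is standard; a minimal sketch is below, assuming known intrinsics `K` and a lidar-to-camera extrinsic `T_cam_from_lidar` (the label consolidation and 3D-network training that follow in the paper are not shown).

```python
import numpy as np

def backproject_labels(points, label_map, K, T_cam_from_lidar):
    """points: (N, 3) lidar points; label_map: (H, W) per-pixel class ids from a
    VLM; K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) extrinsics."""
    H, W = label_map.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])   # homogeneous coords
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]               # lidar -> camera frame
    valid = cam[:, 2] > 0.1                                   # keep points in front
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                               # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(len(points), -1, dtype=int)              # -1 = unlabeled
    labels[valid] = label_map[v[valid], u[valid]]
    return labels
```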
In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.
https://arxiv.org/abs/2507.07435
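To make the multi-scale neighborhood idea concrete, here is a hedged sketch that computes simple covariance-based statistics over k-nearest-neighbor sets at several scales and concatenates them per point; the actual MSND/LFSA design in Simple3D may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def multiscale_descriptors(points, ks=(8, 16, 32)):
    """points: (N, 3). Returns (N, 4 * len(ks)) per-point descriptors."""
    tree = cKDTree(points)
    feats = []
    for k in ks:
        _, idx = tree.query(points, k=k)                 # (N, k) neighbor indices
        neigh = points[idx]                              # (N, k, 3)
        centered = neigh - neigh.mean(axis=1, keepdims=True)
        spread = np.linalg.norm(centered, axis=-1).mean(axis=1, keepdims=True)
        cov = np.einsum('nki,nkj->nij', centered, centered) / k
        eigvals = np.linalg.eigvalsh(cov)                # (N, 3) local shape cues
        feats.append(np.concatenate([spread, eigvals], axis=1))
    return np.concatenate(feats, axis=1)

desc = multiscale_descriptors(np.random.rand(10000, 3))
```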
Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparatively sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for the changing density of sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure that refines instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.
https://arxiv.org/abs/2507.06906
Manipulating articulated tools, such as tweezers or scissors, has rarely been explored in previous research. Unlike rigid tools, articulated tools change their shape dynamically, creating unique challenges for dexterous robotic hands. In this work, we present a hierarchical, goal-conditioned reinforcement learning (GCRL) framework to improve the manipulation capabilities of anthropomorphic robotic hands using articulated tools. Our framework comprises two policy layers: (1) a low-level policy that enables the dexterous hand to manipulate the tool into various configurations for objects of different sizes, and (2) a high-level policy that defines the tool's goal state and controls the robotic arm for object-picking tasks. We employ an encoder, trained on synthetic point clouds, to estimate the tool's affordance states--specifically, how different tool configurations (e.g., tweezer opening angles) enable grasping of objects of varying sizes--from input point clouds, thereby enabling precise tool manipulation. We also utilize a privilege-informed heuristic policy to generate a replay buffer, improving the training efficiency of the high-level policy. We validate our approach through real-world experiments, showing that the robot can effectively manipulate a tweezer-like tool to grasp objects of diverse shapes and sizes with a 70.8% success rate. This study highlights the potential of RL to advance dexterous robotic manipulation of articulated tools.
https://arxiv.org/abs/2507.06822
This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird's-eye-view representations. Our lightweight neural network, trained on automatically generated LiDAR-based ground truth, achieves real-time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30-meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.
https://arxiv.org/abs/2507.06687
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching, and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at this https URL.
https://arxiv.org/abs/2507.06662
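One common way to realize "soft heatmap matching" for keypoints is a differentiable soft-argmax over predicted heatmaps, sketched below; this is a generic illustration, not MK-Pose's exact module.

```python
import torch

def soft_argmax_2d(heatmaps, temperature=1.0):
    """heatmaps: (B, K, H, W) unnormalized scores -> (B, K, 2) (x, y) coordinates."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(B, K, -1) / temperature, dim=-1).view(B, K, H, W)
    xs = torch.arange(W, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(H, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expectation of the column index
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expectation of the row index
    return torch.stack([x, y], dim=-1)        # differentiable keypoint locations

coords = soft_argmax_2d(torch.randn(2, 10, 64, 64))
```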
Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. Inspired by the cross-modal generation success of recent large diffusion models, we propose Diff²I2P, a fully Differentiable I2P registration framework that leverages a novel and effective Diffusion prior to bridge the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model and directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and the PnP solver. We therefore propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by transformation estimation using a differentiable PnP solver. With these two designs, the diffusion model serves as a strong prior that guides the cross-modal feature learning of image and point cloud toward robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff²I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark.
https://arxiv.org/abs/2507.06651
In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing an efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry of each view. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence restricts the projection rays to pre-defined, manually set parameters, limiting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projection redundancy leads to excessive computational overhead and inefficient image processing. To address these limitations, we design the VDP framework to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays in a manner inspired by the adaptive behavior of fireworks. In addition, we construct a color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses non-semantic features within black pixels, thereby maximizing 2D space utilization in the projected image. As a result, our approach, PointVDP, develops lightweight projections at marginal computational cost. Experiments on the S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.
https://arxiv.org/abs/2507.06618
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore per-point ambiguities and the less discriminative features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework and tailors adaptive objectives to individual points based on their ambiguity levels. As a result, our method promotes model training that ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from the generated embeddings. To this end, we design a masked refinement mechanism that leverages the predicted ambiguities to make ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on the 3D indoor scene datasets S3DIS and ScanNet demonstrate the effectiveness of the proposed method. Code is available at this https URL.
https://arxiv.org/abs/2507.06592
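The adaptive-margin idea can be pictured as a per-point margin that shrinks with ambiguity, so hard constraints on highly ambiguous points are relaxed. The sketch below uses an additive margin on the positive logit of an InfoNCE-style loss; the paper's exact objective and ambiguity definition may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_contrastive(anchor, positive, negatives, ambiguity,
                                base_margin=0.3, temperature=0.1):
    """anchor, positive: (N, D); negatives: (N, M, D); ambiguity: (N,) in [0, 1].
    The positive logit is reduced by a margin that shrinks as ambiguity grows,
    so highly ambiguous points are penalized less for imperfect alignment."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    margin = base_margin * (1.0 - ambiguity)              # per-point adaptive margin
    pos = (a * p).sum(-1) - margin                        # (N,)
    neg = torch.einsum('nd,nmd->nm', a, n)                # (N, M)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature
    targets = torch.zeros(len(a), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)

loss = adaptive_margin_contrastive(torch.randn(64, 128), torch.randn(64, 128),
                                    torch.randn(64, 16, 128), torch.rand(64))
```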
Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, making it difficult for handcrafted geometric constraints to enforce consistency. To overcome this, we propose HyperGCT, a flexible, dynamic Hyper-GNN-learned geometric ConstrainT that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, HyperGCT is robust to graph noise, demonstrating a significant advantage in terms of generalization.
https://arxiv.org/abs/2503.02195
This paper presents a framework for mapping underwater caves. Underwater caves are crucial for fresh water resource management, underwater archaeology, and hydrogeology. Mapping the cave's outline and dimensions, as well as creating photorealistic 3D maps, is critical for enabling a better understanding of this underwater domain. In this paper, we present the mapping of an underwater cave segment (the catacombs) of the Devil's Eye cave system at Ginnie Springs, FL. We utilized a set of inexpensive action cameras in conjunction with a dive computer to estimate the trajectories of the cameras together with a sparse point cloud. The resulting reconstructions are utilized to produce a one-dimensional retract of the cave passages in the form of the average trajectory together with the boundaries (top, bottom, left, and right). The use of the dive computer enables the observability of the z-dimension in addition to the roll and pitch in a visual/inertial framework (SVIn2). In addition, the keyframes generated by SVIn2 together with the estimated camera poses for select areas are used as input to a global optimization (bundle adjustment) framework -- COLMAP -- in order to produce a dense reconstruction of those areas. The same cave segment is manually surveyed using the MNemo V2 instrument, providing an additional set of measurements validating the proposed approach. It is worth noting that with the use of action cameras, the primary components of a cave map can be constructed. Furthermore, with the utilization of a global optimization framework guided by the results of VI-SLAM package SVIn2, photorealistic dense 3D representations of selected areas can be reconstructed.
https://arxiv.org/abs/2507.06397
Conservation and decision-making regarding forest resources necessitate regular forest inventory. Light detection and ranging (LiDAR) in laser scanning systems has gained significant attention over the past two decades as a remote and non-destructive solution to streamline the labor-intensive and time-consuming procedure of forest inventory. Advanced multispectral (MS) LiDAR systems simultaneously acquire three-dimensional (3D) spatial and spectral information across multiple wavelengths of the electromagnetic spectrum. Consequently, MS-LiDAR technology enables the estimation of both the biochemical and biophysical characteristics of forests. Forest component segmentation is crucial for forest inventory. The synergistic use of spatial and spectral laser information has proven to be beneficial for achieving precise forest semantic segmentation. Thus, this study aims to investigate the potential of MS-LiDAR data, captured by the HeliALS system, providing high-density multispectral point clouds to segment forests into six components: ground, low vegetation, trunks, branches, foliage, and woody debris. Three point-wise 3D deep learning models and one machine learning model, including kernel point convolution, superpoint transformer, point transformer V3, and random forest, are implemented. Our experiments confirm the superior accuracy of the KPConv model. Additionally, various geometric and spectral feature vector scenarios are examined. The highest accuracy is achieved by feeding all three wavelengths (1550 nm, 905 nm, and 532 nm) as the initial features into the deep learning model, resulting in improvements of 33.73% and 32.35% in mean intersection over union (mIoU) and in mean accuracy (mAcc), respectively. This study highlights the excellent potential of multispectral LiDAR for improving the accuracy in fully automated forest component segmentation.
https://arxiv.org/abs/2507.08025
Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
https://arxiv.org/abs/2507.06161
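A small sketch of the core construction, under illustrative assumptions (Gaussian kernel, fixed bandwidth): a symmetric Sinkhorn-style scaling turns a positive similarity matrix W into the doubly stochastic operator diag(s) W diag(s), which can then be applied as a diffusion-like smoothing step.

```python
import numpy as np

def symmetric_sinkhorn(W, n_iter=100):
    """W: (N, N) symmetric positive similarity matrix. Returns diag(s) @ W @ diag(s)
    with rows and columns summing (approximately) to one."""
    s = np.ones(len(W))
    for _ in range(n_iter):
        s = np.sqrt(s / (W @ s))          # damped fixed point of s_i * (W s)_i = 1
    return (s[:, None] * W) * s[None, :]

# Example: Gaussian kernel on a random point cloud, then one smoothing step.
pts = np.random.rand(500, 3)
d2 = ((pts[:, None] - pts[None, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.01)
P = symmetric_sinkhorn(W)                 # diffusion-like smoothing operator
smoothed = P @ np.random.rand(500)        # heat-diffusion-style smoothing of a signal
```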
Accurate geo-registration of LiDAR point clouds presents significant challenges in GNSS-denied urban areas with high-rise buildings and bridges. Existing methods typically rely on real-time GNSS and IMU data, which require pre-calibration and assume stable positioning during data collection. However, this assumption often fails in dense urban areas, resulting in localization errors. To address this, we propose a structured geo-registration and spatial correction method that aligns 3D point clouds with satellite images, enabling frame-wise recovery of GNSS information and reconstruction of city-scale 3D maps without relying on prior localization. The proposed approach employs a pre-trained Point Transformer model to segment the road points, and then extracts the road skeleton and intersection points from both the point cloud and the target map for alignment. Global rigid alignment of the two is performed using the intersection points, followed by local refinement using radial basis function (RBF) interpolation. Elevation correction is then applied to the point cloud based on terrain information from the SRTM dataset to resolve vertical discrepancies. The proposed method was tested on the popular KITTI benchmark and a locally collected Perth (Western Australia) CBD dataset. On the KITTI dataset, our method achieved an average planimetric alignment standard deviation (STD) of 0.84 m across sequences with intersections, a 55.3% improvement over the original dataset. On the Perth dataset, which lacks GNSS information, our method achieved an average STD of 0.96 m compared to GPS data extracted from the Google Maps API, a 77.4% improvement over the initial alignment. Our method also yielded elevation correlation gains of 30.5% on the KITTI dataset and 50.4% on the Perth dataset.
https://arxiv.org/abs/2507.05999
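The two alignment stages map onto standard tools: a rigid Kabsch/SVD fit on matched intersection points, then RBF interpolation of the residuals as a local correction. The sketch below assumes 2D planimetric coordinates and a thin-plate-spline kernel, which are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def rigid_align(src, dst):
    """Least-squares rotation + translation mapping src -> dst, both (N, 2)."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T      # guard against reflections
    t = cd - R @ cs
    return R, t

# Matched intersection points in point-cloud (src) and satellite-map (dst) frames.
theta = np.deg2rad(12.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
src = np.random.rand(12, 2) * 100.0
dst = src @ R_true.T + np.array([5.0, -3.0]) + np.random.randn(12, 2) * 0.2

R, t = rigid_align(src, dst)
aligned = src @ R.T + t
# Local refinement: interpolate the residual offsets with an RBF; the same
# interpolator can then warp every point in the cloud, not just the intersections.
rbf = RBFInterpolator(aligned, dst - aligned, kernel='thin_plate_spline')
refined = aligned + rbf(aligned)
```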
Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.
https://arxiv.org/abs/2507.05859
As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.
https://arxiv.org/abs/2507.05814
Robots struggle to understand object properties like shape, material, and semantics due to limited prior knowledge, hindering manipulation in unstructured environments. In contrast, humans learn these properties through interactive multi-sensor exploration. This work proposes fusing visual and tactile observations into a unified Gaussian Process Distance Field (GPDF) representation for active perception of object properties. While primarily focusing on geometry, this approach also demonstrates potential for modeling surface properties beyond geometry. The GPDF encodes signed distance using a point cloud, analytic gradients and Hessians, and surface uncertainty estimates--attributes that common neural network shape representations lack. By utilizing a point cloud to construct a distance function, GPDF does not need extensive pretraining on large datasets and can incorporate observations by aggregation. Starting with an initial visual shape estimate, the framework iteratively refines the geometry by integrating dense vision measurements via differentiable rendering and tactile measurements at uncertain surface regions. By quantifying multi-sensor uncertainties, it plans exploratory motions that maximize information gain for recovering precise 3D structures. For the real-world robot experiments, we use a Franka Research 3 manipulator fixed to a table, with a customized DIGIT tactile sensor and an Intel RealSense D435 RGBD camera mounted on the end-effector. In these experiments, the robot explores the shape and properties of objects assumed to be static and placed on the table. To improve scalability, we investigate approximation methods such as the inducing point method for Gaussian Processes. This probabilistic multi-modal fusion enables active exploration and mapping of complex object geometries, potentially extending beyond geometry.
https://arxiv.org/abs/2507.05522
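As a toy illustration of a GP distance field, the snippet below regresses a signed distance from surface samples of a circle (offsets along the normals provide off-surface targets) and returns both a mean and a standard deviation per query; the latter is the kind of surface uncertainty the framework uses to pick informative tactile probes. The real GPDF formulation (analytic gradients/Hessians, aggregation of new observations, inducing points) is richer than this sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Surface samples of a unit circle stand in for a fused vision/touch point cloud.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
surface = np.stack([np.cos(angles), np.sin(angles)], axis=1)
normals = surface.copy()                      # for a unit circle, normal == position

eps = 0.1                                     # 0 on-surface, +/- eps along the normals
X = np.vstack([surface, surface + eps * normals, surface - eps * normals])
y = np.concatenate([np.zeros(60), np.full(60, eps), np.full(60, -eps)])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-4)
gp.fit(X, y)

# Mean approximates signed distance near the surface; std grows away from the
# observations and flags where the next measurement would be most informative.
queries = np.array([[1.05, 0.0], [0.0, 0.7], [2.0, 2.0]])
mean, std = gp.predict(queries, return_std=True)
```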
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at this https URL.
https://arxiv.org/abs/2507.05211
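A simple way to picture the offline multimodal supervision is to score point features against frozen class-query embeddings (e.g., CLIP encodings of LLM-generated descriptions) by cosine similarity and train with cross-entropy, as sketched below; the dimensions, projection head, and single-loss form are assumptions, not the paper's full design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryClassifier(nn.Module):
    def __init__(self, point_dim, query_embeds, temperature=0.07):
        super().__init__()
        self.proj = nn.Linear(point_dim, query_embeds.shape[1])
        # Class queries are produced offline and kept frozen during training.
        self.register_buffer('queries', F.normalize(query_embeds, dim=-1))
        self.temperature = temperature

    def forward(self, point_feats):                       # (N, point_dim)
        z = F.normalize(self.proj(point_feats), dim=-1)
        return z @ self.queries.t() / self.temperature    # (N, num_classes) logits

num_classes, query_dim = 13, 512
model = QueryClassifier(point_dim=96,
                        query_embeds=torch.randn(num_classes, query_dim))
logits = model(torch.randn(2048, 96))
loss = F.cross_entropy(logits, torch.randint(0, num_classes, (2048,)))
```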
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinates or features of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, which prevents it from capturing generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. First, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means on features extracted from the complete patches, encouraging the codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignments for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from the proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at this https URL.
https://arxiv.org/abs/2507.04801
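The teacher-side codebook can be pictured as an online k-means with momentum updates: assign features to their nearest center, then nudge each used center toward the mean of its assigned features. The sketch below is one plausible reading of that procedure, not the paper's exact maintenance mechanism.

```python
import torch

class OnlineCodebook:
    def __init__(self, num_codes, dim, momentum=0.99):
        self.codes = torch.randn(num_codes, dim)
        self.momentum = momentum

    def assign(self, feats):
        """Nearest-center assignment: feats (N, D) -> cluster indices (N,)."""
        return torch.cdist(feats, self.codes).argmin(dim=1)

    def update(self, feats):
        """EMA-style online k-means step on the centers that received features."""
        idx = self.assign(feats)
        for k in idx.unique():
            mean_k = feats[idx == k].mean(dim=0)
            self.codes[k] = self.momentum * self.codes[k] + (1.0 - self.momentum) * mean_k
        return idx

codebook = OnlineCodebook(num_codes=64, dim=256)
teacher_feats = torch.randn(1024, 256)         # features from complete patches
targets = codebook.update(teacher_feats)       # cluster ids the student learns to match
```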