Autonomous docking remains one of the most challenging maneuvers in marine robotics, requiring precise control and robust perception in confined spaces. This paper presents a novel approach integrating Model Predictive Path Integral (MPPI) control with real-time LiDAR-based dock detection for autonomous surface vessel docking. Our framework uniquely combines probabilistic trajectory optimization with a multi-objective cost function that simultaneously considers docking precision, safety constraints, and motion efficiency. The MPPI controller generates optimal trajectories by intelligently sampling control sequences and evaluating their costs based on dynamic clearance requirements, orientation alignment, and target position objectives. We introduce an adaptive dock detection pipeline that processes LiDAR point clouds to extract critical geometric features, enabling real-time updates of docking parameters. The proposed method is extensively validated in a physics-based simulation environment that incorporates realistic sensor noise, vessel dynamics, and environmental constraints. Results demonstrate successful docking from various initial positions while maintaining safe clearances and smooth motion characteristics.
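The abstract does not spell out the MPPI update itself; the sketch below shows the generic sampling-and-reweighting step the description implies, with assumed placeholders (a `dynamics` callable for the vessel model and a single `cost` callable standing in for the clearance, alignment, and position terms), not the authors' exact formulation.

```python
import numpy as np

def mppi_step(x0, u_nominal, dynamics, cost, horizon=30, samples=256,
              noise_std=0.3, temperature=1.0):
    """One MPPI iteration: sample perturbed control sequences, roll them out through
    the vessel model, and reweight by exponentiated cost (generic form only)."""
    n_u = u_nominal.shape[1]
    noise = np.random.randn(samples, horizon, n_u) * noise_std
    costs = np.zeros(samples)
    for k in range(samples):
        x = x0.copy()
        for t in range(horizon):
            u = u_nominal[t] + noise[k, t]
            x = dynamics(x, u)              # assumed vessel kinematics/dynamics model
            costs[k] += cost(x, u)          # clearance + orientation + position terms
    beta = costs.min()
    weights = np.exp(-(costs - beta) / temperature)
    weights /= weights.sum()
    # Weighted average of the sampled perturbations updates the nominal control plan.
    return u_nominal + np.einsum('k,ktu->tu', weights, noise)
```

In practice the first control of the returned sequence is applied and the remainder is reused as the warm start for the next cycle.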
https://arxiv.org/abs/2501.09668
Visual-Spatial Systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement was proposed by integrating computer vision technologies and multi-modal Simultaneous Localization and Mapping (SLAM) in this study. Firstly, building on a base DeepLabv3+ segmentation model and incorporating specific refinements utilizing the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within the 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.
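The abstract describes fusing 2D crack masks with LiDAR data but not the mechanics; a minimal illustration of one common way to do this, projecting LiDAR points through known camera intrinsics and extrinsics so they inherit mask labels, is given below. The transform and intrinsic names are assumptions, not the paper's fusion framework.

```python
import numpy as np

def label_crack_points(points_world, T_world_to_cam, K, crack_mask):
    """Label 3D points as crack/non-crack by projecting them into an image whose
    binary crack mask came from the 2D segmentation stage (illustrative only)."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]            # into the camera frame
    in_front = pts_cam[:, 2] > 0
    uv = (K @ pts_cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)        # pixel coordinates
    h, w = crack_mask.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    labels = np.zeros(len(points_world), dtype=bool)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = crack_mask[uv[valid, 1], uv[valid, 0]] > 0
    return labels
```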
https://arxiv.org/abs/2501.09203
For Minimally Invasive Surgical (MIS) robots, accurate haptic interaction force feedback is essential for ensuring the safety of interacting with soft tissue. However, most existing MIS robotic systems cannot facilitate direct measurement of the interaction force with hardware sensors due to space limitations. This letter introduces an effective vision-based scheme that utilizes One-Shot structured light projection of a designed pattern onto soft tissue, coupled with haptic information processing through a trained image-to-force neural network. The images captured from the endoscopic stereo camera are analyzed to reconstruct high-resolution 3D point clouds of soft tissue deformation. Based on this, a modified PointNet-based force estimation method is proposed, which excels in representing the complex mechanical properties of soft tissue. Numerical force interaction experiments are conducted on three silicone materials with different stiffnesses. The results validate the effectiveness of the proposed scheme.
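As a rough illustration of what a PointNet-style force estimator looks like, the snippet below maps a reconstructed deformation point cloud to a 3-axis force with a shared per-point MLP and symmetric max pooling. Layer sizes and the plain regression head are assumptions; the paper's modified network is not reproduced here.

```python
import torch
import torch.nn as nn

class PointNetForceRegressor(nn.Module):
    """Minimal PointNet-style regressor: per-point MLP, max pooling, force head."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),                     # predicted force vector (Fx, Fy, Fz)
        )

    def forward(self, pts):                        # pts: (batch, num_points, 3)
        feat = self.point_mlp(pts)                 # per-point features
        global_feat = feat.max(dim=1).values       # order-invariant pooling
        return self.head(global_feat)

# Example: a batch of two reconstructed tissue-surface clouds with 1024 points each.
forces = PointNetForceRegressor()(torch.randn(2, 1024, 3))
```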
https://arxiv.org/abs/2501.08593
We propose GOTLoc, a robust localization method capable of operating even in outdoor environments where GPS signals are unavailable. The method achieves this robust localization by leveraging comparisons between scene graphs generated from text descriptions and maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps have limited scalability as it is impractical to pre-generate maps for all outdoor spaces. Furthermore, their large data size makes it challenging to store and utilize them directly on actual robots. To address these issues, GOTLoc leverages compact data structures, such as scene graphs, to store spatial information, enabling individual robots to carry and utilize large amounts of map data. Additionally, by utilizing publicly available map data, such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the need for additional effort to create custom map data. For performance evaluation, we utilized the KITTI360Pose dataset in conjunction with corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage compared to point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at this https URL.
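The core retrieval step, comparing a query scene graph against stored map graphs, can be illustrated with a toy bag-of-triplets encoding and cosine similarity; GOTLoc learns its graph embeddings, so the encoding below is purely a stand-in to show the compare-and-retrieve mechanics.

```python
import numpy as np

def retrieve_best_cell(query_graph, cell_graphs, vocab):
    """Pick the map cell whose scene graph is most similar to the query graph.
    Graphs are iterables of (subject, relation, object) triplets; `vocab` maps
    triplets to vector indices. Toy encoding, not the learned one."""
    def embed(graph):
        v = np.zeros(len(vocab))
        for triplet in graph:
            if triplet in vocab:
                v[vocab[triplet]] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)
    q = embed(query_graph)
    sims = np.array([embed(g) @ q for g in cell_graphs])
    return int(sims.argmax()), float(sims.max())
```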
https://arxiv.org/abs/2501.08575
Radar is a low-cost and ubiquitous automotive sensor, but is limited by array resolution and sensitivity when performing direction of arrival analysis. Synthetic Aperture Radar (SAR) is a class of techniques to improve azimuth resolution and sensitivity for radar. Interferometric SAR (InSAR) can be used to extract elevation from the variations in phase measurements in SAR images. Utilizing InSAR we show that a typical, low-resolution radar array mounted on a vehicle can be used to accurately localize detections in 3D space for both urban and agricultural environments. We generate point clouds in each environment by combining InSAR with a signal processing scheme tailored to automotive driving. This low-compute approach allows radar to be used as a primary sensor to map fine details in complex driving environments, and be used to make autonomous perception decisions.
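For readers unfamiliar with how interferometric phase becomes elevation, the standard across-track InSAR sensitivity relation is sketched below; the symbols (perpendicular baseline, slant range, look angle) follow the textbook formulation rather than anything specific to this paper, and the phase is assumed flattened and unwrapped.

```python
import numpy as np

def insar_height(unwrapped_phase, wavelength, perp_baseline, slant_range, look_angle):
    """Textbook repeat-pass relation: h ~ (lambda * R * sin(theta) / (4*pi * B_perp)) * phi.
    Single-pass systems with one transmitter and two receivers use 2*pi instead of 4*pi."""
    return (wavelength * slant_range * np.sin(look_angle) * unwrapped_phase
            / (4.0 * np.pi * perp_baseline))

# Example: 77 GHz automotive radar (wavelength ~3.9 mm), 10 cm baseline, 20 m slant range.
h = insar_height(unwrapped_phase=0.5, wavelength=3.9e-3, perp_baseline=0.10,
                 slant_range=20.0, look_angle=np.deg2rad(80.0))
```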
https://arxiv.org/abs/2501.08495
Bird's Eye View perception models require extensive data to perform and generalize effectively. While traditional datasets often provide abundant driving scenes from diverse locations, this is not always the case. It is crucial to maximize the utility of the available training data. With the advent of large foundation models such as DINOv2 and Metric3Dv2, a pertinent question arises: can these models be integrated into existing model architectures not only to reduce the required training data but also to surpass the performance of current models? We choose two model architectures in the vehicle segmentation domain to alter: Lift-Splat-Shoot and Simple-BEV. For Lift-Splat-Shoot, we explore the implementation of frozen DINOv2 for feature extraction and Metric3Dv2 for depth estimation, where we greatly exceed the baseline results by 7.4 IoU while utilizing only half the training data and iterations. Furthermore, we introduce an innovative application of Metric3Dv2's depth information as a PseudoLiDAR point cloud incorporated into the Simple-BEV architecture, replacing traditional LiDAR. This integration results in a +3 IoU improvement compared to the Camera-only model.
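The PseudoLiDAR step is the standard back-projection of a metric depth map into a camera-frame point cloud; a minimal version is shown below, assuming known pinhole intrinsics. This is the generic operation, not the authors' exact integration into Simple-BEV.

```python
import numpy as np

def depth_to_pseudolidar(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (e.g., from Metric3Dv2) into a pseudo-LiDAR
    point cloud expressed in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop invalid / zero-depth pixels
```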
https://arxiv.org/abs/2501.08118
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which results in prolonged training durations and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives 3D point clouds as input and projects 3D features fused with text instructions into a manageable set of tokens. Considering the computation burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs; for instance, 3UR-LLM exceeds its counterparts by 7.1% CIDEr on ScanQA, while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160K benchmark are available at 3UR-LLM.
https://arxiv.org/abs/2501.07819
Discriminative features are crucial for point cloud registration. Recent methods improve feature discriminability by distinguishing between non-overlapping and overlapping region points. However, they still face challenges in distinguishing the ambiguous structures in the overlapping regions. Therefore, the ambiguous features they extract result in a significant number of outlier matches from overlapping regions. To solve this problem, we propose a prior-guided SMoE-based registration method to improve feature distinctiveness by dispatching the potential correspondences to the same experts. Specifically, we propose a prior-guided SMoE module by fusing prior overlap and potential correspondence embeddings for routing, assigning tokens to the most suitable experts for processing. In addition, we propose a registration framework with a specific combination of Transformer layers and the prior-guided SMoE module. The proposed method not only pays attention to the importance of locating the overlapping areas of point clouds, but also commits to finding more accurate correspondences in overlapping areas. Our extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art registration recall (95.7%/79.3%) on the 3DMatch/3DLoMatch benchmark. Moreover, we also test the performance on ModelNet40 and demonstrate excellent performance.
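A compact way to picture the prior-guided routing is a gate that sees each token's feature concatenated with its overlap/correspondence prior embedding; the top-1 sparse-MoE sketch below uses assumed dimensions and a plain linear gate, and is not the paper's module.

```python
import torch
import torch.nn as nn

class PriorGuidedRouter(nn.Module):
    """Top-1 sparse MoE routing conditioned on a prior embedding, so tokens with
    similar priors tend to be dispatched to the same expert (illustrative sketch)."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts))

    def forward(self, tokens, prior_embed):
        # tokens: (N, dim) point features; prior_embed: (N, dim) overlap/correspondence prior.
        probs = self.gate(torch.cat([tokens, prior_embed], dim=-1)).softmax(dim=-1)
        expert_id = probs.argmax(dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_id == e
            if mask.any():
                out[mask] = expert(tokens[mask]) * probs[mask, e].unsqueeze(-1)
        return out
```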
https://arxiv.org/abs/2501.07762
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly available collection of high-resolution videos of objects with 3D annotations that ensures full 360° coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.
https://arxiv.org/abs/2501.07574
3D Gaussian Splatting (3DGS) excels at producing highly detailed 3D reconstructions, but these scenes often require specialised renderers for effective visualisation. In contrast, point clouds are a widely used 3D representation and are compatible with most popular 3D processing software, yet converting 3DGS scenes into point clouds is a complex challenge. In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds. We sample points probabilistically from each Gaussian as a 3D density function. We additionally threshold new points using the Mahalanobis distance to the Gaussian centre, preventing extreme outliers. The result is a point cloud that closely represents the shape encoded into the 3D Gaussian scene. Individual Gaussians use spherical harmonics to adapt colours depending on view, and each point may contribute only subtle colour hints to the resulting rendered scene. To avoid spurious or incorrect colours that do not fit with the final point cloud, we recalculate Gaussian colours via a customised image rendering approach, assigning each Gaussian the colour of the pixel to which it contributes most across all views. 3DGS-to-PC also supports mesh generation through Poisson Surface Reconstruction, applied to points sampled from predicted surface Gaussians. This allows coloured meshes to be generated from 3DGS scenes without the need for re-training. This package is highly customisable and capable of simple integration into existing 3DGS pipelines. 3DGS-to-PC provides a powerful tool for converting 3DGS data into point cloud and surface-based formats.
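The sampling idea described above, treating each Gaussian as a 3D density and rejecting samples that lie too far from the centre in Mahalanobis distance, can be written down directly; the per-Gaussian point budget below is a simple opacity- and volume-weighted heuristic, not the package's exact scheme.

```python
import numpy as np

def sample_points_from_gaussians(means, covariances, opacities,
                                 points_per_unit=2000, max_mahalanobis=3.0):
    """Draw points from each 3D Gaussian and keep those within a Mahalanobis-distance
    threshold of its centre, suppressing extreme outliers."""
    all_pts = []
    for mu, cov, alpha in zip(means, covariances, opacities):
        n = max(1, int(points_per_unit * alpha * np.sqrt(np.linalg.det(cov))))
        samples = np.random.multivariate_normal(mu, cov, size=n)
        d = samples - mu
        m_dist = np.sqrt(np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d))
        all_pts.append(samples[m_dist <= max_mahalanobis])
    return np.vstack(all_pts)
```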
https://arxiv.org/abs/2501.07478
Consistent maps are key for most autonomous mobile robots. They often use SLAM approaches to build such maps. Loop closures via place recognition help maintain accurate pose estimates by mitigating global drift. This paper presents a robust loop closure detection pipeline for outdoor SLAM with LiDAR-equipped robots. The method handles various LiDAR sensors with different scanning patterns, fields of view and resolutions. It generates local maps from LiDAR scans and aligns them using a ground alignment module to handle both planar and non-planar motion of the LiDAR, ensuring applicability across platforms. The method uses density-preserving bird's eye view projections of these local maps and extracts ORB feature descriptors from them for place recognition. It stores the feature descriptors in a binary search tree for efficient retrieval, and self-similarity pruning addresses perceptual aliasing in repetitive environments. Extensive experiments on public and self-recorded datasets demonstrate accurate loop closure detection, long-term localization, and cross-platform multi-map alignment, agnostic to the LiDAR scanning patterns, fields of view, and motion profiles.
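A minimal version of the BEV-plus-ORB place-recognition front end might look like the following: accumulate a density-preserving bird's-eye-view image from a local map and run OpenCV's ORB on it. Grid size, resolution, and normalisation are illustrative choices, not the paper's settings.

```python
import numpy as np
import cv2

def bev_orb_descriptors(points, resolution=0.2, grid=256, n_features=500):
    """Project a local map to a density-preserving BEV image and extract ORB features."""
    half = grid * resolution / 2.0
    keep = (np.abs(points[:, 0]) < half) & (np.abs(points[:, 1]) < half)
    ij = ((points[keep, :2] + half) / resolution).astype(int)
    img = np.zeros((grid, grid), dtype=np.float32)
    np.add.at(img, (ij[:, 1], ij[:, 0]), 1.0)                 # per-cell point counts
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    orb = cv2.ORB_create(nfeatures=n_features)
    return orb.detectAndCompute(img, None)                     # keypoints, descriptors
```

The binary descriptors can then be indexed in a search tree and matched against incoming local maps to propose loop-closure candidates.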
https://arxiv.org/abs/2501.07399
3D single object tracking (3DSOT) in LiDAR point clouds is a critical task for outdoor perception, enabling real-time perception of object location, orientation, and motion. Despite the impressive performance of current 3DSOT methods, evaluating them on clean datasets inadequately reflects their comprehensive performance, as the adverse weather conditions in real-world surroundings have not been considered. One of the main obstacles is the lack of adverse weather benchmarks for the evaluation of 3DSOT. To this end, this work proposes a challenging benchmark for LiDAR-based 3DSOT in adverse weather, which comprises two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) spanning three weather types: rain, fog, and snow. Based on this benchmark, five representative 3D trackers from different tracking frameworks were evaluated for robustness, revealing significant performance degradation. This prompts the question: what factors cause current advanced methods to fail on such adverse weather samples? Consequently, we explore the impacts of adverse weather and answer the above question from three perspectives: 1) target distance; 2) template shape corruption; and 3) target shape corruption. Finally, based on domain randomization and contrastive learning, we designed a dual-branch tracking framework for adverse weather, named DRCT, achieving excellent performance on the proposed benchmark.
https://arxiv.org/abs/2501.07133
In recent years, point cloud upsampling has been widely applied in fields such as 3D reconstruction. Our study investigates the factors influencing point cloud upsampling on both global and local levels through representation learning. Specifically, we input the global and local information of the same point cloud model object into two encoders to extract these features, fuse them, and then feed the combined features into an upsampling decoder. The goal is to address issues of sparsity and noise in point clouds by leveraging prior knowledge from both global and local inputs. The proposed framework can be applied to any state-of-the-art point cloud upsampling neural network. Experiments were conducted on a series of autoencoder-based models utilizing deep learning, yielding interpretability for both global and local inputs, and the results show that our proposed framework can further improve the upsampling effect of previous SOTA works. At the same time, the Saliency Map reflects the differences between global and local feature inputs, as well as the effectiveness of training with both inputs in parallel.
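The two-encoder idea can be pictured with the toy PyTorch sketch below: a global branch and a local branch each pool per-point features, the pooled descriptors are fused, and a small decoder expands the local patch by a fixed ratio. Layer sizes and the offset-based decoder are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

class GlobalLocalUpsampler(nn.Module):
    """Two-branch upsampling sketch: global and local encoders, fused features,
    and a toy decoder that predicts `ratio` offsets per local point."""
    def __init__(self, dim=128, ratio=4):
        super().__init__()
        self.ratio = ratio
        make_enc = lambda: nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
        self.global_enc, self.local_enc = make_enc(), make_enc()
        self.decoder = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                     nn.Linear(256, ratio * 3))

    def forward(self, global_pts, local_pts):            # (B, N, 3) and (B, M, 3)
        g = self.global_enc(global_pts).max(dim=1).values
        l = self.local_enc(local_pts).max(dim=1).values
        offsets = self.decoder(torch.cat([g, l], dim=-1)).view(-1, 1, self.ratio, 3)
        dense = local_pts.unsqueeze(2) + 0.05 * torch.tanh(offsets)
        return dense.reshape(local_pts.shape[0], -1, 3)   # (B, M * ratio, 3)
```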
https://arxiv.org/abs/2501.07076
Achieving high-fidelity 3D reconstruction from monocular video remains challenging due to the inherent limitations of traditional methods like Structure-from-Motion (SfM) and monocular SLAM in accurately capturing scene details. While differentiable rendering techniques such as Neural Radiance Fields (NeRF) address some of these challenges, their high computational costs make them unsuitable for real-time applications. Additionally, existing 3D Gaussian Splatting (3DGS) methods often focus on photometric consistency, neglecting geometric accuracy and failing to exploit SLAM's dynamic depth and pose updates for scene refinement. We propose a framework integrating dense SLAM with 3DGS for real-time, high-fidelity dense reconstruction. Our approach introduces SLAM-Informed Adaptive Densification, which dynamically updates and densifies the Gaussian model by leveraging dense point clouds from SLAM. Additionally, we incorporate Geometry-Guided Optimization, which combines edge-aware geometric constraints and photometric consistency to jointly optimize the appearance and geometry of the 3DGS scene representation, enabling detailed and accurate SLAM mapping reconstruction. Experiments on the Replica and TUM-RGBD datasets demonstrate the effectiveness of our approach, achieving state-of-the-art results among monocular systems. Specifically, our method achieves a PSNR of 36.864, SSIM of 0.985, and LPIPS of 0.040 on Replica, representing improvements of 10.7%, 6.4%, and 49.4%, respectively, over the previous SOTA. On TUM-RGBD, our method outperforms the closest baseline by 10.2%, 6.6%, and 34.7% in the same metrics. These results highlight the potential of our framework in bridging the gap between photometric and geometric dense 3D scene representations, paving the way for practical and efficient monocular dense reconstruction.
https://arxiv.org/abs/2501.07015
In this paper, we present a large-scale fine-grained dataset using high-resolution images captured from locations worldwide. Compared to existing datasets, our dataset offers a significantly larger size and includes a higher level of detail, making it uniquely suited for fine-grained 3D applications. Notably, our dataset is built using drone-captured aerial imagery, which provides a more accurate perspective for capturing real-world site layouts and architectural structures. By reconstructing environments with these detailed images, our dataset supports applications such as the COLMAP format for Gaussian Splatting and the Structure-from-Motion (SfM) method. It is compatible with widely-used techniques including SLAM, Multi-View Stereo, and Neural Radiance Fields (NeRF), enabling accurate 3D reconstructions and point clouds. This makes it a benchmark for reconstruction and segmentation tasks. The dataset enables seamless integration with multi-modal data, supporting a range of 3D applications, from architectural reconstruction to virtual tourism. Its flexibility promotes innovation, facilitating breakthroughs in 3D modeling and analysis.
https://arxiv.org/abs/2501.06927
LiDAR and photogrammetry are active and passive remote sensing techniques for point cloud acquisition, respectively, offering complementary advantages but exhibiting strong heterogeneity. Due to the fundamental differences in sensing mechanisms, spatial distributions and coordinate systems, their point clouds exhibit significant discrepancies in density, precision, noise, and overlap. Coupled with the lack of ground truth for large-scale scenes, integrating the heterogeneous point clouds is a highly challenging task. This paper proposes a self-supervised registration network based on a masked autoencoder, focusing on heterogeneous LiDAR and photogrammetric point clouds. At its core, the method introduces a multi-scale masked training strategy to extract robust features from heterogeneous point clouds under self-supervision. To further enhance registration performance, a rotation-translation embedding module is designed to effectively capture the key features essential for accurate rigid transformations. Building upon the robust representations, a transformer-based architecture seamlessly integrates local and global features, fostering precise alignment across diverse point cloud datasets. The proposed method demonstrates strong feature extraction capabilities for both LiDAR and photogrammetric point clouds, addressing the challenges of acquiring ground truth at the scene level. Experiments conducted on two real-world datasets validate the effectiveness of the proposed method in solving heterogeneous point cloud registration problems.
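A masked pre-training input can be prepared roughly as below: group the cloud into patches at several scales and hide a fraction of patches per scale, leaving visible patches for the encoder and masked ones as reconstruction targets. Real pipelines group with farthest-point sampling and kNN rather than the random grouping used in this sketch, which only illustrates the multi-scale masking idea.

```python
import numpy as np

def multiscale_mask(points, patch_sizes=(64, 256), mask_ratio=0.6):
    """Multi-scale masking sketch for self-supervised pre-training on point clouds."""
    masked_views = []
    for patch in patch_sizes:
        n_patches = len(points) // patch
        order = np.random.permutation(len(points))[:n_patches * patch]
        patches = points[order].reshape(n_patches, patch, 3)
        mask_ids = np.random.choice(n_patches, int(mask_ratio * n_patches), replace=False)
        visible = np.delete(patches, mask_ids, axis=0)     # what the encoder sees
        target = patches[mask_ids]                         # what the decoder must reconstruct
        masked_views.append((visible.reshape(-1, 3), target.reshape(-1, 3), mask_ids))
    return masked_views
```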
https://arxiv.org/abs/2501.05669
Unmanned surface vehicles (USVs) and boats are increasingly important in maritime operations, yet their deployment is limited due to costly sensors and complexity. LiDAR, radar, and depth cameras are either costly, yield sparse point clouds, or are noisy, and they require extensive calibration. Here, we introduce a novel approach for approximate distance estimation in USVs using supervised object detection. We collected a dataset comprising images with manually annotated bounding boxes and corresponding distance measurements. Leveraging this data, we propose a specialized branch of an object detection model, not only to detect objects but also to predict their distances from the USV. This method offers a cost-efficient and intuitive alternative to conventional distance measurement techniques, aligning more closely with human estimation capabilities. We demonstrate its application in a marine assistance system that alerts operators to nearby objects such as boats, buoys, or other waterborne hazards.
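The specialized branch can be pictured as a detection head that, alongside class scores and box offsets, regresses a non-negative metric distance per detection; the sketch below uses assumed feature sizes and a Softplus output, and is not the authors' network.

```python
import torch
import torch.nn as nn

class DistanceAwareDetectionHead(nn.Module):
    """Detector head sketch with an extra branch predicting distance to the USV."""
    def __init__(self, in_dim=256, num_classes=4):
        super().__init__()
        self.cls_branch = nn.Linear(in_dim, num_classes)
        self.box_branch = nn.Linear(in_dim, 4)                      # (cx, cy, w, h) offsets
        self.dist_branch = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, 1), nn.Softplus())  # distance >= 0

    def forward(self, roi_features):                                # (num_boxes, in_dim)
        return (self.cls_branch(roi_features),
                self.box_branch(roi_features),
                self.dist_branch(roi_features).squeeze(-1))
```

The distance branch would be trained against the manually annotated distances, e.g. with a smooth-L1 term added to the usual detection losses.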
https://arxiv.org/abs/2501.05567
The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at this https URL.
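One way to realize the described land-cover and terrain-aware selection is a simple stratified draw over candidate tiles, bucketing by dominant land-cover class and binned elevation relief; the class codes, bin count, and per-stratum quota below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def stratified_tile_sampling(tile_landcover, tile_relief, tiles_per_stratum=50, bins=5):
    """Sample ALS tiles evenly across (land-cover class, elevation-relief bin) strata."""
    tile_landcover = np.asarray(tile_landcover)     # dominant land-cover class per tile
    relief = np.asarray(tile_relief)                # max - min elevation per tile (from a DEM)
    edges = np.quantile(relief, np.linspace(0, 1, bins + 1)[1:-1])
    relief_bin = np.digitize(relief, edges)
    selected = []
    for lc in np.unique(tile_landcover):
        for rb in range(bins):
            idx = np.flatnonzero((tile_landcover == lc) & (relief_bin == rb))
            if len(idx):
                take = min(tiles_per_stratum, len(idx))
                selected.extend(np.random.choice(idx, take, replace=False))
    return np.array(selected)
```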
https://arxiv.org/abs/2501.05095
3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we designed a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and 3D-GRES tasks, respectively.
https://arxiv.org/abs/2501.04995
Computer-aided design (CAD) tools empower designers to design and modify 3D models through a series of CAD operations, commonly referred to as a CAD sequence. In scenarios where digital CAD files are not accessible, reverse engineering (RE) has been used to reconstruct 3D CAD models. Recent advances have seen the rise of data-driven approaches for RE, with a primary focus on converting 3D data, such as point clouds, into 3D models in boundary representation (B-rep) format. However, obtaining 3D data poses significant challenges, and B-rep models do not reveal knowledge about the 3D modeling process of designs. To this end, our research introduces a novel data-driven approach with an Image2CADSeq neural network model. This model aims to reverse engineer CAD models by processing images as input and generating CAD sequences. These sequences can then be translated into B-rep models using a solid modeling kernel. Unlike B-rep models, CAD sequences offer enhanced flexibility to modify individual steps of model creation, providing a deeper understanding of the construction process of CAD models. To quantitatively and rigorously evaluate the predictive performance of the Image2CADSeq model, we have developed a multi-level evaluation framework for model assessment. The model was trained on a specially synthesized dataset, and various network architectures were explored to optimize the performance. The experimental and validation results show great potential for the model in generating CAD sequences from 2D image data.
https://arxiv.org/abs/2501.04928