SLAM is a foundational technique with broad applications in robotics and AR/VR. SLAM simulations can evaluate new concepts, but testing on resource-constrained devices, such as VR HMDs, faces two challenges: high computational cost and restricted access to sensor data. This work proposes a sparse framework that uses mesh geometry projections as features, which improves efficiency and circumvents direct sensor data access; we demonstrate these benefits in VR and through numerical evaluation, advancing SLAM research.
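As a toy illustration of using mesh geometry projections as features: vertices of meshes already known to the rendering engine can be projected into the image plane with a standard pinhole model, sidestepping raw camera access entirely. This is a generic sketch, not the paper's implementation; the intrinsics and function name are illustrative.

```python
def project_vertices(vertices, f=500.0, cx=320.0, cy=240.0):
    """Project 3D mesh vertices (camera frame, z forward) to 2D pixel coordinates.

    Pinhole model with illustrative intrinsics; vertices behind the camera
    are dropped. The resulting 2D points can serve as sparse features.
    """
    features = []
    for x, y, z in vertices:
        if z > 0:  # keep only vertices in front of the camera
            features.append((f * x / z + cx, f * y / z + cy))
    return features
```

A real system would additionally cull vertices outside the image bounds and track stable vertex IDs across frames for data association.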
https://arxiv.org/abs/2501.09600
This article presents a comparative analysis of mobile robot trajectories computed by various ROS-based SLAM systems. To this end, we developed a prototype of a mobile robot with common sensors: a 2D lidar, a monocular camera, and a ZED stereo camera. We then conducted experiments in a typical office environment, collected data from all sensors, and ran all tested SLAM systems on the acquired dataset. We studied the following SLAM systems: (a) 2D lidar-based: GMapping, Hector SLAM, Cartographer; (b) monocular camera-based: Large-Scale Direct monocular SLAM (LSD SLAM), ORB SLAM, Direct Sparse Odometry (DSO); and (c) stereo camera-based: ZEDfu, Real-Time Appearance-Based Mapping (RTAB map), ORB SLAM, Stereo Parallel Tracking and Mapping (S-PTAM). Since all SLAM methods were tested on the same dataset, we compared their results using appropriate metrics, demonstrating encouraging results for the lidar-based Cartographer, monocular ORB SLAM, and stereo RTAB Map methods.
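One widely used metric for comparing such trajectories against ground truth is the absolute trajectory error (ATE). A minimal sketch with translation-only alignment (full evaluations typically use an SE(3) or Sim(3) Umeyama alignment):

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE of position error between time-aligned trajectories.

    Alignment here removes only the mean translation offset; rotation and
    scale alignment are omitted for brevity.
    """
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    errors = np.linalg.norm(gt - est_aligned, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```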
https://arxiv.org/abs/2501.09490
Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement, integrating computer vision technologies and multi-modal Simultaneous Localization and Mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements using the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization to unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at real-world 3D scale. Furthermore, crack geometric attributes are measured automatically and directly in 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.
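To see why image-space measurement breaks down on curved geometry, consider a naive 2D width measure that assumes a constant metric pixel size, an assumption that only holds for flat, fronto-parallel surfaces. This is a hypothetical sketch of the conventional baseline; the paper instead measures directly in the 3D point cloud.

```python
import numpy as np

def crack_widths_2d(mask, pixel_size_mm=1.0):
    """Per-row crack width (mm) from a binary mask, assuming constant pixel size.

    On curved or oblique surfaces the true mm-per-pixel varies across the
    image, which is exactly the limitation that 3D point-cloud measurement
    avoids.
    """
    mask = np.asarray(mask)
    return [int(row.sum()) * pixel_size_mm for row in mask if row.any()]
```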
https://arxiv.org/abs/2501.09203
Current visual SLAM systems face significant challenges in balancing computational efficiency with robust loop closure handling. Traditional approaches require careful manual tuning and incur substantial computational overhead, while learning-based methods either lack explicit loop closure capabilities or implement them through computationally expensive means. We present AutoLoop, a novel approach that combines automated curriculum learning with efficient fine-tuning for visual SLAM systems. Our method employs a DDPG (Deep Deterministic Policy Gradient) agent to dynamically adjust loop closure weights during training, eliminating the need for manual hyperparameter search while significantly reducing the required training steps. The approach pre-computes potential loop closure pairs offline and leverages them through an agent-guided curriculum, allowing the model to adapt efficiently to new scenarios. The model is trained on TartanAir and validated across multiple benchmarks, including KITTI, EuRoC, ICL-NUIM, and TUM RGB-D; the experiments demonstrate that AutoLoop achieves comparable or superior performance while reducing training time by an order of magnitude compared to traditional approaches. AutoLoop provides a practical solution for rapid adaptation of visual SLAM systems, automating the weight tuning process that traditionally requires multiple manual iterations. Our results show that this automated curriculum strategy not only accelerates training but also maintains or improves the model's performance across diverse environmental conditions.
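The idea of an agent adjusting the loop-closure loss weight during training can be caricatured with a simple hill-climbing rule. This is a hypothetical stand-in for the paper's DDPG agent; the names, update rule, and observation (only the previous total loss) are all illustrative simplifications.

```python
def run_curriculum(tracking_losses, loop_losses, w0=0.5, step=0.1):
    """Nudge the loop-closure weight up when the weighted loss improves, down otherwise."""
    w, prev_total, history = w0, None, []
    for lt, ll in zip(tracking_losses, loop_losses):
        total = lt + w * ll  # weighted training objective
        if prev_total is not None and total > prev_total:
            w = max(0.0, w - step)  # objective got worse: de-emphasize the loop term
        else:
            w = min(1.0, w + step)  # objective improved: emphasize the loop term
        prev_total = total
        history.append(w)
    return history
```

A real DDPG agent instead learns a continuous policy from a richer training-state observation, but the control loop has the same shape: observe the objective, act on the weight.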
https://arxiv.org/abs/2501.09160
Targeting the notorious cumulative drift errors in NeRF SLAM, we propose a Semantic-guided Loop Closure with Shared Latent Code, dubbed SLC$^2$-SLAM. In particular, we argue that the latent codes stored in many NeRF SLAM systems are not fully exploited, as they are only used for better reconstruction. In this paper, we propose a simple yet effective way to detect potential loops using the same latent codes as local features. To further improve loop detection performance, we use semantic information, which is also decoded from the same latent codes, to guide the aggregation of local features. Finally, with the potential loops detected, we close them with a graph optimization followed by bundle adjustment to refine both the estimated poses and the reconstructed scene. To evaluate the performance of SLC$^2$-SLAM, we conduct extensive experiments on the Replica and ScanNet datasets. Our proposed semantic-guided loop closure significantly outperforms the pre-trained NetVLAD and ORB combined with Bag-of-Words, which are used in the other NeRF SLAM systems with loop closure. As a result, SLC$^2$-SLAM also demonstrates better tracking and reconstruction performance, especially in larger scenes with more loops, such as ScanNet.
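The closing step, spreading accumulated drift through graph optimization, can be illustrated on a toy 1D pose graph solved as a linear least-squares problem. This is a generic sketch, not the paper's optimizer; real systems optimize SE(3) poses with robust kernels.

```python
import numpy as np

def optimize_pose_graph(n, odometry, loops):
    """Solve min sum (x_j - x_i - z_ij)^2 over 1D poses, softly anchoring x_0 = 0.

    `odometry` gives relative measurements between consecutive poses;
    `loops` is a list of (i, j, z_ij) loop-closure constraints.
    """
    edges = [(i, i + 1, z) for i, z in enumerate(odometry)] + list(loops)
    A = np.zeros((len(edges) + 1, n))
    b = np.zeros(len(edges) + 1)
    for r, (i, j, z) in enumerate(edges):
        A[r, i], A[r, j], b[r] = -1.0, 1.0, z  # residual: x_j - x_i - z
    A[-1, 0] = 1.0  # gauge constraint: pin the first pose near the origin
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x
```

With odometry [1, 1, 1] and a loop closure saying pose 3 coincides with pose 0, the three units of drift are spread evenly across the chain instead of piling up at the end.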
https://arxiv.org/abs/2501.08880
In recent years, 3D Gaussian splatting (3D-GS) has emerged as a novel scene representation approach. However, existing vision-only 3D-GS methods often rely on hand-crafted heuristics for point-cloud densification, and they face challenges in handling occlusions as well as high GPU memory and computation consumption. The LiDAR-Inertial-Visual (LIV) sensor configuration has demonstrated superior performance in localization and dense mapping by leveraging complementary sensing characteristics: rich texture information from cameras, precise geometric measurements from LiDAR, and high-frequency motion data from the IMU. Inspired by this, we propose a novel real-time Gaussian-based simultaneous localization and mapping (SLAM) system. Our map system comprises a global Gaussian map and a sliding window of Gaussians, along with IESKF-based odometry. The global Gaussian map consists of hash-indexed voxels organized in a recursive octree, effectively covering sparse spatial volumes while adapting to different levels of detail and scales. The Gaussian map is initialized through multi-sensor fusion and optimized with photometric gradients. Our system incrementally maintains a sliding window of Gaussians, significantly reducing GPU computation and memory consumption by optimizing only the map within the sliding window. Moreover, we implement a tightly coupled multi-sensor fusion odometry with an iterative error state Kalman filter (IESKF), leveraging real-time updating and rendering of the Gaussian map. Our system is the first real-time Gaussian-based SLAM framework deployable on resource-constrained embedded systems, demonstrated on the NVIDIA Jetson Orin NX platform. The framework achieves real-time performance while maintaining robust multi-sensor fusion capabilities. All implementation algorithms, hardware designs, and CAD models will be publicly available.
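The hash-indexed voxel idea can be sketched with a flat dictionary keyed by integer voxel coordinates. This is a minimal stand-in for the paper's recursive-octree organization; only a single resolution level is shown, and names are illustrative.

```python
import numpy as np

def voxel_key(point, size=0.5):
    """Integer voxel index of a 3D point, suitable as a hash key."""
    return tuple(int(np.floor(c / size)) for c in point)

def build_voxel_map(points, size=0.5):
    """Group points into a sparse, hash-indexed voxel grid (dict as hash table).

    Only occupied voxels consume memory, which is what makes hash indexing
    attractive for sparse spatial volumes.
    """
    grid = {}
    for p in points:
        grid.setdefault(voxel_key(p, size), []).append(p)
    return grid
```

An octree variant would additionally subdivide occupied voxels recursively, giving the level-of-detail adaptivity the abstract describes.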
https://arxiv.org/abs/2501.08672
Localization within a known environment is a crucial capability for mobile robots. Simultaneous Localization and Mapping (SLAM) is a prominent solution to this problem. SLAM is a framework that consists of a diverse set of computational tasks ranging from real-time tracking to computation-intensive map optimization. This combination can present a challenge for resource-limited mobile robots. Previously, edge-assisted SLAM methods have demonstrated promising real-time execution capabilities by offloading heavy computations while performing real-time tracking onboard. However, the common approach of utilizing a client-server architecture for offloading is sensitive to server and network failures. In this article, we propose a novel edge-assisted SLAM framework capable of self-organizing fully distributed SLAM execution across a network of devices or functioning on a single device without connectivity. The architecture consists of three layers and is designed to be device-agnostic, resilient to network failures, and minimally invasive to the core SLAM system. We have implemented and demonstrated the framework for monocular ORB SLAM3 and evaluated it in both fully distributed and standalone SLAM configurations against the ORB SLAM3. The experiment results demonstrate that the proposed design matches the accuracy and resource utilization of the monolithic approach while enabling collaborative execution.
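The resilience property, offload when a peer is reachable and run onboard otherwise, boils down to a fallback pattern like the following. This is a generic sketch; the actual three-layer architecture involves peer discovery and self-organization not shown here.

```python
def execute(task, remote=None, local=None):
    """Run `task` on a remote peer when possible, falling back to onboard execution.

    `remote` and `local` are callables; network trouble surfaces as
    ConnectionError / TimeoutError and triggers the local fallback, so the
    system degrades gracefully to standalone operation.
    """
    if remote is not None:
        try:
            return remote(task)
        except (ConnectionError, TimeoutError):
            pass  # peer or network failed: degrade to standalone mode
    return local(task)
```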
https://arxiv.org/abs/2501.08629
VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.
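The NVS-based loop-closure cue amounts to asking how well a view rendered from the map matches the live frame; a zero-mean normalized cross-correlation is one simple photometric score for that comparison. This is an illustrative sketch, not the paper's actual detector.

```python
import numpy as np

def nvs_loop_score(rendered, observed):
    """Zero-mean normalized cross-correlation between a rendered view and a live frame.

    Scores near 1 suggest the camera is revisiting a mapped region
    (a loop-closure candidate); scores near 0 suggest an unrelated view.
    """
    r = np.asarray(rendered, float).ravel()
    o = np.asarray(observed, float).ravel()
    r, o = r - r.mean(), o - o.mean()
    denom = np.linalg.norm(r) * np.linalg.norm(o)
    return float(np.dot(r, o) / denom) if denom else 0.0
```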
https://arxiv.org/abs/2501.08286
Consistent maps are key for most autonomous mobile robots. They often use SLAM approaches to build such maps. Loop closures via place recognition help maintain accurate pose estimates by mitigating global drift. This paper presents a robust loop closure detection pipeline for outdoor SLAM with LiDAR-equipped robots. The method handles various LiDAR sensors with different scanning patterns, fields of view, and resolutions. It generates local maps from LiDAR scans and aligns them using a ground alignment module to handle both planar and non-planar motion of the LiDAR, ensuring applicability across platforms. The method uses density-preserving bird's eye view projections of these local maps and extracts ORB feature descriptors from them for place recognition. It stores the feature descriptors in a binary search tree for efficient retrieval, and self-similarity pruning addresses perceptual aliasing in repetitive environments. Extensive experiments on public and self-recorded datasets demonstrate accurate loop closure detection, long-term localization, and cross-platform multi-map alignment, agnostic to LiDAR scanning patterns, fields of view, and motion profiles.
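The density-preserving bird's-eye-view projection can be sketched as per-cell point counting rather than binary occupancy, which is what keeps enough texture in the image for ORB descriptors to stay distinctive. A simplified sketch; grid size and resolution are illustrative.

```python
import numpy as np

def bev_density(points, res=0.5, size=8):
    """Density-preserving bird's-eye view: per-cell point counts, not just occupancy.

    The grid is centered on the sensor; `res` is the cell edge length in
    meters. Height (z) is deliberately discarded by the projection.
    """
    img = np.zeros((size, size), dtype=np.int32)
    for x, y, _z in points:
        i = int(np.floor(x / res)) + size // 2
        j = int(np.floor(y / res)) + size // 2
        if 0 <= i < size and 0 <= j < size:
            img[i, j] += 1
    return img
```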
https://arxiv.org/abs/2501.07399
Achieving high-fidelity 3D reconstruction from monocular video remains challenging due to the inherent limitations of traditional methods like Structure-from-Motion (SfM) and monocular SLAM in accurately capturing scene details. While differentiable rendering techniques such as Neural Radiance Fields (NeRF) address some of these challenges, their high computational costs make them unsuitable for real-time applications. Additionally, existing 3D Gaussian Splatting (3DGS) methods often focus on photometric consistency, neglecting geometric accuracy and failing to exploit SLAM's dynamic depth and pose updates for scene refinement. We propose a framework integrating dense SLAM with 3DGS for real-time, high-fidelity dense reconstruction. Our approach introduces SLAM-Informed Adaptive Densification, which dynamically updates and densifies the Gaussian model by leveraging dense point clouds from SLAM. Additionally, we incorporate Geometry-Guided Optimization, which combines edge-aware geometric constraints and photometric consistency to jointly optimize the appearance and geometry of the 3DGS scene representation, enabling detailed and accurate SLAM mapping reconstruction. Experiments on the Replica and TUM-RGBD datasets demonstrate the effectiveness of our approach, achieving state-of-the-art results among monocular systems. Specifically, our method achieves a PSNR of 36.864, SSIM of 0.985, and LPIPS of 0.040 on Replica, representing improvements of 10.7%, 6.4%, and 49.4%, respectively, over the previous SOTA. On TUM-RGBD, our method outperforms the closest baseline by 10.2%, 6.6%, and 34.7% in the same metrics. These results highlight the potential of our framework in bridging the gap between photometric and geometric dense 3D scene representations, paving the way for practical and efficient monocular dense reconstruction.
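For reference, the PSNR figures quoted above follow the standard definition; a minimal implementation:

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images with intensities in [0, max_val]."""
    mse = np.mean((np.asarray(img_a, float) - np.asarray(img_b, float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(max_val ** 2 / mse))
```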
https://arxiv.org/abs/2501.07015
In this paper, we present a large-scale fine-grained dataset using high-resolution images captured from locations worldwide. Compared to existing datasets, our dataset offers a significantly larger size and includes a higher level of detail, making it uniquely suited for fine-grained 3D applications. Notably, our dataset is built using drone-captured aerial imagery, which provides a more accurate perspective for capturing real-world site layouts and architectural structures. By reconstructing environments with these detailed images, our dataset supports applications such as the COLMAP format for Gaussian Splatting and the Structure-from-Motion (SfM) method. It is compatible with widely-used techniques including SLAM, Multi-View Stereo, and Neural Radiance Fields (NeRF), enabling accurate 3D reconstructions and point clouds. This makes it a benchmark for reconstruction and segmentation tasks. The dataset enables seamless integration with multi-modal data, supporting a range of 3D applications, from architectural reconstruction to virtual tourism. Its flexibility promotes innovation, facilitating breakthroughs in 3D modeling and analysis.
https://arxiv.org/abs/2501.06927
Neural implicit representations have recently shown promising progress in dense Simultaneous Localization and Mapping (SLAM). However, existing works fall short in reconstruction quality and real-time performance, mainly due to an inflexible scene representation strategy that does not leverage any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from single-frame depth images are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between high-quality geometric texture mapping and minimal memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality while running at a significantly faster speed.
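The memory argument for tri-planes is easy to quantify: three 2D feature planes grow quadratically with resolution, whereas a dense 3D feature volume grows cubically. A back-of-envelope sketch with illustrative sizes (not the paper's actual configuration):

```python
def feature_param_counts(res=256, channels=16):
    """Parameter counts of a dense 3D feature grid vs. three axis-aligned planes."""
    dense_grid = res ** 3 * channels       # res x res x res x C
    tri_planes = 3 * res ** 2 * channels   # (xy + xz + yz) planes, each res x res x C
    return dense_grid, tri_planes
```

At 256^3 with 16 channels, the dense grid needs roughly 268M parameters while the tri-planes need about 3.1M, an ~85x reduction for the same nominal resolution.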
https://arxiv.org/abs/2501.06469
3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis in Simultaneous Localization and Mapping (SLAM). However, existing SLAM methods utilizing 3DGS have failed to provide high-quality novel view rendering for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods perform well for RGB-D cameras but suffer significant degradation in rendering quality for monocular cameras. In this paper, we present Scaffold-SLAM, which delivers simultaneous localization and high-quality photorealistic mapping across monocular, stereo, and RGB-D cameras. We introduce two key innovations to achieve this state-of-the-art visual quality. First, we propose Appearance-from-Motion embedding, enabling 3D Gaussians to better model image appearance variations across different camera poses. Second, we introduce a frequency regularization pyramid to guide the distribution of Gaussians, allowing the model to effectively capture finer details in the scene. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that Scaffold-SLAM significantly outperforms state-of-the-art methods in photorealistic mapping quality; e.g., PSNR is 16.76% higher on the TUM RGB-D dataset for monocular cameras.
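A frequency pyramid in the spirit described, with coarse levels carrying low frequencies and fine levels the details, can be built by repeated 2x average-pooling. This is a generic image-pyramid sketch; the regularization the paper applies on top of such a pyramid is not reproduced.

```python
import numpy as np

def average_pyramid(img, levels=3):
    """Multi-scale pyramid via repeated 2x2 average-pooling (low-pass per level)."""
    out = [np.asarray(img, float)]
    for _ in range(levels - 1):
        a = out[-1]
        h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2  # crop to even size
        out.append(a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return out
```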
https://arxiv.org/abs/2501.05242
Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter, two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational cost and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our finding that the representation learning of languages is conducted only in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only six layers' feed-forward sub-layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs, achieving superior average performance over all strong baselines across 10 languages. Meanwhile, SLAM involves only one training stage, reducing training time by a factor of 4.1-11.9 compared to the two-stage method.
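Selecting only the feed-forward sub-layers of a few lower layers for tuning is, mechanically, a filter over parameter names. A sketch with a hypothetical naming scheme ('layers.<i>.ffn.<w>'); real checkpoints use different key names.

```python
def select_trainable(param_names, tuned_layers=range(6)):
    """Keep only feed-forward sub-layer weights in the given (lower) layers.

    Assumes illustrative names like 'layers.<i>.ffn.<w>'; everything else,
    attention and embeddings included, stays frozen.
    """
    keep = []
    for name in param_names:
        parts = name.split(".")
        if (len(parts) >= 3 and parts[0] == "layers"
                and parts[1].isdigit() and int(parts[1]) in tuned_layers
                and parts[2] == "ffn"):
            keep.append(name)
    return keep
```

In a deep-learning framework one would then set `requires_grad` (or the equivalent) to true only for the selected names.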
https://arxiv.org/abs/2501.03681
Despite advances in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. This limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We decouple the task by reconstructing the hand motion in the camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories, even when the hands move out of the view frustum, we devise a novel motion infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation across different egocentric benchmark datasets. Code and models are available on this https URL .
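The role of the motion infiller, completing frames where the hands leave the view frustum, can be caricatured with linear interpolation over missing entries. This is a trivial stand-in for the learned network; real hand motion is far from linear, which is precisely why a learned infiller is needed.

```python
def infill_linear(traj):
    """Fill None gaps in a scalar trajectory by linear interpolation between known frames."""
    out = list(traj)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)          # fractional position inside the gap
            out[i] = out[a] * (1 - t) + out[b] * t
    return out
```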
https://arxiv.org/abs/2501.02973
The Iterative Closest Point (ICP) algorithm is a crucial component of LiDAR-based SLAM algorithms. However, its performance can be negatively affected in unstructured environments that lack features and geometric structures, leading to low accuracy and poor robustness in localization and mapping. It is known that degeneracy caused by the lack of geometric constraints can lead to errors in 6-DOF pose estimation along ill-conditioned directions. Therefore, there is a need for a broader and more fine-grained degeneracy detection and handling method. This paper proposes a new point cloud registration framework, LP-ICP, that combines point-to-line and point-to-plane distance metrics in the ICP algorithm, with localizability detection and handling. LP-ICP consists of a localizability detection module and an optimization module. The localizability detection module performs localizability analysis by utilizing the correspondences between edge points (with low local smoothness) to lines and planar points (with high local smoothness) to planes between the scan and the map. The localizability contribution of individual correspondence constraints can be applied to a broader range. The optimization module adds additional soft and hard constraints to the optimization equations based on the localizability category. This allows the pose to be constrained along ill-conditioned directions, with updates either tending towards the constraint value or leaving the initial estimate unchanged. This improves accuracy and reduces fluctuations. The proposed method is extensively evaluated through experiments on both simulation and real-world datasets, demonstrating higher or comparable accuracy than the state-of-the-art methods. The dataset and code of this paper will also be open-sourced at this https URL.
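The two residuals the framework combines are standard: for a point p, a line through q with unit direction d, and a plane through q with unit normal n. A minimal sketch of the distance metrics only, not of LP-ICP's localizability machinery.

```python
import numpy as np

def point_to_plane_residual(p, q, n):
    """Signed distance from p to the plane through q with unit normal n."""
    return float(np.dot(np.asarray(p, float) - np.asarray(q, float),
                        np.asarray(n, float)))

def point_to_line_residual(p, q, d):
    """Distance from p to the line through q with unit direction d."""
    v = np.asarray(p, float) - np.asarray(q, float)
    d = np.asarray(d, float)
    return float(np.linalg.norm(v - np.dot(v, d) * d))
```

In the pipeline described above, edge points (low local smoothness) feed the point-to-line term and planar points (high local smoothness) feed the point-to-plane term.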
https://arxiv.org/abs/2501.02580
LiDAR-based SLAM is recognized as an effective method to provide localization guidance in rough environments. However, off-the-shelf LiDAR-based SLAM methods suffer from significant pose estimation drift, particularly in components relevant to the vertical direction, when traversing uneven terrain. This deficiency typically leads to a conspicuously distorted global map. In this article, a LiDAR-based SLAM method is presented to improve the accuracy of pose estimation for ground vehicles in rough terrains, termed Rotation-Optimized LiDAR-Only (ROLO) SLAM. The method exploits a forward location prediction to coarsely eliminate the location difference of consecutive scans, thereby enabling separate and accurate determination of the location and orientation at the front-end. Furthermore, we adopt a parallel-capable spatial voxelization for correspondence matching. We develop a spherical alignment-guided rotation registration within each voxel to estimate the rotation of the vehicle. By incorporating geometric alignment, we introduce the motion constraint into the optimization formulation to enhance the rapid and effective estimation of LiDAR's translation. Subsequently, we extract several keyframes to construct the submap and exploit an alignment from the current scan to the submap for precise pose estimation. Meanwhile, a global-scale factor graph is established to aid in the reduction of cumulative errors. Diverse experiments have been conducted in various scenes to evaluate our method. The results demonstrate that ROLO-SLAM excels in pose estimation for ground vehicles and outperforms existing state-of-the-art LiDAR SLAM frameworks.
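The forward location prediction at the front-end can be sketched as a constant-velocity extrapolation from the last two scan locations, an illustrative simplification of the paper's scheme:

```python
def predict_location(prev_pos, curr_pos):
    """Constant-velocity prediction of the next scan's location.

    Extrapolates the displacement between the two most recent scans, which
    coarsely removes the location difference before rotation registration.
    """
    return [c + (c - p) for p, c in zip(prev_pos, curr_pos)]
```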
基于LiDAR的SLAM(同时定位与地图构建)被公认为是在复杂地形中提供定位指导的有效方法。然而,现成的基于LiDAR的SLAM方法在遇到崎岖不平的地面时会遭受显著的姿态估计漂移问题,尤其是垂直方向上的部分,这通常会导致全局地图明显扭曲。本文提出了一种改进型的基于LiDAR的SLAM方法——旋转优化的纯LiDAR(ROLO)SLAM,旨在提高地面车辆在崎岖地形中的姿态估计准确性。 该方法利用前方位置预测粗略消除连续扫描之间的位置差异,从而能够在前端独立且精确地确定位置和方向。此外,我们采用了一种能够并行处理的空间体素化方法来进行对应匹配,并开发了一种球形对齐引导的旋转注册技术,在每个体素内估计车辆的姿态变化。通过结合几何对齐,我们将运动约束引入优化公式中,以增强LiDAR平移的快速有效估算。随后,我们从多个关键帧构建子地图,并利用当前扫描与子地图之间的对准进行精确姿态估计。同时建立了一个全局规模的因素图,帮助减少累积误差。 在各种场景下进行了多样的实验来评估我们的方法。结果表明,ROLO-SLAM在地面车辆的姿态估计算法中表现出色,并超越了现有的最先进的基于LiDAR的SLAM框架。
https://arxiv.org/abs/2501.02166
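The front-end idea above — decouple translation from rotation by first predicting the new scan origin, then matching correspondences per voxel — can be sketched as follows. This is a hedged simplification under a constant-velocity assumption; the function names, voxel size, and hash-bucket voxelization are illustrative, not ROLO-SLAM's actual implementation.

```python
import numpy as np
from collections import defaultdict

def forward_predict(prev_pos, curr_pos, dt=1.0):
    """Constant-velocity forward prediction of the next scan origin.

    Subtracting this predicted translation from the incoming scan coarsely
    removes the location difference between consecutive scans, so the
    subsequent per-voxel registration mainly has to recover rotation.
    """
    prev_pos, curr_pos = np.asarray(prev_pos, float), np.asarray(curr_pos, float)
    velocity = (curr_pos - prev_pos) / dt               # finite-difference velocity
    return curr_pos + velocity * dt                     # extrapolated next origin

def voxelize(points, voxel_size=1.0):
    """Hash points into spatial voxels; buckets are independent, so
    correspondence matching over them is trivially parallelizable."""
    buckets = defaultdict(list)
    for p in np.asarray(points, dtype=float):
        key = tuple(np.floor(p / voxel_size).astype(int))
        buckets[key].append(p)
    return buckets
```

With the scan shifted by the predicted translation, a rotation-only registration (spherical alignment-guided in the paper) is then run inside each voxel bucket.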
As large language models (LLMs) like GPT-4 and Llama 3 become integral to educational contexts, concerns are mounting over the cultural biases, power imbalances, and ethical limitations embedded within these technologies. Though generative AI tools aim to enhance learning experiences, they often reflect values rooted in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural paradigms, potentially sidelining diverse global perspectives. This paper proposes a framework to assess and mitigate cultural bias within LLMs through the lens of applied multiplexity. Multiplexity, inspired by Senturk et al. and rooted in Islamic and other wisdom traditions, emphasizes the coexistence of diverse cultural viewpoints, supporting a multi-layered epistemology that integrates both empirical sciences and normative values. Our analysis reveals that LLMs frequently exhibit cultural polarization, with biases appearing in both overt responses and subtle contextual cues. To address inherent biases and incorporate multiplexity in LLMs, we propose two strategies: Contextually-Implemented Multiplex LLMs, which embed multiplex principles directly into the system prompt, influencing LLM outputs at a foundational level and independent of individual prompts, and Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents, each representing distinct cultural viewpoints, collaboratively generate a balanced, synthesized response. Our findings demonstrate that as mitigation strategies evolve from contextual prompting to MAS-implementation, cultural inclusivity markedly improves, evidenced by a significant rise in the Perspectives Distribution Score (PDS) and a PDS Entropy increase from 3.25% at baseline to 98% with the MAS-Implemented Multiplex LLMs. Sentiment analysis further shows a shift towards positive sentiment across cultures,...
随着大型语言模型(LLM)如GPT-4和Llama 3在教育环境中变得越来越重要,人们对这些技术中嵌入的文化偏见、权力不平衡以及伦理限制的担忧也日益加剧。虽然生成式AI工具旨在增强学习体验,但它们往往反映出西方、受过教育、工业化、富裕且民主(WEIRD)文化范式的价值观,这可能会忽视全球多样化的视角。本文提出了一种通过应用多重性来评估和减轻LLM中文化偏见的框架。多重性的概念由Senturk等人启发,并植根于伊斯兰教及其他智慧传统,强调多种文化观点并存的重要性,支持一种将实证科学与规范价值相结合的多层次知识论。我们的分析揭示了LLMs经常表现出的文化极化现象,在它们的回答和微妙的情境暗示中都存在偏见。 为了应对这些固有的偏见并在LLM中融入多重性,我们提出了两种策略: - **情境实施的多重语言模型**(Contextually-Implemented Multiplex LLMs):该方法将多重性原则直接嵌入到系统提示中,在基础层面影响LLM输出,并且独立于个别提示。 - **多代理系统(MAS)实现的多重语言模型**(Multi-Agent System (MAS)-Implemented Multiplex LLMs):在这一策略下,多个代表不同文化视角的LLM代理协同生成一个平衡且综合的回答。 我们的研究发现表明,在从情境提示到多代理系统实施的过程中,文化包容性显著提高。这通过视角分布分数(PDS)和视角熵值由基准时的3.25%大幅上升至98%,得以证明。情感分析进一步显示,随着策略的进步,不同文化之间的积极情绪倾向也有所提升。 这一框架及提出的策略旨在为教育和技术领域的实践者提供一个实用工具,以促进更加包容、多元化且伦理上负责任的技术设计和应用。
https://arxiv.org/abs/2501.03259
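The PDS Entropy figures quoted above (3.25% at baseline vs 98% with the MAS variant) suggest a normalized entropy over the distribution of cultural perspectives in a response. The paper's exact definition is not reproduced here, so the sketch below assumes normalized Shannon entropy expressed as a percentage; the function name and input format are assumptions.

```python
import math

def pds_entropy(perspective_counts):
    """Normalized Shannon entropy (as a percentage) of a perspective
    distribution: 100% means all perspectives are represented uniformly,
    while values near 0% indicate a single dominant cultural viewpoint."""
    total = sum(perspective_counts)
    probs = [c / total for c in perspective_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)       # Shannon entropy (nats)
    max_entropy = math.log(len(perspective_counts))      # uniform-distribution bound
    return 100.0 * entropy / max_entropy
```

Under this reading, a response drawing almost exclusively on one cultural frame scores low, and the rise to 98% corresponds to near-uniform coverage of the tracked perspectives.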
Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (this https URL)
理解从连续视频数据中提取三维场景的几何、语义和实例信息对于机器人技术和增强现实应用至关重要。然而,现有的同步定位与地图构建(SLAM)方法通常只关注几何或语义重建中的一个方面。在这篇论文中,我们介绍了PanoSLAM,这是一个首次将几何重建、3D语义分割和3D实例分割整合到统一框架内的SLAM系统。我们的方法基于三维高斯点阵技术,并进行了若干关键改进以实现从任意视角高效渲染深度、颜色、语义及实例信息的能力。 为了直接从连续的RGB-D视频中实现全景式的三维场景重建,我们提出了一种在线空间-时间提升(STL)模块,该模块能够将视觉模型生成的2D全景预测转化为3D高斯表示。这个STL模块通过在多视角输入之间细化伪标签来解决由于标签噪声和不一致性导致的问题,并创建出一个连贯一致的三维表示,从而提高分割精度。 实验表明,PanoSLAM在地图构建和跟踪准确性方面优于最近的语义SLAM方法。首次实现了从RGB-D视频直接完成开放世界环境中的全景式3D重建。
https://arxiv.org/abs/2501.00352
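At its simplest, the STL module's refinement of noisy 2D panoptic predictions can be approximated by a per-primitive majority vote over the labels each 3D primitive receives when projected into the different views. This is a deliberately reduced sketch of the idea, not PanoSLAM's actual STL module (which operates on 3D Gaussian representations); names and data layout are illustrative.

```python
from collections import Counter

def refine_pseudo_labels(labels_per_view):
    """Fuse noisy per-view 2D panoptic labels into one label per 3D primitive.

    labels_per_view: for each 3D primitive, the list of 2D labels it received
    across camera views. Majority voting suppresses single-view prediction
    noise, yielding a multi-view-consistent pseudo label.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in labels_per_view]
```

Cross-view agreement is exactly what makes the lifted 3D labels more consistent than any single 2D prediction: an outlier label in one frame is outvoted by the other viewpoints.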
Robotic applications require a comprehensive understanding of the scene. In recent years, neural fields-based approaches that parameterize the entire environment have become popular. These approaches are promising due to their continuous nature and their ability to learn scene priors. However, the use of neural fields in robotics becomes challenging when dealing with unknown sensor poses and sequential measurements. This paper focuses on the problem of sensor pose estimation for large-scale neural implicit SLAM. We investigate implicit mapping from a probabilistic perspective and propose hierarchical pose estimation with a corresponding neural network architecture. Our method is well-suited for large-scale implicit map representations. The proposed approach operates on consecutive outdoor LiDAR scans and achieves accurate pose estimation, while maintaining stable mapping quality for both short and long trajectories. We built our method on a structured and sparse implicit representation suitable for large-scale reconstruction and evaluated it using the KITTI and MaiCity datasets. Our approach outperforms the baseline in terms of mapping with unknown poses and achieves state-of-the-art localization accuracy.
机器人应用需要对场景有全面的理解。近年来,基于神经场的方法因其能够参数化整个环境而变得流行起来。这些方法由于其连续性和学习场景先验的能力而展现出巨大的潜力。然而,在处理未知传感器姿态和顺序测量时,将神经场应用于机器人领域会遇到挑战。本文专注于大型规模的隐式SLAM(Simultaneous Localization and Mapping)中传感器姿态估计的问题。我们从概率的角度研究了隐式映射,并提出了一种层次化的姿态估计算法以及相应的神经网络架构。我们的方法非常适合大规模的隐式地图表示。 该提出的算法基于连续的室外激光雷达扫描操作,能够实现精确的姿态估计,同时在短距离和长距离轨迹上保持稳定的映射质量。我们建立的方法使用一种结构化且稀疏的隐式表征以适应大规模重建,并通过KITTI和MaiCity数据集进行了评估。我们的方法在处理未知姿态的映射方面超越了基线算法,在定位准确性方面达到了最先进的水平。
https://arxiv.org/abs/2412.20976
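The hierarchical pose estimation idea above — refine the sensor pose coarse-to-fine against a residual defined by the map — can be sketched with a toy block-coordinate search. In the paper the objective comes from the neural implicit map (and the hierarchy from the network architecture); here it is an arbitrary callable, and all names and parameters are illustrative assumptions.

```python
import numpy as np

def hierarchical_pose_search(objective, init_pose, levels=3, span=1.0, steps=5):
    """Coarse-to-fine pose refinement: each level halves the search span
    around the current best estimate, mimicking hierarchical pose estimation
    where the objective would be the implicit map's residual (e.g. the SDF
    error of the registered scan)."""
    pose = np.asarray(init_pose, dtype=float)
    for _ in range(levels):
        offsets = np.linspace(-span, span, steps)
        # Search each pose dimension independently (block-coordinate descent).
        for axis in range(pose.size):
            candidates = [pose + off * np.eye(pose.size)[axis] for off in offsets]
            pose = min(candidates, key=objective)
        span /= 2.0  # finer resolution at the next level
    return pose
```

The coarse levels keep the search robust to a poor initial guess (long trajectories, large scan-to-scan motion), while the fine levels recover accuracy, which is the trade-off hierarchical estimation targets.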