We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work. this https URL
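As a concrete illustration of the final composition step, the sketch below chains the metric-scale camera-to-world poses with per-frame human-in-camera poses using plain SE(3) multiplication. It is a minimal numpy sketch under an assumed 4x4 homogeneous-matrix convention, not TRAM's actual code; the function names and toy values are illustrative.

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def compose_global_human_poses(T_world_cam, T_cam_human):
    """Compose per-frame camera-to-world poses with human-in-camera poses.

    T_world_cam: list of 4x4 camera poses in the world frame (from robustified SLAM,
                 already at metric scale).
    T_cam_human: list of 4x4 human root poses in the camera frame (from the video
                 body-motion regressor).
    Returns the human root pose in the world frame for every frame.
    """
    return [Twc @ Tch for Twc, Tch in zip(T_world_cam, T_cam_human)]

# Toy example: a camera translating along +x while the person stays 2 m in front of it.
frames = 5
T_world_cam = [se3(np.eye(3), np.array([0.5 * i, 0.0, 0.0])) for i in range(frames)]
T_cam_human = [se3(np.eye(3), np.array([0.0, 0.0, 2.0]))] * frames
T_world_human = compose_global_human_poses(T_world_cam, T_cam_human)
print(np.array([T[:3, 3] for T in T_world_human]))  # world-frame root trajectory
```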
https://arxiv.org/abs/2403.17346
Perception tasks play a crucial role in the development of automated operations and systems across multiple application fields. In the railway transportation domain, these tasks can improve the safety, reliability, and efficiency of various operations, including train localization, signal recognition, and track discrimination. However, collecting large, precisely labeled datasets for testing such novel algorithms poses extreme challenges in the railway environment due to the severe restrictions on accessing the infrastructure and the practical difficulties of properly equipping trains with the required sensors, such as cameras and LiDARs. The remarkable innovations of graphic engine tools offer new solutions to craft realistic synthetic datasets. To illustrate the advantages of employing graphic simulation for early-stage testing of perception tasks in the railway domain, this paper presents a comparative analysis of the performance of a SLAM algorithm applied both in a virtual synthetic environment and a real-world scenario. The analysis leverages virtual railway environments created with the latest version of Unreal Engine, facilitating data collection and allowing the examination of challenging scenarios, including low visibility, dangerous operational modes, and complex environments. The results highlight the feasibility and potential of graphic simulation to advance perception tasks in the railway domain.
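For early-stage comparisons like the one described, the usual metric is absolute trajectory error (ATE) after rigid alignment of the estimated and ground-truth trajectories. The following is a generic sketch of that computation (Kabsch/SVD alignment, then RMSE), not the paper's evaluation code; the trajectory variable names are assumptions.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error (RMSE) after rigid (rotation + translation) alignment.

    est_xyz, gt_xyz: (N, 3) arrays of time-associated estimated / ground-truth positions.
    """
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)                       # cross-covariance SVD
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) # reflection guard
    R = (U @ S @ Vt).T                                      # rotation mapping est -> gt
    t = mu_g - R @ mu_e
    aligned = est_xyz @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))

# Usage: one RMSE per sequence, e.g. ate_rmse(slam_traj_virtual, gt_virtual)
#        versus ate_rmse(slam_traj_real, gt_real), then compare the two.
```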
https://arxiv.org/abs/2403.17084
Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wild, thereby facilitating effective off-road traversal. However, the lack of multi-modal perception across various motion patterns hinders Simultaneous Localization And Mapping (SLAM) solutions, especially when confronting non-geometric hazards in demanding landscapes. In this paper, we first propose a Terrain-Aware multI-modaL (TAIL) dataset tailored to deformable and sandy terrains. It incorporates various types of robotic proprioception and distinct ground interactions, providing unique challenges and a benchmark for multi-sensor fusion SLAM. The versatile sensor suite comprises stereo frame cameras, multiple ground-pointing RGB-D cameras, a rotating 3D LiDAR, an IMU, and an RTK device. This ensemble is hardware-synchronized, well-calibrated, and self-contained. Using both wheeled and quadrupedal locomotion, we efficiently collect comprehensive sequences that capture rich unstructured scenarios, spanning a spectrum of scope, terrain interactions, scene changes, ground-level properties, and dynamic robot characteristics. We benchmark several state-of-the-art SLAM methods against ground truth and provide performance validations. Corresponding challenges and limitations are also reported. All associated resources are accessible upon request at \url{this https URL}.
https://arxiv.org/abs/2403.16875
Recently, neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents CG-SLAM, an efficient dense RGB-D SLAM system based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: this https URL.
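The abstract does not spell out the depth uncertainty model, so the sketch below only illustrates the general idea of culling high-uncertainty Gaussian primitives before tracking, using a made-up per-primitive depth-variance proxy and thresholds. It should not be read as CG-SLAM's actual formulation.

```python
import numpy as np

def select_reliable_gaussians(depth_mean, depth_var, var_thresh=0.01, keep_frac=0.8):
    """Keep Gaussian primitives whose depth uncertainty is low.

    depth_mean, depth_var: (N,) per-primitive depth statistics accumulated during mapping.
    Returns a boolean mask of primitives to use for tracking.
    This is an illustrative proxy, not CG-SLAM's actual uncertainty model.
    """
    mask = depth_var < var_thresh
    if mask.sum() < keep_frac * len(depth_var):
        # Fall back to the keep_frac lowest-variance primitives so tracking stays constrained.
        k = int(keep_frac * len(depth_var))
        idx = np.argsort(depth_var)[:k]
        mask = np.zeros_like(mask)
        mask[idx] = True
    return mask

rng = np.random.default_rng(0)
var = rng.gamma(shape=2.0, scale=0.005, size=10_000)
mask = select_reliable_gaussians(np.ones(10_000), var)
print(mask.mean())  # fraction of primitives retained for the tracking step
```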
https://arxiv.org/abs/2403.16095
Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera rotation along with uncertainty from uncalibrated RGB images. Using a Manhattan World assumption, our method leverages the per-pixel geometric priors encoded in single-image surface normal predictions and performs optimisation over the SO(3) manifold. Given a sequence of images, we can use the per-frame rotation estimates and their uncertainty to perform multi-frame optimisation, achieving robustness and temporal consistency. Our experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and is more robust than sparse feature-based SLAM methods. We encourage the reader to view the accompanying video at this https URL for a visual overview of our method.
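A minimal version of the Manhattan World idea is to assign each predicted surface normal to its nearest canonical world axis and re-solve the rotation with a Kabsch/SVD step, alternating the two. The sketch below does exactly that in numpy; U-ARE-ME additionally weights normals by predicted uncertainty and optimises on the SO(3) manifold, which is not reproduced here.

```python
import numpy as np

CANONICAL = np.array([[ 1, 0, 0], [-1, 0, 0],
                      [ 0, 1, 0], [ 0, -1, 0],
                      [ 0, 0, 1], [ 0, 0, -1]], dtype=float)

def rotation_from_normals(normals, iters=5):
    """Estimate a world-from-camera rotation under a Manhattan World assumption.

    normals: (N, 3) unit surface normals predicted for one image (camera frame).
    Alternates assigning each normal to the nearest canonical world axis and
    re-solving the rotation with a Kabsch/SVD step.
    """
    R = np.eye(3)                                   # world-from-camera rotation estimate
    for _ in range(iters):
        world = normals @ R.T                       # rotate normals into the world frame
        targets = CANONICAL[np.argmax(world @ CANONICAL.T, axis=1)]  # nearest axis
        H = normals.T @ targets                     # cross-covariance (camera -> world)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                          # best rotation mapping normals -> targets
    return R
```

Starting from the identity and iterating a handful of times is usually enough when the image is dominated by Manhattan-aligned surfaces.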
https://arxiv.org/abs/2403.15583
Precise camera tracking, high-fidelity 3D tissue reconstruction, and real-time online visualization are critical for intrabody medical imaging devices such as endoscopes and capsule robots. However, existing SLAM (Simultaneous Localization and Mapping) methods often struggle to achieve both complete, high-quality surgical field reconstruction and efficient computation, restricting their intraoperative application in endoscopic surgeries. In this paper, we introduce EndoGSLAM, an efficient SLAM approach for endoscopic surgeries, which integrates streamlined Gaussian representation and differentiable rasterization to achieve over 100 fps rendering speed during online camera tracking and tissue reconstruction. Extensive experiments show that EndoGSLAM achieves a better trade-off between intraoperative availability and reconstruction quality than traditional or neural SLAM approaches, showing tremendous potential for endoscopic surgeries. The project page is at this https URL
https://arxiv.org/abs/2403.15124
Sub-symbolic artificial intelligence methods dominate the fields of environment-type classification and Simultaneous Localisation and Mapping. However, a significant area overlooked within these fields is solution transparency in the human-machine interaction space, as the sub-symbolic methods employed for map generation do not account for the explainability of the solutions generated. This paper proposes a novel approach to environment-type classification through Symbolic Simultaneous Localisation and Mapping, SymboSLAM, to bridge the explainability gap. Our method classifies environment types through ontological reasoning, synthesising the context of an environment from the features found within it. We achieve explainability within the model by presenting operators with environment-type classifications overlaid on a semantically labelled occupancy map of landmarks and features. We evaluate SymboSLAM with ground-truth maps of the Canberra region, demonstrating the method's effectiveness, and we assess the system through both simulations and real-world trials.
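To make the symbolic flavour concrete, here is a toy, rule-based version of environment-type classification over a semantically labelled map: each environment type is defined by the landmark classes it requires, and the returned explanation keeps the decision transparent to an operator. The ontology and labels below are invented for illustration and are far simpler than SymboSLAM's.

```python
from collections import Counter

# Toy ontology: environment types defined by the semantic landmark classes they require.
# Illustrative only; SymboSLAM's ontology is richer than this.
ONTOLOGY = {
    "office":   {"desk", "chair", "monitor"},
    "corridor": {"door", "wall", "sign"},
    "carpark":  {"car", "pillar", "ramp"},
}

def classify_environment(semantic_map_labels):
    """Classify an environment type from the semantic labels present in an occupancy map.

    semantic_map_labels: iterable of class labels attached to occupied cells/landmarks.
    Returns (best_type, explanation) so the decision stays explainable to an operator.
    """
    present = set(semantic_map_labels)
    scores = {env: len(required & present) / len(required)
              for env, required in ONTOLOGY.items()}
    best = max(scores, key=scores.get)
    explanation = f"{best}: matched {sorted(ONTOLOGY[best] & present)} ({scores[best]:.0%} of rule)"
    return best, explanation

labels = Counter({"desk": 4, "chair": 9, "wall": 30, "monitor": 3}).elements()
print(classify_environment(labels))
```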
https://arxiv.org/abs/2403.15504
This survey paper presents a comprehensive overview of the latest advancements in the field of Simultaneous Localization and Mapping (SLAM) with a focus on the integration of symbolic representations of environment features. The paper synthesizes research trends in multi-agent systems (MAS) and human-machine teaming, highlighting their applications in both symbolic and sub-symbolic SLAM tasks. The survey emphasizes the evolution and significance of ontological designs and symbolic reasoning in creating sophisticated 2D and 3D maps of various environments. Central to this review is the exploration of different architectural approaches in SLAM, with a particular interest in the functionalities and applications of edge and control agent architectures in MAS settings. This study acknowledges the growing demand for enhanced human-machine collaboration in mapping tasks and examines how these collaborative efforts improve the accuracy and efficiency of environmental mapping.
https://arxiv.org/abs/2405.01398
Many LiDAR place recognition systems have been developed and tested specifically for urban driving scenarios. Their performance in natural environments such as forests and woodlands has been studied less closely. In this paper, we analyzed the capabilities of four different LiDAR place recognition systems, both handcrafted and learning-based, using LiDAR data collected with a handheld device and a legged robot within dense forest environments. In particular, we focused on evaluating localization where there is a significant translational and orientation difference between corresponding LiDAR scan pairs. This is particularly important for forest survey systems where the sensor or robot does not follow a defined road or path. Extending our analysis, we then incorporated the best-performing approach, Logg3dNet, into a full 6-DoF pose estimation system, introducing several verification layers for precise registration. We demonstrated the performance of our methods in three operational modes: online SLAM, offline multi-mission SLAM map merging, and relocalization into a prior map. We evaluated these modes using data captured in forests from three different countries, achieving 80% correct loop closure candidates at baseline distances up to 5 m, and 60% up to 10 m.
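A typical verification layer for a loop-closure candidate is geometric: transform one scan by the hypothesised relative pose and check how many of its points find close neighbours in the other scan. The sketch below shows that check with scipy; the thresholds are placeholders, and this is a stand-in for, not a reproduction of, the paper's verification layers.

```python
import numpy as np
from scipy.spatial import cKDTree

def verify_loop_candidate(scan_a, scan_b, T_ab, inlier_dist=0.3, min_inlier_ratio=0.4):
    """Geometric verification of a place-recognition candidate pair.

    scan_a, scan_b: (N, 3) / (M, 3) LiDAR point clouds.
    T_ab: 4x4 relative pose hypothesis mapping points of scan_b into scan_a's frame
          (e.g. from a coarse registration step).
    Accepts the candidate if enough transformed points find a close neighbour in scan_a.
    """
    b_in_a = scan_b @ T_ab[:3, :3].T + T_ab[:3, 3]
    dists, _ = cKDTree(scan_a).query(b_in_a, k=1)
    inlier_ratio = float(np.mean(dists < inlier_dist))
    return inlier_ratio >= min_inlier_ratio, inlier_ratio
```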
https://arxiv.org/abs/2403.14326
Exoskeletons for daily use by those with mobility impairments are being developed. They will require accurate and robust scene understanding systems. Current research has used vision to identify immediate terrain and geometric obstacles; however, these approaches are constrained to detections directly in front of the user and are limited to classifying a finite range of terrain types (e.g., stairs, ramps and level-ground). This paper presents Exosense, a vision-centric scene understanding system which is capable of generating rich, globally-consistent elevation maps, incorporating both semantic and terrain traversability information. It features an elastic Atlas mapping framework associated with a visual SLAM pose graph, embedded with open-vocabulary room labels from a Vision-Language Model (VLM). The device's design includes a wide field-of-view (FoV) fisheye multi-camera system to mitigate the challenges introduced by the exoskeleton walking pattern. We demonstrate the system's robustness to the challenges of typical periodic walking gaits, and its ability to construct accurate semantically-rich maps in indoor settings. Additionally, we showcase its potential for motion planning -- providing a step towards safe navigation for exoskeletons.
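The elevation-layer bookkeeping behind such maps can be sketched very simply: rasterise gravity-aligned points into a 2D grid and keep the maximum height per cell. The code below shows only that step, with arbitrary grid size and resolution; Exosense additionally attaches semantic and traversability information to each cell and maintains an elastic Atlas of submaps, which is not shown.

```python
import numpy as np

def elevation_map(points_xyz, cell=0.1, x_range=(-5, 5), y_range=(-5, 5)):
    """Rasterise a point cloud into a 2.5D elevation grid (max height per cell).

    points_xyz: (N, 3) points in a gravity-aligned map frame.
    Returns a (H, W) array of elevations with NaN for unobserved cells.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.full((ny, nx), np.nan)
    ix = ((points_xyz[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points_xyz[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for x, y, z in zip(ix[valid], iy[valid], points_xyz[valid, 2]):
        if np.isnan(grid[y, x]) or z > grid[y, x]:
            grid[y, x] = z
    return grid
```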
https://arxiv.org/abs/2403.14320
Visual simultaneous localization and mapping (VSLAM) has broad applications, with state-of-the-art methods leveraging deep neural networks for better robustness and applicability. However, there is a lack of research on fusing these learning-based methods with multi-sensor information, which could be indispensable for pushing related applications to large-scale and complex scenarios. In this paper, we tightly integrate trainable deep dense bundle adjustment (DBA) with multi-sensor information through a factor graph. In the framework, recurrent optical flow and DBA are performed among sequential images. The Hessian information derived from DBA is fed into a generic factor graph for multi-sensor fusion, which employs a sliding window and supports probabilistic marginalization. A pipeline for visual-inertial integration is first developed, providing the baseline capability of metric-scale localization and mapping. Furthermore, other sensors (e.g., a global navigation satellite system) are integrated for drift-free and geo-referenced functionality. Extensive tests are conducted on both public datasets and self-collected datasets. The results validate the superior localization performance of our approach, which enables real-time dense mapping in large-scale environments. The code has been made open-source (this https URL).
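The sliding-window prior mentioned in the abstract comes from the standard Schur-complement marginalisation of a Gauss-Newton system: old states are dropped while their information is kept as a prior on the remaining ones. A generic numpy sketch of that step (not the paper's factor-graph code, and the damping term is an added safeguard):

```python
import numpy as np

def marginalize(H, b, keep_idx, marg_idx):
    """Marginalise old states out of a Gauss-Newton system H * dx = b via Schur complement.

    H: (n, n) information (Hessian) matrix, e.g. the DBA Hessian plus IMU/GNSS factors.
    b: (n,) information vector.
    keep_idx, marg_idx: index arrays of states to keep / marginalise.
    Returns the prior (H_prior, b_prior) acting on the kept states, as used when a
    sliding window drops its oldest frame.
    """
    Hkk = H[np.ix_(keep_idx, keep_idx)]
    Hkm = H[np.ix_(keep_idx, marg_idx)]
    Hmm = H[np.ix_(marg_idx, marg_idx)]
    bk, bm = b[keep_idx], b[marg_idx]
    Hmm_inv = np.linalg.inv(Hmm + 1e-9 * np.eye(len(marg_idx)))  # small damping for safety
    H_prior = Hkk - Hkm @ Hmm_inv @ Hkm.T
    b_prior = bk - Hkm @ Hmm_inv @ bm
    return H_prior, b_prior
```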
https://arxiv.org/abs/2403.13714
Despite the number of works published in recent years, vehicle localization remains an open, challenging problem. While map-based localization and SLAM algorithms keep improving, they remain a single point of failure in typical localization pipelines. This paper proposes a modular localization architecture that fuses sensor measurements with the outputs of off-the-shelf localization algorithms. The fusion filter estimates model uncertainties to improve odometry in case absolute pose measurements are lost entirely. The architecture is validated experimentally on a real robot navigating autonomously, demonstrating a reduction in position error of more than 90% with respect to the odometric estimate without uncertainty estimation, over a two-minute navigation period without position measurements.
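One concrete flavour of this idea is to estimate a systematic odometry error online while absolute measurements are available, so that dead-reckoning degrades more slowly once they are lost. The tiny 1D Kalman filter below does this with a position-plus-odometry-scale state; it is an illustrative stand-in, not the paper's fusion filter, and all noise values and the scale-error model are made up.

```python
import numpy as np

class OdomCorrectionKF:
    """Tiny 1D Kalman filter: state = [position, odometry scale error]."""
    def __init__(self):
        self.x = np.array([0.0, 1.0])          # position, odometry scale
        self.P = np.diag([1.0, 0.25])
        self.Q = np.diag([1e-4, 1e-6])         # process noise
        self.R = 0.05 ** 2                     # absolute-position measurement noise

    def predict(self, odom_increment):
        p, s = self.x
        F = np.array([[1.0, odom_increment],   # d(p)/d(p), d(p)/d(s)
                      [0.0, 1.0]])
        self.x = np.array([p + s * odom_increment, s])
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z_pos):
        H = np.array([[1.0, 0.0]])
        y = z_pos - self.x[0]
        S = H @ self.P @ H.T + self.R
        K = (self.P @ H.T) / S
        self.x = self.x + (K * y).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P

# Toy run: odometry over-reads by 10%; absolute fixes available for the first 40 steps only.
kf = OdomCorrectionKF()
true_pos = 0.0
for step in range(80):
    true_pos += 0.1
    kf.predict(0.1 * 1.1)                      # odometry increment with a 10% scale error
    if step < 40:
        kf.update(true_pos)                    # absolute fix (e.g. map-based localization)
print(kf.x, abs(kf.x[0] - true_pos))           # scale near 0.91 learned; small residual drift
```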
https://arxiv.org/abs/2403.13452
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features), that 1) leverages multi-modality through cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders, based on local feature matching, the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured in a planetary-analogue environment, show that UMF significantly outperforms previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset for broader applicability. Code and models are available at this https URL
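The retrieve-then-re-rank pattern can be sketched generically: rank database entries by global-descriptor similarity, then re-order the top-k by how many local features match. The code below uses cosine similarity and mutual nearest neighbours as a simple stand-in for UMF's learned cross-attention features and matching; all array shapes and names are assumptions.

```python
import numpy as np

def retrieve_and_rerank(q_global, db_global, q_local, db_local, k=5):
    """Retrieve top-k places with global descriptors, then re-rank by local matches.

    q_global: (D,) query global descriptor;   db_global: (N, D) database descriptors.
    q_local:  (L, d) query local features;    db_local: list of (Li, d) per database entry.
    Global and local descriptors are assumed L2-normalised.
    """
    topk = np.argsort(db_global @ q_global)[::-1][:k]     # cosine-similarity ranking

    def mutual_matches(A, B):
        sim = A @ B.T
        ab, ba = sim.argmax(axis=1), sim.argmax(axis=0)
        return int(np.sum(ba[ab] == np.arange(len(A))))   # i -> j and j -> i agree

    scores = [mutual_matches(q_local, db_local[i]) for i in topk]
    order = np.argsort(scores)[::-1]
    return topk[order], np.array(scores)[order]
```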
https://arxiv.org/abs/2403.13395
Despite recent advances in semantic Simultaneous Localization and Mapping (SLAM) for terrestrial and aerial applications, underwater semantic SLAM remains an open and largely unaddressed research problem due to the unique sensing modalities and object classes found underwater. This paper presents an object-based semantic SLAM method for underwater environments that can identify, localize, classify, and map a wide variety of marine objects without a priori knowledge of the object classes present in the scene. The method performs unsupervised object segmentation and object-level feature aggregation, and then uses opti-acoustic sensor fusion for object localization. Probabilistic data association is used to determine observation-to-landmark correspondences. Given such correspondences, the method then jointly optimizes landmark and vehicle position estimates. Indoor and outdoor underwater datasets with a wide variety of objects and challenging acoustic and lighting conditions are collected for evaluation and made publicly available. Quantitative and qualitative results show the proposed method achieves reduced trajectory error compared to baseline methods, and obtains map accuracy comparable to a baseline closed-set method that requires hand-labeled data for all objects in the scene.
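A common implementation of the data-association step is maximum-likelihood association with a chi-square gate on the Mahalanobis distance between an observation and each mapped landmark. A generic sketch follows; it is not necessarily the paper's exact formulation, and the gate value and shapes are assumptions.

```python
import numpy as np

def associate(observation, obs_cov, landmarks, landmark_covs, gate=9.21):
    """Maximum-likelihood data association with a chi-square gate.

    observation: (2,) or (3,) measured object position; obs_cov: its covariance.
    landmarks: (M, dim) mapped landmark positions; landmark_covs: (M, dim, dim).
    gate: chi-square threshold (9.21 is roughly the 99% bound for 2 DoF).
    Returns the index of the associated landmark, or None to spawn a new landmark.
    """
    best_idx, best_d2 = None, np.inf
    for i, (lm, lm_cov) in enumerate(zip(landmarks, landmark_covs)):
        innovation = observation - lm
        S = obs_cov + lm_cov                               # innovation covariance
        d2 = innovation @ np.linalg.solve(S, innovation)   # squared Mahalanobis distance
        if d2 < gate and d2 < best_d2:
            best_idx, best_d2 = i, d2
    return best_idx
```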
https://arxiv.org/abs/2403.12837
Simultaneous Localization and Mapping (SLAM) with dense representation plays a key role in robotics, Virtual Reality (VR), and Augmented Reality (AR) applications. Recent advancements in dense representation SLAM have highlighted the potential of leveraging neural scene representation and 3D Gaussian representation for high-fidelity spatial representation. In this paper, we propose a novel dense representation SLAM approach that fuses Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS). In contrast to existing methods, we utilize a single Gaussian map for both tracking and mapping, resulting in mutual benefits. Through the exchange of covariances between the tracking and mapping processes with scale alignment techniques, we minimize redundant computations and achieve an efficient system. Additionally, we enhance tracking accuracy and mapping quality through our keyframe selection methods. Experimental results demonstrate the effectiveness of our approach, showing speeds of up to 107 FPS for the entire system and superior quality of the reconstructed map.
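The covariances being exchanged are exactly what enters the Generalized-ICP cost: each residual is weighted by the combined covariance of its matched pair, and in a Gaussian-map SLAM those covariances can in principle be read off the 3D Gaussians themselves. Below is the standard per-pair G-ICP cost as a generic sketch, not the paper's implementation.

```python
import numpy as np

def gicp_cost(src, src_covs, tgt, tgt_covs, R, t):
    """Generalized-ICP cost for matched point pairs with per-point covariances.

    src, tgt: (N, 3) matched points; src_covs, tgt_covs: (N, 3, 3) covariances,
    e.g. taken from the 3D Gaussians of the map (after scale alignment).
    R, t: candidate rotation (3x3) and translation (3,).
    """
    cost = 0.0
    for a, Ca, b, Cb in zip(src, src_covs, tgt, tgt_covs):
        d = b - (R @ a + t)
        M = Cb + R @ Ca @ R.T                    # combined covariance of the residual
        cost += float(d @ np.linalg.solve(M, d))
    return cost
```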
https://arxiv.org/abs/2403.12550
We propose a dense RGBD SLAM system based on 3D Gaussian Splatting that provides metrically accurate pose tracking and visually realistic reconstruction. To this end, we first propose a Gaussian densification strategy based on the rendering loss to map unobserved areas and refine reobserved areas. Second, we introduce extra regularization parameters to alleviate the forgetting problem in continuous mapping, where parameters tend to overfit the latest frame and result in decreasing rendering quality for previous frames. Both mapping and tracking are performed with Gaussian parameters by minimizing re-rendering loss in a differentiable way. Compared to recent neural and concurrently developed Gaussian Splatting RGBD SLAM baselines, our method achieves state-of-the-art results on the synthetic dataset Replica and competitive results on the real-world dataset TUM.
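Densification driven by the rendering loss can be sketched as: find pixels where the loss is high, back-project them with depth, and seed new Gaussians there. The code below shows only that seeding step with assumed inputs and thresholds; the paper's strategy also refines re-observed areas and initialises scales and opacities, which are omitted here.

```python
import numpy as np

def densify_from_render_loss(loss_map, depth, K, T_wc, loss_thresh=0.1, max_new=2000):
    """Propose new Gaussian centres where the per-pixel rendering loss is high.

    loss_map, depth: (H, W) per-pixel rendering loss and depth.
    K: 3x3 camera intrinsics; T_wc: 4x4 camera-to-world pose.
    Back-projects the worst pixels to 3D as seeds for new Gaussians.
    """
    ys, xs = np.where((loss_map > loss_thresh) & (depth > 0))
    if len(xs) > max_new:                                  # keep only the worst offenders
        keep = np.argsort(loss_map[ys, xs])[::-1][:max_new]
        ys, xs = ys[keep], xs[keep]
    z = depth[ys, xs]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).astype(float)
    cam = np.linalg.inv(K) @ (pix * z)                     # 3 x N points in the camera frame
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = (T_wc @ cam_h)[:3].T                           # N x 3 new Gaussian centres
    return world
```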
https://arxiv.org/abs/2403.12535
Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.
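The abstract does not give the loss in closed form, so the sketch below shows one plausible reading: penalise volume-rendering weights in proportion to their squared distance from the observed surface depth, which pushes the weight mass onto the surface. Treat it as an assumption-laden illustration rather than DVN-SLAM's actual objective.

```python
import torch

def information_concentration_loss(weights, sample_depths, surface_depth):
    """Encourage rendering weights to concentrate at the observed surface.

    weights:       (R, S) per-sample rendering weights along R rays (roughly sum to 1 per ray).
    sample_depths: (R, S) depth of each sample along the ray.
    surface_depth: (R,)   observed / rendered surface depth per ray.
    """
    d2 = (sample_depths - surface_depth.unsqueeze(-1)) ** 2
    return (weights * d2).sum(dim=-1).mean()

# Toy check: weights peaked at the surface give a smaller loss than uniform weights.
depths = torch.linspace(0.5, 3.0, 64).expand(8, 64)
surface = torch.full((8,), 1.75)
uniform = torch.full((8, 64), 1.0 / 64)
peaked = torch.softmax(-200 * (depths - surface.unsqueeze(-1)) ** 2, dim=-1)
print(information_concentration_loss(uniform, depths, surface),
      information_concentration_loss(peaked, depths, surface))
```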
https://arxiv.org/abs/2403.11776
We propose NEDS-SLAM, an explicit dense semantic SLAM system based on 3D Gaussian representation that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from the pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier GS points, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.
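The encoder-decoder compression can be illustrated with a small MLP autoencoder: the encoder produces the compact per-Gaussian code that is actually stored, and the decoder recovers the full semantic feature for supervision. The layer sizes below (512 down to 8) are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SemanticFeatureCodec(nn.Module):
    """Compress high-dimensional semantic features into a compact per-Gaussian code."""
    def __init__(self, feat_dim=512, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))

    def forward(self, feats):
        code = self.encoder(feats)            # compact code stored on each 3D Gaussian
        recon = self.decoder(code)            # full feature recovered for supervision
        return code, recon

codec = SemanticFeatureCodec()
feats = torch.randn(1024, 512)                # e.g. per-pixel features from a segmentation head
code, recon = codec(feats)
loss = nn.functional.mse_loss(recon, feats)   # reconstruction objective for training the codec
print(code.shape, loss.item())
```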
https://arxiv.org/abs/2403.11679
Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research upon this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community.
https://arxiv.org/abs/2403.11496
The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature-matching networks, we employ attentional aggregation over graph edges to enhance keypoint representations, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme to extract pseudo labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments shows the superior performance of our network, as it excludes keypoints on moving objects compared to state-of-the-art feature matching networks while still achieving similar results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
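Using epipolar information to thin the matching graph typically means keeping only candidate edges whose Sampson distance under a (prior) fundamental matrix is small. The sketch below computes that mask in numpy; the threshold and the source of F (for example, an IMU/odometry-based prior) are assumptions rather than the paper's exact design.

```python
import numpy as np

def epipolar_edge_mask(kpts0, kpts1, F, thresh=3.0):
    """Keep candidate match edges consistent with the epipolar geometry.

    kpts0, kpts1: (N, 2) / (M, 2) keypoint locations in two images (pixels).
    F: 3x3 fundamental matrix between the images.
    Returns an (N, M) boolean mask of edges whose Sampson distance (in squared pixels)
    is below thresh**2; only these edges would be kept in the matching graph.
    """
    x0 = np.hstack([kpts0, np.ones((len(kpts0), 1))])   # homogeneous coordinates
    x1 = np.hstack([kpts1, np.ones((len(kpts1), 1))])
    Fx0 = x0 @ F.T                                       # epipolar lines in image 1 (N, 3)
    Ftx1 = x1 @ F                                        # epipolar lines in image 0 (M, 3)
    num = (x1 @ F @ x0.T).T ** 2                         # (N, M) squared algebraic error
    den = (Fx0[:, :1] ** 2 + Fx0[:, 1:2] ** 2) + (Ftx1[:, 0] ** 2 + Ftx1[:, 1] ** 2)[None, :]
    sampson = num / den
    return sampson < thresh ** 2
```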
https://arxiv.org/abs/2403.11370