This paper explores the integration of linguistic inputs within robotic navigation systems, drawing upon the symbol interdependency hypothesis to bridge the divide between symbolic and embodied cognition. It examines previous work incorporating language and semantics into Neural Network (NN) and Simultaneous Localization and Mapping (SLAM) approaches, highlighting how these integrations have advanced the field. By contrasting abstract symbol manipulation with sensory-motor grounding, we propose a unified framework where language functions both as an abstract communicative system and as a grounded representation of perceptual experiences. Our review of cognitive models of distributional semantics and their application to autonomous agents underscores the transformative potential of language-integrated systems.
https://arxiv.org/abs/2404.03049
Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage: this https URL
https://arxiv.org/abs/2404.00923
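The MM3DGS abstract above names three tracking cues: photometric rendering quality, depth estimates, and relative poses from pre-integrated inertial measurements. As a minimal sketch of how such cues combine into a keyframe tracking objective (the weights, norms, and function names here are illustrative assumptions, not MM3DGS's actual formulation):

```python
import numpy as np

def tracking_loss(rendered, image, rendered_depth, depth_est,
                  T_rel_est, T_rel_imu,
                  w_photo=1.0, w_depth=0.1, w_imu=0.5):
    """Illustrative weighted tracking objective; weights and norms are assumptions."""
    l_photo = np.mean(np.abs(rendered - image))            # photometric rendering error
    l_depth = np.mean(np.abs(rendered_depth - depth_est))  # depth-estimate consistency
    l_imu = np.linalg.norm(T_rel_est - T_rel_imu)          # deviation from IMU-pre-integrated relative pose
    return w_photo * l_photo + w_depth * l_depth + w_imu * l_imu

# Perfect rendering, depth, and inertial agreement gives zero loss.
img, d, T = np.zeros((4, 4)), np.ones((4, 4)), np.eye(4)
assert tracking_loss(img, img, d, d, T, T) == 0.0
```

In a real system each residual would be minimized over the keyframe pose; here only the scalar objective is shown.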
In recent years, there have been significant advancements in 3D reconstruction and dense RGB-D SLAM systems. One notable development is the application of Neural Radiance Fields (NeRF) in these systems, which utilizes implicit neural representation to encode 3D scenes. This extension of NeRF to SLAM has shown promising results. However, the depth images obtained from consumer-grade RGB-D sensors are often sparse and noisy, which poses significant challenges for 3D reconstruction and affects the accuracy of the representation of the scene geometry. Moreover, the original hierarchical feature grid with occupancy value is inaccurate for scene geometry representation. Furthermore, the existing methods select random pixels for camera tracking, which leads to inaccurate localization and is not robust in real-world indoor environments. To this end, we present NeSLAM, an advanced framework that achieves accurate and dense depth estimation, robust camera tracking, and realistic synthesis of novel views. First, a depth completion and denoising network is designed to provide dense geometry prior and guide the neural implicit representation optimization. Second, the occupancy scene representation is replaced with Signed Distance Field (SDF) hierarchical scene representation for high-quality reconstruction and view synthesis. Furthermore, we also propose a NeRF-based self-supervised feature tracking algorithm for robust real-time tracking. Experiments on various indoor datasets demonstrate the effectiveness and accuracy of the system in reconstruction, tracking quality, and novel view synthesis.
https://arxiv.org/abs/2403.20034
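NeSLAM replaces occupancy with a hierarchical SDF representation. One common heuristic for rendering from an SDF (used in several neural RGB-D reconstruction systems; NeSLAM's exact formulation may differ) turns the signed distances sampled along a ray into bell-shaped weights that peak at the zero crossing, i.e. the surface:

```python
import numpy as np

def sdf_render_weights(sdf, tr=0.1):
    """Bell-shaped rendering weights from signed distances along a ray.

    sigmoid(s/tr) * sigmoid(-s/tr) peaks where the SDF crosses zero;
    tr is a truncation/sharpness scale (an assumed hyperparameter).
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    w = sig(sdf / tr) * sig(-sdf / tr)
    return w / np.sum(w)

t = np.linspace(0.0, 2.0, 64)   # sample depths along the ray
sdf = 1.2 - t                   # a surface at depth 1.2 m
w = sdf_render_weights(sdf)
depth = float(np.sum(w * t))    # weight-averaged depth concentrates at the surface
assert abs(depth - 1.2) < 0.02
```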
Multi-camera SLAM systems offer a plethora of advantages, primarily stemming from their capacity to amalgamate information from a broader field of view, thereby resulting in heightened robustness and improved localization accuracy. In this research, we present a significant extension and refinement of the state-of-the-art stereo SLAM system, known as ORB-SLAM2, with the objective of attaining even higher accuracy. To accomplish this objective, we commence by mapping measurements from all cameras onto a virtual camera termed BundledFrame. This virtual camera is meticulously engineered to seamlessly adapt to multi-camera configurations, facilitating the effective fusion of data captured from multiple cameras. Additionally, we harness extrinsic parameters in the bundle adjustment (BA) process to achieve precise trajectory estimation. Furthermore, we conduct an extensive analysis of the role of bundle adjustment (BA) in the context of multi-camera scenarios, delving into its impact on tracking, local mapping, and global optimization. Our experimental evaluation entails comprehensive comparisons between ground truth data and the state-of-the-art SLAM system. To rigorously assess the system's performance, we utilize the EuRoC datasets. The consistent results of our evaluations demonstrate the superior accuracy of our system in comparison to existing approaches.
https://arxiv.org/abs/2403.19886
Visual SLAM with thermal imagery, and in other low-contrast, visually degraded environments such as underwater or areas dominated by snow and ice, remains a difficult problem for many state-of-the-art (SOTA) algorithms. In addition to challenging front-end data association, thermal imagery presents an additional difficulty for long-term relocalization and map reuse. The relative temperatures of objects in thermal imagery change dramatically from day to night. Feature descriptors typically used for relocalization in SLAM are unable to maintain consistency over these diurnal changes. We show that learned feature descriptors can be used within existing Bag of Words-based localization schemes to dramatically improve place recognition across large temporal gaps in thermal imagery. In order to demonstrate the effectiveness of our trained vocabulary, we have developed a baseline SLAM system, integrating learned features and matching into a classical SLAM algorithm. Our system demonstrates good local tracking on challenging thermal imagery, and relocalization that overcomes dramatic day to night thermal appearance changes. Our code and datasets are available here: this https URL
https://arxiv.org/abs/2403.19885
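The paper above swaps handcrafted descriptors for learned ones inside a standard Bag of Words retrieval scheme. The mechanics of that scheme can be sketched with toy random vectors standing in for learned descriptors (the vocabulary, dimensions, and place structure below are all illustrative assumptions):

```python
import numpy as np

def bow_vector(descriptors, vocabulary):
    """Quantize each local descriptor to its nearest visual word; L2-normalize the histogram."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(vocabulary)).astype(float)
    return hist / np.linalg.norm(hist)

def place_similarity(desc_a, desc_b, vocabulary):
    """Cosine similarity between the BoW vectors of two images."""
    return float(bow_vector(desc_a, vocabulary) @ bow_vector(desc_b, vocabulary))

rng = np.random.default_rng(1)
vocab = rng.normal(size=(32, 8))             # toy 32-word vocabulary of 8-D descriptors
idx_day = rng.integers(0, 16, size=100)      # this place fires words 0..15
idx_other = rng.integers(16, 32, size=100)   # a different place fires words 16..31
day = vocab[idx_day] + 0.01 * rng.normal(size=(100, 8))
night = vocab[idx_day] + 0.02 * rng.normal(size=(100, 8))  # same place, perturbed descriptors
other = vocab[idx_other] + 0.01 * rng.normal(size=(100, 8))
assert place_similarity(day, night, vocab) > place_similarity(day, other, vocab)
```

The paper's point is that learned descriptors keep `day` and `night` quantizing to the same words despite diurnal appearance change, which handcrafted descriptors fail to do.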
Simultaneous localization and mapping (SLAM) is a critical capability in autonomous navigation, but memory and computational limits make long-term application of common SLAM techniques impractical; a robot must be able to determine what information should be retained and what can safely be forgotten. In graph-based SLAM, the number of edges (measurements) in a pose graph determines both the memory requirements of storing a robot's observations and the computational expense of algorithms deployed for performing state estimation using those observations, both of which can grow unbounded during long-term navigation. Motivated by these challenges, we propose a new general purpose approach to sparsify graphs in a manner that maximizes algebraic connectivity, a key spectral property of graphs which has been shown to control the estimation error of pose graph SLAM solutions. Our algorithm, MAC (for maximizing algebraic connectivity), is simple and computationally inexpensive, and admits formal post hoc performance guarantees on the quality of the solution that it provides. In application to the problem of pose-graph SLAM, we show on several benchmark datasets that our approach quickly produces high-quality sparsification results which retain the connectivity of the graph and, in turn, the quality of corresponding SLAM solutions.
https://arxiv.org/abs/2403.19879
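Algebraic connectivity, the quantity MAC maximizes, is the second-smallest eigenvalue (Fiedler value) of the graph Laplacian. A minimal sketch of computing it for a pose graph follows; the edge-selection algorithm itself, which chooses a sparse subset maximizing this value, is not shown:

```python
import numpy as np

def algebraic_connectivity(n, edges):
    """Second-smallest eigenvalue (Fiedler value) of the unweighted graph Laplacian."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    return np.sort(np.linalg.eigvalsh(L))[1]

odometry_chain = [(0, 1), (1, 2)]            # a bare 3-pose odometry path
with_loop_closure = [(0, 1), (1, 2), (2, 0)] # adding the loop-closing edge 2-0
# The loop closure raises connectivity, which MAC links to lower estimation error.
assert algebraic_connectivity(3, with_loop_closure) > algebraic_connectivity(3, odometry_chain)
```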
Visual Place Recognition aims at recognizing previously visited places by relying on visual clues, and it is used in robotics applications for SLAM and localization. Since typically a mobile robot has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. As a mitigation to this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model is able to outperform previous state of the art while being faster, using 8 times smaller descriptors, having a lighter architecture and allowing to process sequences of various lengths. Code is available at this https URL
https://arxiv.org/abs/2403.19787
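SeqGeM revisits GeM pooling. Standard GeM raises activations to a power p, averages, and takes the p-th root; p = 1 recovers average pooling and large p approaches max pooling. A sketch under the assumption that the pooling is applied across the frame axis of a sequence of single-frame embeddings (SeqGeM's learned details are in the paper):

```python
import numpy as np

def gem(frame_embeddings, p=3.0, eps=1e-6):
    """Generalized-mean pooling of frame embeddings into one sequence embedding.

    p=1 reduces to average pooling; p -> inf approaches max pooling.
    """
    x = np.clip(np.asarray(frame_embeddings), eps, None)  # GeM assumes non-negative inputs
    return np.power(np.mean(np.power(x, p), axis=0), 1.0 / p)

seq = np.array([[0.1, 0.9],
                [0.3, 0.5],
                [0.2, 0.7]])                     # 3 frames, 2-D embeddings
assert np.allclose(gem(seq, p=1.0), seq.mean(axis=0))          # average pooling
assert np.allclose(gem(seq, p=100.0), seq.max(axis=0), atol=0.02)  # near max pooling
```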
Recent advancements in RGB-only dense Simultaneous Localization and Mapping (SLAM) have predominantly utilized grid-based neural implicit encodings and/or struggle to efficiently realize global map and pose consistency. To this end, we propose an efficient RGB-only dense SLAM system using a flexible neural point cloud scene representation that adapts to keyframe poses and depth updates, without needing costly backpropagation. Another critical challenge of RGB-only SLAM is the lack of geometric priors. To alleviate this issue, with the aid of a monocular depth estimator, we introduce a novel DSPO layer for bundle adjustment which optimizes the pose and depth of keyframes along with the scale of the monocular depth. Finally, our system benefits from loop closure and online global bundle adjustment and performs either better or competitive to existing dense neural RGB SLAM methods in tracking, mapping and rendering accuracy on the Replica, TUM-RGBD and ScanNet datasets. The source code will be made available.
https://arxiv.org/abs/2403.19549
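The DSPO layer above jointly optimizes keyframe pose, keyframe depth, and the scale of monocular depth. As a minimal illustration of the scale component only (a hypothetical helper, not the paper's DSPO), the least-squares scale aligning a monocular depth map to reference depth has a closed form:

```python
import numpy as np

def align_scale(mono_depth, ref_depth):
    """Closed-form least-squares scale s minimizing ||s * mono_depth - ref_depth||^2."""
    m = np.asarray(mono_depth, dtype=float).ravel()
    r = np.asarray(ref_depth, dtype=float).ravel()
    valid = (m > 0) & (r > 0)          # ignore invalid / missing depth pixels
    m, r = m[valid], r[valid]
    return float(np.dot(m, r) / np.dot(m, m))

mono = np.array([1.0, 2.0, 4.0, 0.0])  # last pixel invalid
ref = np.array([2.5, 5.0, 10.0, 0.0])
assert abs(align_scale(mono, ref) - 2.5) < 1e-12
```

In the paper this scale is optimized jointly with poses and per-keyframe depths inside bundle adjustment rather than solved independently.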
Implicit neural representation (INR), in combination with geometric rendering, has recently been employed in real-time dense RGB-D SLAM. Despite active research endeavors, the field lacks a unified protocol for fair evaluation, impeding its evolution. In this work, we establish, to our knowledge, the first open-source benchmark framework to evaluate the performance of a wide spectrum of commonly used INRs and rendering functions for mapping and localization. The goal of our benchmark is to 1) gain an intuition of how different INRs and rendering functions impact mapping and localization and 2) establish a unified evaluation protocol w.r.t. the design choices that may impact the mapping and localization. With the framework, we conduct a large suite of experiments, offering various insights in choosing the INRs and geometric rendering functions: for example, the dense feature grid outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and color features are jointly encoded for memory efficiency. To extend these findings to practical scenarios, a hybrid encoding strategy is proposed to bring together the best of the accuracy and completeness of grid-based and decomposition-based INRs. We further propose explicit hybrid encoding for high-fidelity dense grid mapping to suit RGB-D SLAM systems that prioritize robustness and computational efficiency.
https://arxiv.org/abs/2403.19473
We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. It dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only speeds up convergence but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code will be made public upon acceptance of the paper.
https://arxiv.org/abs/2403.17765
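A tri-plane encoding projects a 3D query point onto three orthogonal axis-aligned planes and combines interpolated features from each. This sketch omits MUTE-SLAM's hash encoding and sub-map allocation and uses plain dense feature planes (resolution and channel counts are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
R, C = 8, 4                              # plane resolution, feature channels
planes = rng.normal(size=(3, R, R, C))   # xy, xz, yz feature planes

def triplane_features(p):
    """Sum of bilinearly interpolated features from the three axis-aligned planes.

    p is a point in [0, 1]^3; hash-table lookup is omitted for clarity.
    """
    x, y, z = p
    out = np.zeros(C)
    for plane, (u, v) in zip(planes, [(x, y), (x, z), (y, z)]):
        gu, gv = u * (R - 1), v * (R - 1)
        i0, j0 = int(gu), int(gv)
        i1, j1 = min(i0 + 1, R - 1), min(j0 + 1, R - 1)
        du, dv = gu - i0, gv - j0
        out += ((1 - du) * (1 - dv) * plane[i0, j0]
                + du * (1 - dv) * plane[i1, j0]
                + (1 - du) * dv * plane[i0, j1]
                + du * dv * plane[i1, j1])
    return out

# At a grid corner, interpolation returns the stored features exactly.
corner = triplane_features((0.0, 0.0, 0.0))
assert np.allclose(corner, planes[0, 0, 0] + planes[1, 0, 0] + planes[2, 0, 0])
```

Three R x R planes store 3R^2 feature vectors instead of the R^3 of a dense grid, which is the parameter saving the abstract refers to.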
Terrestrial laser scanning (TLS) is the standard technique used to create accurate point clouds for digital forest inventories. However, the measurement process is demanding, requiring up to two days per hectare for data collection, significant data storage, as well as resource-heavy post-processing of 3D data. In this work, we present a real-time mapping and analysis system that enables online generation of forest inventories using mobile laser scanners that can be mounted, e.g., on mobile robots. Given incrementally created and locally accurate submaps (data payloads), our approach extracts tree candidates using a custom, Voronoi-inspired clustering algorithm. Tree candidates are reconstructed using an adapted Hough algorithm, which enables robust modeling of the tree stem. Further, we explicitly incorporate the incremental nature of the data collection by consistently updating the database using a pose graph LiDAR SLAM system. This enables us to refine our estimates of the tree traits if an area is revisited later during a mission. We demonstrate accuracy competitive with TLS or manual measurements using laser scanners mounted on backpacks or mobile robots operating in conifer, broad-leaf and mixed forests. Our results achieve an RMSE of 1.93 cm, a bias of 0.65 cm and a standard deviation of 1.81 cm (averaged across these sequences), with no post-processing required after the mission is complete.
https://arxiv.org/abs/2403.17622
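The paper models stems with an adapted Hough algorithm; as a simpler stand-in that illustrates the stem-fitting step, an algebraic (Kasa) least-squares circle fit recovers a stem's center and radius from cross-section points, even when the scanner only sees part of the circumference:

```python
import numpy as np

def fit_stem_circle(x, y):
    """Algebraic (Kasa) least-squares circle fit to stem cross-section points.

    Solves x^2 + y^2 = 2*a*x + 2*b*y + c, then radius = sqrt(c + a^2 + b^2).
    Returns (center_x, center_y, radius).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(cx), float(cy), float(np.sqrt(c + cx**2 + cy**2))

theta = np.linspace(0, np.pi, 50)        # scanner sees only half the stem
x = 1.0 + 0.15 * np.cos(theta)
y = 2.0 + 0.15 * np.sin(theta)
cx, cy, r = fit_stem_circle(x, y)
assert abs(r - 0.15) < 1e-6 and abs(cx - 1.0) < 1e-6
```

Hough-style methods trade this closed form for robustness to the outliers (branches, understory) common in real scans, which is why the paper adapts Hough rather than using a plain fit.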
We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work. this https URL
https://arxiv.org/abs/2403.17346
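"Composing the two motions" in the TRAM abstract amounts to chaining rigid-body transforms: the body's world pose is the camera's world pose composed with the camera-frame body pose. A minimal SE(3) sketch (homogeneous 4x4 matrices; the frame names are illustrative, not TRAM's API):

```python
import numpy as np

def se3(R, t):
    """Homogeneous 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def compose_world_body(T_world_cam, T_cam_body):
    """World-frame body pose = camera trajectory composed with camera-frame body motion."""
    return T_world_cam @ T_cam_body

T_wc = se3(np.eye(3), np.array([1.0, 0.0, 0.0]))  # camera moved 1 m along x
T_cb = se3(np.eye(3), np.array([0.0, 0.0, 2.0]))  # person 2 m ahead of the camera
assert np.allclose(compose_world_body(T_wc, T_cb)[:3, 3], [1.0, 0.0, 2.0])
```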
Perception tasks play a crucial role in the development of automated operations and systems across multiple application fields. In the railway transportation domain, these tasks can improve the safety, reliability, and efficiency of various operations, including train localization, signal recognition, and track discrimination. However, collecting considerable and precisely labeled datasets for testing such novel algorithms poses extreme challenges in the railway environment due to the severe restrictions in accessing the infrastructure and the practical difficulties associated with properly equipping trains with the required sensors, such as cameras and LiDARs. The remarkable innovations of graphic engine tools offer new solutions to craft realistic synthetic datasets. To illustrate the advantages of employing graphic simulation for early-stage testing of perception tasks in the railway domain, this paper presents a comparative analysis of the performance of a SLAM algorithm applied both in a virtual synthetic environment and a real-world scenario. The analysis leverages virtual railway environments created with the latest version of Unreal Engine, facilitating data collection and allowing the examination of challenging scenarios, including low-visibility, dangerous operational modes, and complex environments. The results highlight the feasibility and potential of graphic simulation to advance perception tasks in the railway domain.
https://arxiv.org/abs/2403.17084
Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wild, thereby facilitating effective off-road traversal. However, the lack of multi-modal perception across various motion patterns hinders Simultaneous Localization And Mapping (SLAM) solutions, especially when confronting non-geometric hazards in demanding landscapes. In this paper, we first propose a Terrain-Aware multI-modaL (TAIL) dataset tailored to deformable and sandy terrains. It incorporates various types of robotic proprioception and distinct ground interactions, providing unique challenges and a benchmark for multi-sensor fusion SLAM. The versatile sensor suite comprises stereo frame cameras, multiple ground-pointing RGB-D cameras, a rotating 3D LiDAR, an IMU, and an RTK device. This ensemble is hardware-synchronized, well-calibrated, and self-contained. Utilizing both wheeled and quadrupedal locomotion, we efficiently collect comprehensive sequences to capture rich unstructured scenarios. The dataset spans the spectrum of scope, terrain interactions, scene changes, ground-level properties, and dynamic robot characteristics. We benchmark several state-of-the-art SLAM methods against ground truth and provide performance validations. Corresponding challenges and limitations are also reported. All associated resources are accessible upon request at \url{this https URL}.
https://arxiv.org/abs/2403.16875
Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: this https URL.
https://arxiv.org/abs/2403.16095
Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera rotation along with uncertainty from uncalibrated RGB images. Using a Manhattan World assumption, our method leverages the per-pixel geometric priors encoded in single-image surface normal predictions and performs optimisation over the SO(3) manifold. Given a sequence of images, we can use the per-frame rotation estimates and their uncertainty to perform multi-frame optimisation, achieving robustness and temporal consistency. Our experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and is more robust than sparse feature-based SLAM methods. We encourage the reader to view the accompanying video at this https URL for a visual overview of our method.
https://arxiv.org/abs/2403.15583
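Under a Manhattan World assumption, if each predicted surface normal is assigned to one of the canonical axes, the rotation aligning normals to those directions has a classical closed form via SVD (Wahba's problem / orthogonal Procrustes). U-ARE-ME's actual uncertainty-weighted optimisation on the SO(3) manifold is more involved; this sketch only shows the noiseless closed-form core:

```python
import numpy as np

def rotation_from_normals(normals, axes):
    """Closed-form rotation R minimizing sum ||R @ n_i - a_i||^2 (rows n_i, a_i)."""
    B = axes.T @ normals                            # sum of outer products a_i n_i^T
    U, _, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # enforce det(R) = +1
    return U @ D @ Vt

# Recover a known yaw rotation from noiseless normals.
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
axes = np.eye(3)          # canonical Manhattan directions (one per row)
normals = axes @ R_true   # rows n_i = R_true^T a_i
assert np.allclose(rotation_from_normals(normals, axes), R_true)
```

With noisy per-pixel normals and uncertain axis assignments, an iterative on-manifold optimisation (as in the paper) replaces this one-shot solve.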
Precise camera tracking, high-fidelity 3D tissue reconstruction, and real-time online visualization are critical for intrabody medical imaging devices such as endoscopes and capsule robots. However, existing SLAM (Simultaneous Localization and Mapping) methods often struggle to achieve both complete high-quality surgical field reconstruction and efficient computation, restricting their intraoperative application in endoscopic surgery. In this paper, we introduce EndoGSLAM, an efficient SLAM approach for endoscopic surgeries, which integrates streamlined Gaussian representation and differentiable rasterization to facilitate over 100 fps rendering speed during online camera tracking and tissue reconstruction. Extensive experiments show that EndoGSLAM achieves a better trade-off between intraoperative availability and reconstruction quality than traditional or neural SLAM approaches, showing tremendous potential for endoscopic surgeries. The project page is at this https URL
https://arxiv.org/abs/2403.15124
Sub-symbolic artificial intelligence methods dominate the fields of environment-type classification and Simultaneous Localisation and Mapping. However, a significant area overlooked within these fields is solution transparency for the human-machine interaction space, as the sub-symbolic methods employed for map generation do not account for the explainability of the solutions generated. This paper proposes a novel approach to environment-type classification through Symbolic Simultaneous Localisation and Mapping, SymboSLAM, to bridge the explainability gap. Our method classifies environment types through ontological reasoning, synthesising the context of an environment from the features found within it. We achieve explainability within the model by presenting operators with environment-type classifications overlaid on a semantically labelled occupancy map of landmarks and features. We evaluate SymboSLAM with ground-truth maps of the Canberra region, demonstrating the method's effectiveness. We assess the system through both simulations and real-world trials.
https://arxiv.org/abs/2403.15504
Many LiDAR place recognition systems have been developed and tested specifically for urban driving scenarios. Their performance in natural environments such as forests and woodlands has been studied less closely. In this paper, we analyzed the capabilities of four different LiDAR place recognition systems, spanning handcrafted and learning-based methods, using LiDAR data collected with a handheld device and a legged robot within dense forest environments. In particular, we focused on evaluating localization where there is significant translational and orientation difference between corresponding LiDAR scan pairs. This is particularly important for forest survey systems where the sensor or robot does not follow a defined road or path. Extending our analysis, we then incorporated the best-performing approach, Logg3dNet, into a full 6-DoF pose estimation system, introducing several verification layers for precise registration. We demonstrated the performance of our methods in three operational modes: online SLAM, offline multi-mission SLAM map merging, and relocalization into a prior map. We evaluated these modes using data captured in forests from three different countries, achieving 80% correct loop-closure candidates at baseline distances up to 5 m, and 60% up to 10 m.
https://arxiv.org/abs/2403.14326
Exoskeletons for daily use by those with mobility impairments are being developed. They will require accurate and robust scene understanding systems. Current research has used vision to identify immediate terrain and geometric obstacles; however, these approaches are constrained to detections directly in front of the user and are limited to classifying a finite range of terrain types (e.g., stairs, ramps and level ground). This paper presents Exosense, a vision-centric scene understanding system which is capable of generating rich, globally-consistent elevation maps, incorporating both semantic and terrain traversability information. It features an elastic Atlas mapping framework associated with a visual SLAM pose graph, embedded with open-vocabulary room labels from a Vision-Language Model (VLM). The device's design includes a wide field-of-view (FoV) fisheye multi-camera system to mitigate the challenges introduced by the exoskeleton walking pattern. We demonstrate the system's robustness to the challenges of typical periodic walking gaits, and its ability to construct accurate semantically-rich maps in indoor settings. Additionally, we showcase its potential for motion planning, providing a step towards safe navigation for exoskeletons.
https://arxiv.org/abs/2403.14320