Perceiving and mapping the surroundings are essential for enabling autonomous navigation in any robotic platform. The algorithm class that enables accurate mapping while correcting the odometry errors present in most robotics systems is Simultaneous Localization and Mapping (SLAM). Today, fully onboard mapping is only achievable on robotic platforms that can host high-wattage processors, mainly due to the significant computational load and memory demands required for executing SLAM algorithms. For this reason, pocket-size hardware-constrained robots offload the execution of SLAM to external infrastructures. To address the challenge of enabling SLAM algorithms on resource-constrained processors, this paper proposes NanoSLAM, a lightweight and optimized end-to-end SLAM approach specifically designed to operate on centimeter-size robots at a power budget of only 87.9 mW. We demonstrate the mapping capabilities in real-world scenarios and deploy NanoSLAM on a nano-drone weighing 44 g and equipped with a novel commercial RISC-V low-power parallel processor called GAP9. The algorithm is designed to leverage the parallel capabilities of the RISC-V processing cores and enables mapping of a general environment with an accuracy of 4.5 cm and an end-to-end execution time of less than 250 ms.
https://arxiv.org/abs/2309.12008
Numerous datasets and benchmarks exist to assess and compare Simultaneous Localization and Mapping (SLAM) algorithms. Nevertheless, their precision must keep pace with the rate at which SLAM algorithms have improved in recent years. Moreover, current datasets fall short of providing a comprehensive data-collection protocol for reproducibility and for evaluating the precision or accuracy of the recorded trajectories. With this objective in mind, we propose the Robotic Total Stations Ground Truthing (RTS-GT) dataset to support localization research with the generation of six-Degrees-Of-Freedom (DOF) ground truth trajectories. This novel dataset includes six-DOF ground truth trajectories generated using a system of three Robotic Total Stations (RTSs) tracking moving robotic platforms. Furthermore, we compare the performance of the RTS-based system to a Global Navigation Satellite System (GNSS)-based setup. The dataset comprises around sixty experiments conducted in various conditions over a period of 17 months, and encompasses over 49 kilometers of trajectories, making it the most extensive dataset of RTS-based measurements to date. Additionally, we provide the precision of all poses for each experiment, a feature not found in current state-of-the-art datasets. Our results demonstrate that RTSs provide measurements that are 22 times more stable than GNSS in various environmental settings, making them a valuable resource for SLAM benchmark development.
https://arxiv.org/abs/2309.11935
Visual Odometry (VO) plays a pivotal role in autonomous systems, with a principal challenge being the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we utilize the TPV-Former to convert surround-view camera images into 3D semantic occupancy. Addressing the challenges presented by this transformation, we have specifically tailored a pose estimation and mapping algorithm that incorporates a Semantic Label Filter and a Dynamic Object Filter and, finally, utilizes a Voxel PFilter to maintain a consistent global semantic map. Evaluations on the Occ3D-nuScenes benchmark not only showcase a 20.6% improvement in Success Ratio and a 29.6% enhancement in trajectory accuracy against ORB-SLAM3, but also emphasize our ability to construct a comprehensive map. Our implementation is open-sourced and available at: this https URL.
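To make the filtering stage concrete, here is a minimal sketch of how semantic occupancy output could be pruned before pose estimation, assuming illustrative class ids; the paper's Semantic Label Filter, Dynamic Object Filter, and Voxel PFilter are more involved than this stand-in.

```python
# Hedged sketch: drop occupancy voxels whose semantics are unhelpful for pose
# estimation (semantic label filter) and voxels from movable classes (dynamic
# object filter), keeping only static voxel centers for registration and the
# global map. Class ids are assumptions, not the paper's definitions.
import numpy as np

STATIC_CLASS_IDS = {1, 2, 3}    # e.g. road, building, vegetation (assumed ids)
DYNAMIC_CLASS_IDS = {10, 11}    # e.g. car, pedestrian (assumed ids)

def filter_occupancy_for_registration(voxel_centers, voxel_labels):
    """Keep only voxel centers with static, registration-friendly semantics."""
    keep = np.isin(voxel_labels, list(STATIC_CLASS_IDS))     # semantic label filter
    keep &= ~np.isin(voxel_labels, list(DYNAMIC_CLASS_IDS))  # dynamic object filter
    return voxel_centers[keep], voxel_labels[keep]

rng = np.random.default_rng(3)
centers = rng.uniform(-20.0, 20.0, size=(5000, 3))   # occupied voxel centers [m]
labels = rng.choice([1, 2, 3, 10, 11], size=5000)    # per-voxel semantic labels
static_centers, static_labels = filter_occupancy_for_registration(centers, labels)
print(static_centers.shape)   # only static voxels are passed on to pose estimation
```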
https://arxiv.org/abs/2309.11011
This document presents PLVS: a real-time system that leverages sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation. PLVS stands for Points, Lines, Volumetric mapping, and Segmentation. It supports RGB-D and Stereo cameras, which may be optionally equipped with IMUs. The SLAM module is keyframe-based, and extracts and tracks sparse points and line segments as features. Volumetric mapping runs in parallel with the SLAM front-end and generates a 3D reconstruction of the explored environment by fusing point clouds backprojected from keyframes. Different volumetric mapping methods are supported and integrated in PLVS. We use a novel reprojection error to bundle-adjust line segments. This error exploits available depth information to stabilize the position estimates of line segment endpoints. An incremental and geometry-based segmentation method is implemented and integrated for RGB-D cameras in the PLVS framework. We present qualitative and quantitative evaluations of the PLVS framework on some publicly available datasets. The appendix details the adopted stereo line triangulation method and provides a derivation of the Jacobians we used for line error terms. The software is available as open-source.
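As a rough illustration of how depth can stabilize line-segment endpoints during bundle adjustment, the following hedged formulation stacks point-to-line reprojection residuals with endpoint depth residuals; the symbols and weighting are assumptions for exposition, and the paper's appendix gives the actual error terms and Jacobians.

```latex
% Illustrative depth-aware line reprojection error (assumed notation, not PLVS's
% exact formulation): a 3D segment with endpoints P, Q is projected with pose (R, t)
% and intrinsics K; the detected 2D line l = (a, b, c), with a^2 + b^2 = 1, gives
% point-to-line residuals, and measured endpoint depths d_P, d_Q add terms that
% anchor the endpoints along their viewing rays.
\hat{p} = \pi\big(K (R P + t)\big), \qquad
\hat{q} = \pi\big(K (R Q + t)\big), \qquad
\pi([x,\, y,\, z]^{\top}) = [x/z,\; y/z]^{\top},
\qquad
e_{\text{line}} =
\begin{bmatrix} l^{\top} [\hat{p}^{\top}\; 1]^{\top} \\[2pt] l^{\top} [\hat{q}^{\top}\; 1]^{\top} \end{bmatrix},
\qquad
e_{\text{depth}} =
\begin{bmatrix} [R P + t]_{z} - d_{P} \\[2pt] [R Q + t]_{z} - d_{Q} \end{bmatrix},
\qquad
E = e_{\text{line}}^{\top}\Sigma_{l}^{-1} e_{\text{line}}
  + e_{\text{depth}}^{\top}\Sigma_{d}^{-1} e_{\text{depth}}.
```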
https://arxiv.org/abs/2309.10896
Decision making under uncertainty is at the heart of any autonomous system acting with imperfect information. The cost of solving the decision-making problem is exponential in the action and observation spaces, thus rendering it infeasible for many online systems. This paper introduces a novel approach to efficient decision-making by partitioning the high-dimensional observation space. Using the partitioned observation space, we formulate analytical bounds on the expected information-theoretic reward for general belief distributions. These bounds are then used to plan efficiently while keeping performance guarantees. We show that the bounds are adaptive, computationally efficient, and that they converge to the original solution. We extend the partitioning paradigm and present a hierarchy of partitioned spaces that allows greater efficiency in planning. We then propose a specific variant of these bounds for Gaussian beliefs and show a theoretical performance improvement of at least a factor of 4. Finally, we compare our novel method to other state-of-the-art algorithms in active SLAM scenarios, in simulation and in real experiments. In both cases we show a significant speed-up in planning with performance guarantees.
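A toy sketch of the partitioning idea follows: once the observation space is split into cells, each cell's contribution to the expected reward can be bounded by the cell probability times the cell-wise extreme rewards, and the bounds tighten as the partition is refined. The 1-D partition, sample weights, and stand-in reward below are illustrative assumptions, not the paper's analytical bounds for general beliefs.

```python
# Hedged sketch (not the paper's bounds): within each partition cell C,
# sum_{z in C} p(z) r(z) lies between P(C)*min_C r and P(C)*max_C r, so even a
# coarse partition yields valid lower/upper bounds on the expected reward.
import numpy as np

def partitioned_reward_bounds(obs, probs, reward_fn, num_cells):
    """Lower/upper bounds on E[reward] from a partition of sampled observations."""
    rewards = np.array([reward_fn(z) for z in obs])
    edges = np.linspace(obs.min(), obs.max(), num_cells + 1)  # equal-width 1-D cells
    cell_idx = np.clip(np.digitize(obs, edges) - 1, 0, num_cells - 1)
    lower = upper = 0.0
    for c in range(num_cells):
        mask = cell_idx == c
        if not mask.any():
            continue
        p_cell = probs[mask].sum()
        lower += p_cell * rewards[mask].min()
        upper += p_cell * rewards[mask].max()
    return lower, upper

rng = np.random.default_rng(0)
obs = rng.normal(size=2000)                    # sampled future observations
probs = np.full(obs.shape, 1.0 / obs.size)     # uniform sample weights
reward = lambda z: -0.5 * np.log(1.0 + z**2)   # stand-in information reward
for k in (4, 16, 64):
    lo, hi = partitioned_reward_bounds(obs, probs, reward, k)
    print(f"{k:3d} cells: [{lo:.4f}, {hi:.4f}]")  # bounds tighten as cells shrink
```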
https://arxiv.org/abs/2309.10701
The number and arrangement of sensors on an autonomous mobile robot dramatically influence its perception capabilities. Ensuring that sensors are mounted in a manner that enables accurate detection, localization, and mapping is essential for the success of downstream control tasks. However, when designing a new robotic platform, researchers and practitioners alike usually mimic standard configurations or maximize simple heuristics like field-of-view (FOV) coverage to decide where to place exteroceptive sensors. In this work, we conduct an information-theoretic investigation of this overlooked element of mobile robotic perception in the context of simultaneous localization and mapping (SLAM). We show how to formalize the sensor arrangement problem as a form of subset selection under the E-optimality performance criterion. While this formulation is NP-hard in general, we further show that a combination of greedy sensor selection and fast convex relaxation-based post-hoc verification enables the efficient recovery of certifiably optimal sensor designs in practice. Results from synthetic experiments reveal that sensors placed with OASIS (our proposed approach) outperform benchmark placements in terms of the mean squared error of visual SLAM estimates.
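A minimal sketch of the greedy half of this recipe, assuming synthetic per-sensor information matrices: candidates are added one at a time so as to maximize the smallest eigenvalue of the accumulated information matrix (the E-optimality objective); the convex relaxation-based post-hoc verification is omitted here.

```python
# Hedged sketch of greedy sensor selection under E-optimality (maximize the
# smallest eigenvalue of the combined information matrix). The candidate
# information matrices are synthetic stand-ins; OASIS additionally certifies the
# greedy result with a convex relaxation, which this sketch does not include.
import numpy as np

def greedy_e_optimal(candidates, k):
    """Greedily pick k candidates to maximize lambda_min of the summed info matrix."""
    d = candidates[0].shape[0]
    selected, info = [], np.zeros((d, d))
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i, A in enumerate(candidates):
            if i in selected:
                continue
            val = np.linalg.eigvalsh(info + A)[0]  # smallest eigenvalue
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
        info = info + candidates[best_i]
    return selected, np.linalg.eigvalsh(info)[0]

rng = np.random.default_rng(1)
# Synthetic rank-deficient per-sensor information matrices (each sensor observes a subspace).
cands = [(B := rng.normal(size=(6, 2))) @ B.T for _ in range(12)]
chosen, lam_min = greedy_e_optimal(cands, k=4)
print("selected sensors:", chosen, "lambda_min:", round(lam_min, 3))
```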
https://arxiv.org/abs/2309.10698
3D scene graphs offer a more efficient representation of the environment by hierarchically organizing diverse semantic entities and the topological relationships among them. Fiducial markers, on the other hand, offer a valuable mechanism for encoding comprehensive information pertaining to environments and the objects within them. In the context of Visual SLAM (VSLAM), especially when the reconstructed maps are enriched with practical semantic information, these markers have the potential to enhance the map by augmenting it with valuable semantic information and fostering meaningful connections among the semantic objects. In this regard, this paper exploits the potential of fiducial markers to build a VSLAM framework with hierarchical representations that generates optimizable, multi-layered, vision-based situational graphs. The framework comprises a conventional VSLAM system with low-level feature tracking and mapping capabilities, bolstered by the incorporation of a fiducial marker map. The fiducial markers aid in identifying walls and doors in the environment, subsequently establishing meaningful associations with high-level entities, including corridors and rooms. Experiments are conducted on a real-world dataset collected using various legged robots and benchmarked against a Light Detection And Ranging (LiDAR)-based framework (S-Graphs) as the ground truth. Consequently, our framework not only excels in crafting a richer, multi-layered hierarchical map of the environment but also shows improved robot pose accuracy when contrasted with state-of-the-art methods.
https://arxiv.org/abs/2309.10461
Keypoint detection and description play a pivotal role in various robotics and autonomous applications including visual odometry (VO), visual navigation, and Simultaneous Localization and Mapping (SLAM). While a myriad of keypoint detectors and descriptors have been extensively studied in conventional camera images, the effectiveness of these techniques in the context of LiDAR-generated images, i.e. reflectivity and range images, has not been assessed. These images have gained attention due to their resilience in adverse conditions such as rain or fog. Additionally, they contain significant textural information that supplements the geometric information provided by LiDAR point clouds in the point cloud registration phase, especially when relying solely on LiDAR sensors. This addresses the challenge of drift encountered in LiDAR Odometry (LO) within geometrically identical scenarios or where not all of the raw point cloud is informative and may even be misleading. This paper aims to analyze the applicability of conventional image keypoint extractors and descriptors on LiDAR-generated images via a comprehensive quantitative investigation. Moreover, we propose a novel approach to enhance the robustness and reliability of LO. After extracting keypoints, we downsample the point cloud and subsequently integrate it into the point cloud registration phase for odometry estimation. Our experiments demonstrate that the proposed approach has comparable accuracy but reduced computational overhead, a higher odometry publishing rate, and even superior performance in drift-prone scenarios compared to using the raw point cloud. This, in turn, lays a foundation for subsequent investigations into the integration of LiDAR-generated images with LO. Our code is available on GitHub: this https URL.
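The following hedged sketch conveys the flavor of keypoint-driven downsampling: detect conventional keypoints (ORB, as one of the detectors such a study would cover) on a range image and keep only the 3D points behind them for registration. The spherical projection model and field of view are assumptions, not a specific sensor's calibration or the paper's exact pipeline.

```python
# Hedged sketch: ORB keypoints on a LiDAR range image select a texture-aware,
# sparse subset of 3D points for registration. The column->azimuth and
# row->elevation mapping below is an assumed spherical projection.
import cv2
import numpy as np

def keypoint_downsample(range_img_m, max_range=80.0, v_fov=(-25.0, 3.0)):
    """Return 3D points at ORB keypoint pixels of an (H, W) range image in meters."""
    h, w = range_img_m.shape
    img8 = cv2.normalize(range_img_m, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    keypoints = cv2.ORB_create(nfeatures=1000).detect(img8, None)
    azim = np.linspace(np.pi, -np.pi, w)                   # column index -> azimuth [rad]
    elev = np.deg2rad(np.linspace(v_fov[1], v_fov[0], h))  # row index -> elevation [rad]
    pts = []
    for kp in keypoints:
        u = min(int(round(kp.pt[0])), w - 1)
        v = min(int(round(kp.pt[1])), h - 1)
        r = float(range_img_m[v, u])
        if 0.0 < r < max_range:
            pts.append([r * np.cos(elev[v]) * np.cos(azim[u]),
                        r * np.cos(elev[v]) * np.sin(azim[u]),
                        r * np.sin(elev[v])])
    return np.asarray(pts)

# Example on a synthetic range image; a real pipeline would use sensor data and
# feed the sparse cloud to a registration step (e.g. ICP) for odometry.
fake_range = np.random.default_rng(2).uniform(1.0, 60.0, size=(64, 1024)).astype(np.float32)
print(keypoint_downsample(fake_range).shape)
```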
https://arxiv.org/abs/2309.10436
Spiking Neural Networks (SNNs) are at the forefront of neuromorphic computing thanks to their potential energy-efficiency, low latencies, and capacity for continual learning. While these capabilities are well suited for robotics tasks, SNNs have seen limited adaptation in this field thus far. This work introduces a SNN for Visual Place Recognition (VPR) that is both trainable within minutes and queryable in milliseconds, making it well suited for deployment on compute-constrained robotic systems. Our proposed system, VPRTempo, overcomes slow training and inference times using an abstracted SNN that trades biological realism for efficiency. VPRTempo employs a temporal code that determines the timing of a single spike based on a pixel's intensity, as opposed to prior SNNs relying on rate coding that determined the number of spikes; improving spike efficiency by over 100%. VPRTempo is trained using Spike-Timing Dependent Plasticity and a supervised delta learning rule enforcing that each output spiking neuron responds to just a single place. We evaluate our system on the Nordland and Oxford RobotCar benchmark localization datasets, which include up to 27k places. We found that VPRTempo's accuracy is comparable to prior SNNs and the popular NetVLAD place recognition algorithm, while being several orders of magnitude faster and suitable for real-time deployment -- with inference speeds over 50 Hz on CPU. VPRTempo could be integrated as a loop closure component for online SLAM on resource-constrained systems such as space and underwater robots.
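A minimal sketch of the temporal code described, under assumed window parameters: each pixel produces at most one spike whose latency decreases with intensity, so a place signature becomes a pattern of spike times rather than spike counts.

```python
# Hedged sketch of intensity-to-latency ("time-to-first-spike") coding: brighter
# pixels fire earlier within an encoding window. Window length and the inf
# convention for dark pixels are illustrative assumptions, not VPRTempo's
# actual parameters.
import numpy as np

def intensity_to_spike_times(image_u8, window_ms=10.0):
    """Map pixel intensity in [0, 255] to a single spike latency in [0, window_ms)."""
    intensity = image_u8.astype(np.float32) / 255.0
    spike_times = (1.0 - intensity) * window_ms   # brighter pixels fire earlier
    spike_times[intensity <= 0.0] = np.inf        # fully dark pixels never spike
    return spike_times

patch = np.array([[255, 128], [64, 0]], dtype=np.uint8)
print(intensity_to_spike_times(patch))   # approx. [[0.0, 4.98], [7.49, inf]]
```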
https://arxiv.org/abs/2309.10225
In static environments, visual simultaneous localization and mapping (V-SLAM) methods achieve remarkable performance. However, moving objects severely affect core modules of such systems, like state estimation and loop closure detection. To address this, dynamic SLAM approaches often use semantic information, geometric constraints, or optical flow to mask features associated with dynamic entities. These approaches are limited by various factors, such as a dependency on the quality of the underlying method and poor generalization to unknown or unexpected moving objects, and they often produce noisy results, e.g. by masking static but movable objects or relying on predefined thresholds. In this paper, to address these trade-offs, we introduce a novel visual SLAM system, DynaPix, based on per-pixel motion probability values. Our approach consists of a new semantic-free, probabilistic, pixel-wise motion estimation module and an improved pose optimization process. Our per-pixel motion probability estimation combines a novel static background differencing method on both images and optical flows from splatted frames. DynaPix fully integrates these motion probabilities into both map point selection and weighted bundle adjustment within the tracking and optimization modules of ORB-SLAM2. We evaluate DynaPix against ORB-SLAM2 and DynaSLAM on both the GRADE and TUM-RGBD datasets, obtaining lower errors and longer trajectory tracking times. We will release both source code and data upon acceptance of this work.
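As a hedged illustration of how per-pixel motion probabilities can enter pose optimization, a simple weighted reprojection cost is shown below; the weighting form and notation are assumptions for exposition rather than DynaPix's exact objective.

```latex
% Illustrative motion-probability-weighted reprojection cost (assumed form):
% p_{ik} is the motion probability of the pixel in keyframe k observing map
% point X_i, u_{ik} the measurement, \pi the camera projection, and \rho a
% robust kernel; likely-dynamic observations are down-weighted rather than
% hard-masked.
\min_{\{T_k\},\,\{X_i\}} \;
\sum_{(i,k)} \big(1 - p_{ik}\big)\,
\rho\!\left( \big\lVert \pi\big(T_k X_i\big) - u_{ik} \big\rVert^{2}_{\Sigma_{ik}} \right)
```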
https://arxiv.org/abs/2309.09879
We investigate a new paradigm that uses differentiable SLAM architectures in a self-supervised manner to train end-to-end deep learning models in various LiDAR-based applications. To the best of our knowledge, there does not exist any work that leverages SLAM as a training signal for deep-learning-based models. We explore new ways to improve the efficiency, robustness, and adaptability of LiDAR systems with deep learning techniques. We focus on the potential benefits of differentiable SLAM architectures for improving the performance of deep learning tasks such as classification and regression, as well as SLAM itself. Our experimental results demonstrate a non-trivial increase in the performance of two deep learning applications, Ground Level Estimation and Dynamic-to-Static LiDAR Translation, when used with differentiable SLAM architectures. Overall, our findings provide important insights that enhance the performance of LiDAR-based navigation systems. We demonstrate that this new paradigm of using a SLAM loss signal while training LiDAR-based models can be easily adopted by the community.
https://arxiv.org/abs/2309.09206
Dynamic reconstruction with neural radiance fields (NeRF) requires accurate camera poses. These are often hard to retrieve with existing structure-from-motion (SfM) pipelines as both camera and scene content can change. We propose DynaMoN that leverages simultaneous localization and mapping (SLAM) jointly with motion masking to handle dynamic scene content. Our robust SLAM-based tracking module significantly accelerates the training process of the dynamic NeRF while improving the quality of synthesized views at the same time. Extensive experimental validation on TUM RGB-D, BONN RGB-D Dynamic and the DyCheck's iPhone dataset, three real-world datasets, shows the advantages of DynaMoN both for camera pose estimation and novel view synthesis.
https://arxiv.org/abs/2309.08927
The robustness of SLAM algorithms in challenging environmental conditions is crucial for autonomous driving, but the impact of these conditions is hard to quantify, given the difficulty of arbitrarily varying the relevant environmental parameters of the same environment in the real world. Therefore, we propose CARLA-Loc, a synthetic dataset of challenging and dynamic environments built on the CARLA simulator. We integrate multiple sensors into the dataset with strict calibration, synchronization, and precise timestamping. Our dataset provides 7 maps and 42 sequences with different dynamic levels and weather conditions. Objects in both stereo images and point clouds are well segmented with their class labels. We evaluate 5 visual-based and 4 LiDAR-based approaches on various sequences and analyze the effect of challenging environmental factors on localization accuracy, showing the applicability of the proposed dataset for validating SLAM algorithms.
https://arxiv.org/abs/2309.08909
We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at this https URL.
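To convey the voting idea, here is a hedged, heavily simplified sketch: candidate rotations (parameterized as small angular velocities on a coarse grid, rather than the paper's SO(3) parameterization) each predict a purely rotational flow field, and the candidate consistent with the most observed flow vectors wins, which is what makes the estimate robust to independently moving objects.

```python
# Hedged sketch of Hough-style voting for camera rotation from optical flow.
# Small-angle angular-velocity grid and a normalized pinhole model are
# simplifying assumptions; the paper votes over SO(3) far more efficiently.
import numpy as np

def rotational_flow(x, y, w):
    """Flow induced by rotation w = (wx, wy, wz) at normalized image coords (x, y)."""
    wx, wy, wz = w
    u = wx * x * y - wy * (1.0 + x**2) + wz * y
    v = wx * (1.0 + y**2) - wy * x * y - wz * x
    return u, v

def vote_rotation(x, y, flow_u, flow_v, grid=np.linspace(-0.05, 0.05, 11), tol=5e-3):
    """Return the grid rotation whose predicted flow agrees with the most vectors."""
    best_w, best_votes = None, -1
    for wx in grid:
        for wy in grid:
            for wz in grid:
                u, v = rotational_flow(x, y, (wx, wy, wz))
                votes = np.count_nonzero(np.hypot(u - flow_u, v - flow_v) < tol)
                if votes > best_votes:
                    best_w, best_votes = (wx, wy, wz), votes
    return np.array(best_w), best_votes

rng = np.random.default_rng(4)
x, y = rng.uniform(-0.5, 0.5, 500), rng.uniform(-0.4, 0.4, 500)
true_w = np.array([0.02, -0.03, 0.01])
u, v = rotational_flow(x, y, true_w)
u += rng.normal(scale=1e-3, size=u.shape)
v += rng.normal(scale=1e-3, size=v.shape)
# 20% of vectors come from independently moving objects (outliers).
out = rng.random(500) < 0.2
u[out] += rng.normal(scale=0.05, size=out.sum())
v[out] += rng.normal(scale=0.05, size=out.sum())
print(vote_rotation(x, y, u, v))  # the voted rotation should land near true_w
```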
https://arxiv.org/abs/2309.08588
Automated Valet Parking (AVP) requires precise localization in challenging garage conditions, including poor lighting, sparse textures, repetitive structures, dynamic scenes, and the absence of Global Positioning System (GPS) signals, which often pose problems for conventional localization methods. To address these adversities, we present AVM-SLAM, a semantic visual SLAM framework with multi-sensor fusion in a Bird's Eye View (BEV). Our framework integrates four fisheye cameras, four wheel encoders, and an Inertial Measurement Unit (IMU). The fisheye cameras form an Around View Monitor (AVM) subsystem, generating BEV images. Convolutional Neural Networks (CNNs) extract semantic features from these images, aiding in mapping and localization tasks. These semantic features provide long-term stability and perspective invariance, effectively mitigating environmental challenges. Additionally, data fusion from wheel encoders and IMU enhances system robustness by improving motion estimation and reducing drift. To validate AVM-SLAM's efficacy and robustness, we provide a large-scale, high-resolution underground garage dataset, available at this https URL. This dataset enables researchers to further explore and assess AVM-SLAM in similar environments.
https://arxiv.org/abs/2309.08180
We present MAVIS, a novel optimization-based Visual-Inertial SLAM system designed for multiple partially overlapping camera systems. Our framework fully exploits the benefits of the wide field of view afforded by multi-camera systems and the metric-scale measurements provided by an inertial measurement unit (IMU). We introduce an improved IMU pre-integration formulation based on the exponential function of an automorphism of SE_2(3), which can effectively enhance tracking performance under fast rotational motion and extended integration times. Furthermore, we extend the conventional front-end tracking and back-end optimization modules designed for monocular or stereo setups to multi-camera systems, and introduce implementation details that contribute to the performance of our system in challenging scenarios. The practical validity of our approach is supported by experiments on public datasets. Our MAVIS won first place in all vision-IMU tracks (single- and multi-session SLAM) of the Hilti SLAM Challenge 2023, with 1.7 times the score of the second-place entry.
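For readers unfamiliar with the group involved, the following is a brief reminder of the standard SE_2(3) ("extended pose") element and exponential map; the paper's contribution is the pre-integration formulation built on the exponential of an automorphism of this group, which is not reproduced here.

```latex
% Standard extended-pose group used for the pre-integration: an element bundles
% rotation R, velocity v and position p into a 5x5 matrix, and the exponential
% maps a 9-dimensional twist (phi, nu, rho) onto the group via the SO(3)
% exponential Exp and left Jacobian J_l. (Standard definitions only.)
X =
\begin{bmatrix}
R & v & p \\
0_{1\times 3} & 1 & 0 \\
0_{1\times 3} & 0 & 1
\end{bmatrix} \in SE_2(3),
\qquad
\exp\!\big((\phi,\nu,\rho)^{\wedge}\big) =
\begin{bmatrix}
\mathrm{Exp}(\phi) & J_l(\phi)\,\nu & J_l(\phi)\,\rho \\
0_{1\times 3} & 1 & 0 \\
0_{1\times 3} & 0 & 1
\end{bmatrix},
\quad
J_l(\phi) = I + \frac{1-\cos\theta}{\theta^{2}}[\phi]_{\times}
          + \frac{\theta-\sin\theta}{\theta^{3}}[\phi]_{\times}^{2},
\;\; \theta = \lVert\phi\rVert .
```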
https://arxiv.org/abs/2309.08142
Loop closing and relocalization are crucial techniques for establishing reliable and robust long-term SLAM by addressing pose estimation drift and degeneration. This article begins by formulating loop closing and relocalization within a unified framework. Then, we propose a novel multi-head network, LCR-Net, to tackle both tasks effectively. It exploits a novel feature extraction and pose-aware attention mechanism to precisely estimate similarities and 6-DoF poses between pairs of LiDAR scans. In the end, we integrate our LCR-Net into a SLAM system and achieve robust and accurate online LiDAR SLAM in outdoor driving environments. We thoroughly evaluate our LCR-Net through three setups derived from loop closing and relocalization, including candidate retrieval, closed-loop point cloud registration, and continuous relocalization using multiple datasets. The results demonstrate that LCR-Net excels in all three tasks, surpassing the state-of-the-art methods and exhibiting a remarkable generalization ability. Notably, our LCR-Net outperforms baseline methods without using a time-consuming robust pose estimator, rendering it suitable for online SLAM applications. To the best of our knowledge, the integration of LCR-Net yields the first LiDAR SLAM system capable of deep loop closing and relocalization. The implementation of our methods will be made open-source.
https://arxiv.org/abs/2309.08086
In this letter, we address the problem of exploration and metric-semantic mapping of multi-floor, GPS-denied indoor environments using Size, Weight, and Power (SWaP)-constrained aerial robots. Most previous work in exploration assumes that robot localization is solved. However, neglecting the state uncertainty of the agent can ultimately lead to cascading errors both in the resulting map and in the state of the agent itself. Furthermore, actions that reduce localization errors may be at direct odds with the exploration task. We propose a framework that balances the efficiency of exploration with actions that reduce the state uncertainty of the agent. In particular, our algorithmic approach for active metric-semantic SLAM is built upon sparse information abstracted from raw problem data, making it suitable for SWaP-constrained robots. Furthermore, we integrate this framework within a fully autonomous aerial robotic system that achieves autonomous exploration in cluttered 3D environments. In extensive real-world experiments, we show that by including Semantic Loop Closure (SLC), we can reduce robot pose estimation errors by over 90% in translation and approximately 75% in yaw, and the uncertainties in pose estimates and semantic maps by over 70% and 65%, respectively. Although discussed in the context of indoor multi-floor exploration, our system can be used for various other applications, such as infrastructure inspection and precision agriculture, where reliable GPS data may not be available.
https://arxiv.org/abs/2309.06950
For SLAM to be safely deployed in unstructured real-world environments, it must possess several key properties that are not encompassed by conventional benchmarks. In this paper we show that SLAM commutativity, that is, consistency in trajectory estimates on forward and reverse traverses of the same route, is a significant issue for the state of the art. Current pipelines show a significant bias between forward and reverse directions of travel, and are in addition inconsistent regarding which direction of travel exhibits better performance. In this paper we propose several contributions to feature-based SLAM pipelines that remedy the motion bias problem. In a comprehensive evaluation across four datasets, we show that our contributions, implemented in ORB-SLAM2, substantially reduce the bias between forward and backward motion and additionally reduce the aggregated trajectory error. Removing the SLAM motion bias has significant relevance for the wide range of robotics and computer vision applications where performance consistency is important.
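One hedged way to quantify the commutativity issue is sketched below: compute the absolute trajectory error separately for forward and reverse traverses of the same route and report their relative gap; the translation-only alignment and synthetic trajectories are simplifications for illustration.

```python
# Hedged sketch of a forward/reverse motion-bias metric. A full evaluation would
# use a proper SE(3)/Sim(3) trajectory alignment; here a translation-only fit
# keeps the example short.
import numpy as np

def ate_rmse(estimate, ground_truth):
    """RMSE of per-pose position error after removing the mean offset."""
    offset = (ground_truth - estimate).mean(axis=0)
    err = estimate + offset - ground_truth
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

def motion_bias(est_fwd, gt_fwd, est_rev, gt_rev):
    """Relative gap between forward and reverse ATE, plus both ATE values."""
    a_f, a_r = ate_rmse(est_fwd, gt_fwd), ate_rmse(est_rev, gt_rev)
    return abs(a_f - a_r) / ((a_f + a_r) / 2.0), a_f, a_r

rng = np.random.default_rng(5)
gt = np.cumsum(rng.normal(size=(500, 3)), axis=0)           # synthetic route
est_fwd = gt + rng.normal(scale=0.05, size=gt.shape)        # forward traverse estimate
est_rev = gt[::-1] + rng.normal(scale=0.12, size=gt.shape)  # reverse traverse, larger drift
bias, ate_f, ate_r = motion_bias(est_fwd, gt, est_rev, gt[::-1])
print(f"ATE fwd {ate_f:.3f} m, ATE rev {ate_r:.3f} m, relative bias {bias:.2f}")
```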
https://arxiv.org/abs/2309.06792
Maps have played an indispensable role in enabling safe and automated driving. Although there have been many advances on different fronts ranging from SLAM to semantics, building an actionable hierarchical semantic representation of urban dynamic scenes from multiple agents is still a challenging problem. In this work, we present collaborative urban scene graphs (CURB-SG) that enable higher-order reasoning and efficient querying for many functions of automated driving. CURB-SG leverages panoptic LiDAR data from multiple agents to build large-scale maps using an effective graph-based collaborative SLAM approach that detects inter-agent loop closures. To semantically decompose the obtained 3D map, we build a lane graph from the paths of ego agents and their panoptic observations of other vehicles. Based on the connectivity of the lane graph, we segregate the environment into intersecting and non-intersecting road areas. Subsequently, we construct a multi-layered scene graph that includes lane information, the position of static landmarks and their assignment to certain map sections, other vehicles observed by the ego agents, and the pose graph from SLAM including 3D panoptic point clouds. We extensively evaluate CURB-SG in urban scenarios using a photorealistic simulator and release our code at this http URL.
https://arxiv.org/abs/2309.06635