Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.
https://arxiv.org/abs/2403.11776
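The abstract above does not specify the information concentration loss; a minimal sketch of one plausible formulation, penalizing the spread of a ray's rendering-weight distribution around its mean depth so density concentrates on surfaces (the function name and exact form are illustrative assumptions, not the paper's definition):

```python
def concentration_loss(weights, depths):
    """Toy per-ray loss: variance of the normalized rendering-weight
    distribution around its mean depth, pushing density to concentrate
    on the object surface (hypothetical formulation)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights]          # normalized termination probs
    mean_d = sum(p * d for p, d in zip(probs, depths))
    return sum(p * (d - mean_d) ** 2 for p, d in zip(probs, depths))

depths = [1.0, 2.0, 3.0, 4.0, 5.0]
peaked = concentration_loss([0.0, 0.1, 0.8, 0.1, 0.0], depths)  # surface-concentrated
spread = concentration_loss([0.2, 0.2, 0.2, 0.2, 0.2], depths)  # diffuse along the ray
```

A weight distribution peaked at the surface scores lower than a diffuse one, so minimizing this term favors sharp surfaces.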
We propose NEDS-SLAM, an Explicit Dense semantic SLAM system based on 3D Gaussian representation that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from the pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel-view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier GS points, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance against existing dense semantic SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.
https://arxiv.org/abs/2403.11679
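As a toy illustration of compressing high-dimensional semantic features into a compact per-Gaussian code, here is a sketch that stands in a fixed orthonormal projection for the paper's learned lightweight encoder-decoder (dimensions, basis, and exact reconstruction are illustrative assumptions):

```python
# Two orthonormal rows spanning a 2-D subspace of a 4-D feature space.
# The real system learns this mapping; the basis here is hand-picked.
BASIS = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]

def encode(feat):
    """Project a high-dimensional feature onto the compact basis."""
    return [sum(b * f for b, f in zip(row, feat)) for row in BASIS]

def decode(code):
    """Reconstruct with the transpose (exact within the spanned subspace)."""
    return [sum(code[i] * BASIS[i][j] for i in range(len(BASIS)))
            for j in range(len(BASIS[0]))]

feat = [1.0, 0.0, 1.0, 0.0]   # lies in the spanned subspace
code = encode(feat)           # 4-D feature stored as a 2-D code per Gaussian
recon = decode(code)
```

Storing the low-dimensional code per Gaussian instead of the full feature is what reduces the memory footprint.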
Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research upon this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community.
https://arxiv.org/abs/2403.11496
The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature-matching networks, we employ attentional aggregation over graph edges to enhance keypoint representations, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme that extracts pseudo-labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments shows the superior performance of our network in excluding keypoints on moving objects compared to state-of-the-art feature matching networks, while still achieving similar results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
https://arxiv.org/abs/2403.11370
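One way the epipolar information mentioned above can enter a matching pipeline is as a point-to-epipolar-line distance used to prune candidate match edges; a minimal sketch (the rectified-stereo fundamental matrix and the pruning role are illustrative assumptions, not the paper's exact graph design):

```python
def epipolar_distance(F, p1, p2):
    """Distance from homogeneous point p2 to the epipolar line l = F @ p1."""
    l = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]
    num = abs(l[0] * p2[0] + l[1] * p2[1] + l[2] * p2[2])
    return num / (l[0] ** 2 + l[1] ** 2) ** 0.5

# Fundamental matrix of a rectified stereo pair: matches must share a row.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
consistent = epipolar_distance(F, (10, 5, 1), (40, 5, 1))  # same row: 0 px
violating = epipolar_distance(F, (10, 5, 1), (40, 9, 1))   # off the line: 4 px
```

Edges whose distance exceeds a threshold could be dropped before attentional aggregation, which is one way to "vastly reduce the number of graph edges."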
Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs, and slow training speed. To address the limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. Then we observe that the covariance matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook to compress 3D Gaussian geometric attributes, i.e., the parameters. Robust and accurate pose estimation is achieved by a global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering speed while maintaining the state-of-the-art (SOTA) quality of the scene representation.
https://arxiv.org/abs/2403.11247
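A minimal sketch of the geometry-codebook idea: each Gaussian's geometric attributes are replaced by the index of the nearest shared codebook entry, so only one integer per Gaussian is stored instead of the full parameter vector (codebook construction, e.g. by clustering, is omitted, and the values are illustrative):

```python
def quantize(geoms, codebook):
    """Map each Gaussian's geometry vector to the index of its nearest
    codebook entry (squared Euclidean distance)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: d2(g, codebook[k]))
            for g in geoms]

codebook = [[0.1, 0.1, 0.1], [1.0, 1.0, 1.0]]            # shared entries
geoms = [[0.12, 0.09, 0.11], [0.9, 1.1, 1.0], [0.1, 0.1, 0.1]]
indices = quantize(geoms, codebook)                       # one int per Gaussian
```

Because most covariances are extremely similar (as the paper observes), a small codebook captures the geometry with little distortion.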
Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap. Human users can query inventory by providing natural language queries and receiving a 3D heatmap of potential object locations. To manage the computational load, we use Fog-ROS2, a cloud robotics platform, to offload resource-intensive tasks. Lifelong LERF obtains poses from a monocular RGBD SLAM backend, and uses these poses to progressively optimize a Language Embedded Radiance Field (LERF) for semantic monitoring. Experiments with 3-5 objects arranged on a tabletop and a Turtlebot with a RealSense camera suggest that Lifelong LERF can persistently adapt to changes in objects with up to 91% accuracy.
https://arxiv.org/abs/2403.10494
Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow on the 1/16 resolution, capturing large displacement, which is then refined on the 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
https://arxiv.org/abs/2403.10425
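The global-to-local scheme can be sketched in 1-D: exhaustive matching at the coarse scale gives an initial flow capturing large displacement, and a small local search around it refines the estimate (the toy features and the +/-1 search radius are illustrative assumptions, not NeuFlow's actual architecture):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_match(f_src, f_tgt):
    """Coarse flow: exhaustive best-correlation match per source feature."""
    flow = []
    for i, fs in enumerate(f_src):
        j = max(range(len(f_tgt)), key=lambda j: dot(fs, f_tgt[j]))
        flow.append(j - i)
    return flow

def local_refine(flow_coarse, f_src, f_tgt, radius=1):
    """Fine flow: re-search only within +/-radius of the coarse estimate."""
    flow = []
    for i, fs in enumerate(f_src):
        center = i + flow_coarse[i]
        cands = [j for j in range(center - radius, center + radius + 1)
                 if 0 <= j < len(f_tgt)]
        j = max(cands, key=lambda j: dot(fs, f_tgt[j]))
        flow.append(j - i)
    return flow

f_src = [[1, 0], [0, 1], [1, 1]]
f_tgt = [[0, 0], [1, 0], [0, 1], [1, 1]]   # same features shifted by one
coarse = global_match(f_src, f_tgt)
fine = local_refine(coarse, f_src, f_tgt)
```

The efficiency argument is visible even in this toy: the refinement step evaluates a constant-size window per pixel instead of the full image.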
Forestry constitutes a key element for a sustainable future, while it is supremely challenging to introduce digital processes to improve efficiency. The main limitation is the difficulty of obtaining accurate maps at high temporal and spatial resolution as a basis for informed forestry decision-making, due to the vast area forests extend over and the sheer number of trees. To address this challenge, we present an autonomous Micro Aerial Vehicle (MAV) system which purely relies on cost-effective and light-weight passive visual and inertial sensors to perform under-canopy autonomous navigation. We leverage visual-inertial simultaneous localization and mapping (VI-SLAM) for accurate MAV state estimates and couple it with a volumetric occupancy submapping system to achieve a scalable mapping framework which can be directly used for path planning. As opposed to a monolithic map, submaps inherently deal with inevitable drift and corrections from VI-SLAM, since they move with pose estimates as they are updated. To ensure the safety of the MAV during navigation, we also propose a novel reference trajectory anchoring scheme that moves and deforms the reference trajectory the MAV is tracking upon state updates from the VI-SLAM system in a consistent way, even upon large changes in state estimates due to loop-closures. We thoroughly validate our system in both real and simulated forest environments with high tree densities in excess of 400 trees per hectare and at speeds up to 3 m/s - while not encountering a single collision or system failure. To the best of our knowledge this is the first system which achieves this level of performance in such unstructured environment using low-cost passive visual sensors and fully on-board computation including VI-SLAM.
https://arxiv.org/abs/2403.09596
Feature point detection and description is the backbone for various computer vision applications, such as Structure-from-Motion, visual SLAM, and visual place recognition. While learning-based methods have surpassed traditional handcrafted techniques, their training often relies on simplistic homography-based simulations of multi-view perspectives, limiting model generalisability. This paper introduces a novel approach leveraging neural radiance fields (NeRFs) for realistic multi-view training data generation. We create a diverse multi-view dataset using NeRFs, consisting of indoor and outdoor scenes. Our proposed methodology adapts state-of-the-art feature detectors and descriptors to train on NeRF-synthesised views supervised by perspective projective geometry. Our experiments demonstrate that the proposed methods achieve competitive or superior performance on standard benchmarks for relative pose estimation, point cloud registration, and homography estimation while requiring significantly less training data compared to existing approaches.
https://arxiv.org/abs/2403.08156
Monocular SLAM has long grappled with the challenge of accurately modeling 3D geometries. Recent advances in Neural Radiance Fields (NeRF)-based monocular SLAM have shown promise, yet these methods typically focus on novel view synthesis rather than precise 3D geometry modeling. This focus results in a significant disconnect between NeRF applications, i.e., novel-view synthesis, and the requirements of SLAM. We identify that the gap results from the volumetric representations used in NeRF, which are often dense and noisy. In this study, we propose a novel approach that reimagines volumetric representations through the lens of quadric forms. We posit that most scene components can be effectively represented as quadric planes. Leveraging this assumption, we reshape the volumetric representation from millions of cubes into several quadric planes, which leads to more accurate and efficient modeling of 3D scenes in SLAM contexts. Our method involves two key steps: First, we use the quadric assumption to enhance coarse depth estimations obtained from tracking modules, e.g., Droid-SLAM. This step alone significantly improves depth estimation accuracy. Second, in the subsequent mapping phase, we diverge from previous NeRF-based SLAM methods that distribute sampling points across the entire volume space. Instead, we concentrate sampling points around quadric planes and aggregate them using a novel quadric-decomposed Transformer. Additionally, we introduce an end-to-end joint optimization strategy that synchronizes pose estimation with 3D reconstruction.
https://arxiv.org/abs/2403.08125
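The quadric-plane assumption suggests refining coarse depth by fitting a plane to local depth samples; a minimal least-squares sketch (the fitting setup below is an illustrative assumption, not the paper's exact refinement method):

```python
def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) samples via
    the 3x3 normal equations, solved by Gaussian elimination."""
    M = [[0.0] * 3 for _ in range(3)]
    v = [0.0] * 3
    for x, y, z in points:
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                M[i][j] += row[i] * row[j]
            v[i] += row[i] * z
    for col in range(3):                      # forward elimination with pivoting
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for j in range(col, 3):
                M[r][j] -= f * M[col][j]
            v[r] -= f * v[col]
    w = [0.0] * 3                             # back substitution
    for r in (2, 1, 0):
        w[r] = (v[r] - sum(M[r][j] * w[j] for j in range(r + 1, 3))) / M[r][r]
    return w

# Samples lying exactly on z = 2x + 3y + 1 are recovered exactly.
a, b, c = fit_plane([(0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6)])
```

Snapping noisy coarse depths onto such fitted planes is one concrete way a planar prior can regularize per-pixel depth estimates.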
Event cameras are bio-inspired visual sensors that capture pixel-wise intensity changes and output asynchronous event streams. They show great potential over conventional cameras to handle challenging scenarios in robotics and computer vision, such as high-speed and high dynamic range. This paper considers the problem of rotational motion estimation using event cameras. Several event-based rotation estimation methods have been developed in the past decade, but their performance has not been evaluated and compared under unified criteria yet. In addition, these prior works do not consider a global refinement step. To this end, we conduct a systematic study of this problem with two objectives in mind: summarizing previous works and presenting our own solution. First, we compare prior works both theoretically and experimentally. Second, we propose the first event-based rotation-only bundle adjustment (BA) approach. We formulate it leveraging the state-of-the-art Contrast Maximization (CMax) framework, which is principled and avoids the need to convert events into frames. Third, we use the proposed BA to build CMax-SLAM, the first event-based rotation-only SLAM system comprising a front-end and a back-end. Our BA is able to run both offline (trajectory smoothing) and online (CMax-SLAM back-end). To demonstrate the performance and versatility of our method, we present comprehensive experiments on synthetic and real-world datasets, including indoor, outdoor and space scenarios. We discuss the pitfalls of real-world evaluation and propose a proxy for the reprojection error as the figure of merit to evaluate event-based rotation BA methods. We release the source code and novel data sequences to benefit the community. We hope this work leads to a better understanding and fosters further research on event-based ego-motion estimation. Project page: this https URL
https://arxiv.org/abs/2403.08119
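The Contrast Maximization objective can be sketched in 1-D: warp the events by a candidate angular velocity, accumulate an image of warped events, and score its variance; the true motion yields the sharpest (highest-contrast) image (toy data and a scalar velocity, not the paper's rotation model or implementation):

```python
def contrast(events, w, bins=5):
    """Warp 1-D events (x, t) by candidate velocity w, accumulate an
    image of warped events, and return its variance (CMax objective)."""
    img = [0] * bins
    for x, t in events:
        xw = round(x - w * t)
        if 0 <= xw < bins:
            img[xw] += 1
    mean = sum(img) / bins
    return sum((v - mean) ** 2 for v in img) / bins

events = [(2 + t, t) for t in range(3)]   # a point moving at velocity 1
best_w = max([0.0, 0.5, 1.0, 1.5], key=lambda w: contrast(events, w))
```

At the correct velocity all events collapse onto one pixel, maximizing contrast, which is why the objective needs no conversion of events into frames.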
Imitation learning from human hand motion data presents a promising avenue for imbuing robots with human-like dexterity in real-world manipulation tasks. Despite this potential, substantial challenges persist, particularly with the portability of existing hand motion capture (mocap) systems and the difficulty of translating mocap data into effective control policies. To tackle these issues, we introduce DexCap, a portable hand motion capture system, alongside DexIL, a novel imitation algorithm for training dexterous robot skills directly from human hand mocap data. DexCap offers precise, occlusion-resistant tracking of wrist and finger motions based on SLAM and electromagnetic field together with 3D observations of the environment. Utilizing this rich dataset, DexIL employs inverse kinematics and point cloud-based imitation learning to replicate human actions with robot hands. Beyond learning from human motion, DexCap also offers an optional human-in-the-loop correction mechanism to refine and further improve robot performance. Through extensive evaluation across six dexterous manipulation tasks, our approach not only demonstrates superior performance but also showcases the system's capability to effectively learn from in-the-wild mocap data, paving the way for future data collection methods for dexterous manipulation. More details can be found at this https URL
https://arxiv.org/abs/2403.07788
Drones, also known as UAVs, were originally designed for military purposes. With technological advances, they can now be seen in most aspects of life, from filming to logistics. The increased use of drones has sometimes made it essential to form a collaboration between them to perform tasks efficiently in a defined process. This paper investigates the use of a combined centralised and decentralised architecture for the collaborative operation of drones in a parts delivery scenario to enable and expedite the operation of the factories of the future. The centralised and decentralised approaches were extensively researched, with experimentation being undertaken to determine the appropriateness of each approach for this use-case. Decentralised control was utilised to remove the need for excessive communication during the operation of the drones, resulting in smoother operations. Initial results suggested that the decentralised approach is more appropriate for this use-case. The individual functionalities necessary for the implementation of a decentralised architecture were proven and assessed, determining that a combination of multiple individual functionalities, namely VSLAM, dynamic collision avoidance and object tracking, would give an appropriate solution for use in an industrial setting. A final architecture for the parts delivery system was proposed for future work, using a combined centralised and decentralised approach to combat the limitations inherent in each architecture.
https://arxiv.org/abs/2403.07635
We propose SemGauss-SLAM, the first semantic SLAM system utilizing 3D Gaussian representation, that enables accurate 3D semantic mapping, robust camera tracking, and high-quality rendering in real-time. In this system, we incorporate semantic feature embedding into 3D Gaussian representation, which effectively encodes semantic information within the spatial layout of the environment for precise semantic scene representation. Furthermore, we propose feature-level loss for updating 3D Gaussian representation, enabling higher-level guidance for 3D Gaussian optimization. In addition, to reduce cumulative drift and improve reconstruction accuracy, we introduce semantic-informed bundle adjustment leveraging semantic associations for joint optimization of 3D Gaussian representation and camera poses, leading to more robust tracking and consistent mapping. Our SemGauss-SLAM method demonstrates superior performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in novel-view semantic synthesis and 3D semantic mapping.
https://arxiv.org/abs/2403.07494
We propose an accurate and robust initialization approach for stereo visual-inertial SLAM systems. Unlike the current state-of-the-art method, which heavily relies on the accuracy of a pure visual SLAM system to estimate inertial variables without updating camera poses, potentially compromising accuracy and robustness, our approach offers a different solution. We realize the crucial impact of precise gyroscope bias estimation on rotation accuracy. This, in turn, affects trajectory accuracy due to the accumulation of translation errors. To address this, we first independently estimate the gyroscope bias and use it to formulate a maximum a posteriori problem for further refinement. After this refinement, we proceed to update the rotation estimation by performing IMU integration with gyroscope bias removed from gyroscope measurements. We then leverage robust and accurate rotation estimates to enhance translation estimation via 3-DoF bundle adjustment. Moreover, we introduce a novel approach for determining the success of the initialization by evaluating the residual of the normal epipolar constraint. Extensive evaluations on the EuRoC dataset illustrate that our method excels in accuracy and robustness. It outperforms ORB-SLAM3, the current leading stereo visual-inertial initialization method, in terms of absolute trajectory error and relative rotation error, while maintaining competitive computational speed. Notably, even with 5 keyframes for initialization, our method consistently surpasses the state-of-the-art approach using 10 keyframes in rotation accuracy.
https://arxiv.org/abs/2403.07225
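The idea of removing an independently estimated gyroscope bias before integration can be sketched on a single axis (the paper's maximum-a-posteriori refinement and full 3-D IMU preintegration are omitted; values are illustrative):

```python
def estimate_bias(stationary_gyro):
    """Bias as the mean angular-rate reading while the rig is at rest."""
    return sum(stationary_gyro) / len(stationary_gyro)

def integrate_yaw(gyro, dt, bias):
    """Integrate a single-axis angular rate with the bias removed."""
    yaw = 0.0
    for w in gyro:
        yaw += (w - bias) * dt
    return yaw

bias = estimate_bias([0.02, 0.018, 0.022])                  # rad/s, at rest
yaw = integrate_yaw([0.52, 0.52, 0.52], dt=0.1, bias=bias)  # true rate 0.5 rad/s
```

Without the bias subtraction the integrated rotation drifts linearly in time, which is exactly the error accumulation the initialization scheme targets.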
We present a neural-field-based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photo-realistic textures. This system adapts the state-of-the-art neural radiance field (NeRF) representation to also incorporate lidar data which adds strong geometric constraints on the depth and surface normals. We exploit the trajectory from a real-time lidar SLAM system to bootstrap a Structure-from-Motion (SfM) procedure to both significantly reduce the computation time and to provide metric scale which is crucial for lidar depth loss. We use submapping to scale the system to large-scale environments captured over long trajectories. We demonstrate the reconstruction system with data from a multi-camera, lidar sensor suite onboard a legged robot, hand-held while scanning building scenes for 600 metres, and onboard an aerial robot surveying a multi-storey mock disaster site-building. Website: this https URL
https://arxiv.org/abs/2403.06877
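The lidar depth constraint mentioned above is commonly realized as a per-pixel depth loss added to the NeRF objective; a minimal sketch (the L1 form and the validity masking are illustrative assumptions, not the paper's exact loss):

```python
def lidar_depth_loss(rendered, measured, valid):
    """Mean L1 error between rendered and lidar depths, over pixels
    that actually have a lidar return."""
    terms = [abs(r - m) for r, m, ok in zip(rendered, measured, valid) if ok]
    return sum(terms) / len(terms) if terms else 0.0

loss = lidar_depth_loss([2.1, 3.0, 7.0], [2.0, 3.0, 5.0],
                        [True, True, False])   # last pixel has no return
```

Metric scale from the SLAM trajectory matters here: a depth loss in meters is only meaningful once the SfM reconstruction shares the lidar's scale.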
Distributed as an open source library since 2013, RTAB-Map started as an appearance-based loop closure detection approach with memory management to deal with large-scale and long-term online operation. It then grew to implement Simultaneous Localization and Mapping (SLAM) on various robots and mobile platforms. As each application brings its own set of constraints on sensors, processing capabilities and locomotion, it raises the question of which SLAM approach is the most appropriate to use in terms of cost, accuracy, computation power and ease of integration. Since most SLAM approaches are either visual or lidar-based, comparison is difficult. Therefore, we decided to extend RTAB-Map to support both visual and lidar SLAM, providing in one package a tool allowing users to implement and compare a variety of 3D and 2D solutions for a wide range of applications with different robots and sensors. This paper presents this extended version of RTAB-Map and its use in comparing, both quantitatively and qualitatively, a large selection of popular real-world datasets (e.g., KITTI, EuRoC, TUM RGB-D, MIT Stata Center on PR2 robot), outlining strengths and limitations of visual and lidar SLAM configurations from a practical perspective for autonomous navigation applications.
https://arxiv.org/abs/2403.06341
Asynchronous microphone array calibration is a prerequisite for most robot audition applications. In practice, the calibration requires estimating microphone positions, time offsets, clock drift rates, and sound event locations simultaneously. An existing method formulates Graph-based Simultaneous Localisation and Mapping (Graph-SLAM) using the common TDOA, the time difference of arrival between two microphones (TDOA-M), together with odometry measurements; however, it depends heavily on the initial values. In this paper, we propose a novel TDOA, the time difference of arrival between adjacent sound events (TDOA-S), combine it with TDOA-M into a hybrid TDOA formulation, and add odometry measurements to construct the Graph-SLAM problem, which we solve with the Gauss-Newton (GN) method. TDOA-S is simple and efficient because it eliminates the time offset without introducing new variables. Simulation and real-world experiment results consistently show that our method is independent of the number of microphones, insensitive to initial values, and achieves better calibration accuracy and stability under various TDOA noise levels. In addition, the simulation results demonstrate that our method has a lower Cramér-Rao lower bound (CRLB) for the microphone parameters, which explains its advantages.
https://arxiv.org/abs/2403.05791
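The key property of TDOA-S, eliminating the per-microphone time offset without new variables, can be checked numerically: differencing the arrival times of two adjacent sound events at the same microphone cancels that microphone's clock offset (toy 2-D geometry with illustrative values):

```python
C = 343.0   # nominal speed of sound, m/s

def arrival(mic, offset, src, emit_t):
    """Measured arrival time at a mic with an unknown clock offset."""
    d = ((mic[0] - src[0]) ** 2 + (mic[1] - src[1]) ** 2) ** 0.5
    return emit_t + d / C + offset

mic, offset = (0.0, 0.0), 0.37           # offset is unknown in practice
e1, t1 = (3.0, 4.0), 0.0                 # two adjacent sound events
e2, t2 = (6.0, 8.0), 1.0
tdoa_s = arrival(mic, offset, e2, t2) - arrival(mic, offset, e1, t1)
expected = (t2 - t1) + (10.0 - 5.0) / C  # offset has cancelled
```

Because the same offset appears in both arrival times, TDOA-S depends only on emission times and ranges, which is why no extra offset variable is needed in the graph.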
Reliable localization is an essential capability for marine robots navigating in GPS-denied environments. SLAM, commonly used to mitigate dead reckoning errors, still fails in feature-sparse environments or with limited-range sensors. Pose estimation can be improved by incorporating the uncertainty prediction of future poses into the planning process and choosing actions that reduce uncertainty. However, performing belief propagation is computationally costly, especially when operating in large-scale environments. This work proposes a computationally efficient planning-under-uncertainty framework suitable for large-scale, feature-sparse environments. Our strategy leverages SLAM graph and occupancy map data obtained from a prior exploration phase to create a virtual map, describing the uncertainty of each map cell using a multivariate Gaussian. The virtual map is then used as a cost map in the planning phase, and performing belief propagation at each step is avoided. A receding horizon planning strategy is implemented, managing a goal-reaching and uncertainty-reduction tradeoff. Simulation experiments in a realistic underwater environment validate this approach. Experimental comparisons against a full belief propagation approach and a standard shortest-distance approach are conducted.
https://arxiv.org/abs/2403.04936
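A minimal sketch of using the virtual map as a cost map: each cell's multivariate Gaussian contributes a scalar cost (here the trace of its covariance), and candidate paths are compared by summed cost (the trace choice and the values are illustrative assumptions, not the paper's exact cost function):

```python
def cell_cost(cov):
    """Scalar uncertainty for a map cell: trace of its covariance
    (sum of per-axis variances)."""
    return sum(cov[i][i] for i in range(len(cov)))

def path_cost(path, cov_map):
    return sum(cell_cost(cov_map[c]) for c in path)

cov_map = {
    "A": [[0.1, 0.0], [0.0, 0.1]],   # well-observed cell
    "B": [[2.0, 0.0], [0.0, 2.0]],   # poorly observed cell
    "C": [[0.2, 0.0], [0.0, 0.2]],
}
safe = path_cost(["A", "C"], cov_map)
risky = path_cost(["A", "B"], cov_map)
```

Precomputing these scalar costs once is what lets the planner avoid running belief propagation at every step.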
In this paper, we consider a Micro Aerial Vehicle (MAV) system teleoperated by a non-expert and introduce a perceptive safety filter that leverages Control Barrier Functions (CBFs) in conjunction with Visual-Inertial Simultaneous Localization and Mapping (VI-SLAM) and dense 3D occupancy mapping to guarantee safe navigation in complex and unstructured environments. Our system relies solely on onboard IMU measurements, stereo infrared images, and depth images and autonomously corrects teleoperated inputs when they are deemed unsafe. We define a point in 3D space as unsafe if it satisfies either of two conditions: (i) it is occupied by an obstacle, or (ii) it remains unmapped. At each time step, an occupancy map of the environment is updated by the VI-SLAM by fusing the onboard measurements, and a CBF is constructed to parameterize the (un)safe region in the 3D space. Given the CBF and state feedback from the VI-SLAM module, a safety filter computes a certified reference that best matches the teleoperation input while satisfying the safety constraint encoded by the CBF. In contrast to existing perception-based safe control frameworks, we directly close the perception-action loop and demonstrate the full capability of safe control in combination with real-time VI-SLAM without any external infrastructure or prior knowledge of the environment. We verify the efficacy of the perceptive safety filter in real-time MAV experiments using exclusively onboard sensing and computation and show that the teleoperated MAV is able to safely navigate through unknown environments despite arbitrary inputs sent by the teleoperator.
https://arxiv.org/abs/2403.04331
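For intuition, a CBF safety filter reduces to a closed form in the simplest 1-D single-integrator case; a sketch of that reduction (the real system solves a constrained problem over a 3-D occupancy map, so this is only an illustrative toy):

```python
def cbf_filter(u_ref, x, x_obs, alpha=1.0):
    """1-D single-integrator safety filter: with h(x) = x_obs - x
    (signed distance to the obstacle) and dynamics x' = u, the CBF
    condition h' >= -alpha * h reduces to u <= alpha * h, so return
    the admissible input closest to the teleoperated reference."""
    h = x_obs - x
    return min(u_ref, alpha * h)

capped = cbf_filter(u_ref=2.0, x=0.0, x_obs=1.0)   # limited to alpha * h
passed = cbf_filter(u_ref=0.3, x=0.0, x_obs=1.0)   # already safe, unchanged
```

The filter only intervenes when the reference would violate the barrier condition, which matches the "correct teleoperated inputs when they are deemed unsafe" behavior described above.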