TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can cut computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting the model run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
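The two core ingredients, ternary weight codes and a run-time activation gate, can be sketched in a few lines. The scheme below (threshold-based ternarization with a mean-magnitude scale, plus a top-k activation gate) is a generic illustration rather than TAT-VPR's exact recipe; `threshold_ratio` and `keep_ratio` are made-up knobs.

```python
def ternarize(weights, threshold_ratio=0.05):
    """Quantize float weights to codes in {-1, 0, +1} plus one shared
    scale. A common generic scheme (not TAT-VPR's exact recipe): zero
    small weights, keep signs elsewhere, scale by the mean magnitude of
    the surviving weights."""
    t = threshold_ratio * max(abs(w) for w in weights)
    codes = [0 if abs(w) <= t else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale

def gated_dot(codes, scale, activations, keep_ratio=0.6):
    """Activation-sparsity gate: keep only the largest activations by
    magnitude and skip the rest, trading accuracy for compute at run
    time (keep_ratio is a made-up knob for illustration)."""
    k = max(1, int(keep_ratio * len(activations)))
    top = sorted(range(len(activations)),
                 key=lambda i: abs(activations[i]), reverse=True)[:k]
    return scale * sum(codes[i] * activations[i] for i in top)

codes, scale = ternarize([0.9, -0.7, 0.01, 0.4])
print(codes)  # [1, -1, 0, 1]
```

Lowering `keep_ratio` is what trades compute for accuracy at run time: fewer products are evaluated per output.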
https://arxiv.org/abs/2505.16447
Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements. Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) as per their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at this https URL.
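As a rough illustration of the pipeline the framework describes (graph construction, attribute-distribution extraction, distributional discrepancy), the sketch below compares two labelings of the same locations via the distribution of per-node neighbor label agreement. This is a simplified stand-in for illustration only; SLAM's actual graph attributes and discrepancy measure differ.

```python
from collections import Counter

def neighbor_agreement(labels, edges):
    """For one labeling, compute each node's fraction of neighbors that
    share its label: a simple graph attribute reflecting the topology of
    the spatial label distribution."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return [sum(labels[n] == labels[node] for n in nbrs) / len(nbrs)
            for node, nbrs in adj.items()]

def total_variation(xs, ys, bins=5):
    """Discrepancy between two attribute distributions via histograms."""
    def hist(vals):
        c = Counter(min(int(v * bins), bins - 1) for v in vals)
        return [c.get(b, 0) / len(vals) for b in range(bins)]
    return 0.5 * sum(abs(a - b) for a, b in zip(hist(xs), hist(ys)))

# Two labelings of a 2x2 grid of locations (edges are grid adjacency):
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
truth = neighbor_agreement(["a", "a", "b", "b"], edges)
predicted = neighbor_agreement(["a", "a", "a", "b"], edges)
print(total_variation(truth, predicted))  # 0.5
```

Note that the two labelings here have the same number of mismatched labels as any single-flip variant would, yet the topology of the agreement distribution is what drives the score, which is the point of graph-based comparison.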
https://arxiv.org/abs/2505.14128
Place recognition is a cornerstone of vehicle navigation and mapping, enabling systems to determine whether a location has been previously visited. This capability is critical for tasks such as loop closure in Simultaneous Localization and Mapping (SLAM) and long-term navigation under varying environmental conditions. In this survey, we comprehensively review recent advancements in place recognition, emphasizing three representative methodological paradigms: Convolutional Neural Network (CNN)-based approaches, Transformer-based frameworks, and cross-modal strategies. We begin by elucidating the significance of place recognition within the broader context of autonomous systems. Subsequently, we trace the evolution of CNN-based methods, highlighting their contributions to robust visual descriptor learning and scalability in large-scale environments. We then examine the emerging class of Transformer-based models, which leverage self-attention mechanisms to capture global dependencies and offer improved generalization across diverse scenes. Furthermore, we discuss cross-modal approaches that integrate heterogeneous data sources such as LiDAR, vision, and text descriptions, thereby enhancing resilience to viewpoint, illumination, and seasonal variations. We also summarize standard datasets and evaluation metrics widely adopted in the literature. Finally, we identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain. A unified framework (code library) of leading-edge place recognition methods and the results of their experimental evaluations are available at this https URL.
https://arxiv.org/abs/2505.14068
We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degree-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degree-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
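To make the SL(4) parameterization concrete: a 4x4 homography with unit determinant acts on homogeneous 3D points, and any positive-determinant 4x4 matrix can be rescaled onto SL(4). The sketch below illustrates only that parameterization, not VGGT-SLAM's actual optimizer or loop-closure machinery.

```python
def det(m):
    """Determinant by Laplace expansion (fine for 4x4 matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def project_to_sl4(h):
    """Rescale a 4x4 homography so det(H) = 1, i.e. a point on the
    SL(4) manifold over which the submap alignment is optimized
    (assumes det > 0; sign handling omitted)."""
    s = det(h) ** (-1.0 / 4.0)
    return [[s * x for x in row] for row in h]

def apply_homography(h, p):
    """Map a 3D point through H in homogeneous coordinates."""
    v = [p[0], p[1], p[2], 1.0]
    w = [sum(h[i][k] * v[k] for k in range(4)) for i in range(4)]
    return [w[0] / w[3], w[1] / w[3], w[2] / w[3]]
```

Because homographies act projectively, a global rescaling of H moves points identically, which is why fixing det(H) = 1 removes exactly one of the 16 matrix parameters and leaves the 15 degrees of freedom the paper cites.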
https://arxiv.org/abs/2505.12549
In embedded systems, robots must perceive and interpret their environment efficiently to operate reliably in real-world conditions. Visual Semantic SLAM (Simultaneous Localization and Mapping) enhances standard SLAM by incorporating semantic information into the map, enabling more informed decision-making. However, implementing such systems on resource-limited hardware involves trade-offs between accuracy, computing efficiency, and power usage. This paper provides a comparative review of recent Semantic Visual SLAM methods with a focus on their applicability to embedded platforms. We analyze three main types of architectures - Geometric SLAM, Neural Radiance Fields (NeRF), and 3D Gaussian Splatting - and evaluate their performance on constrained hardware, specifically the NVIDIA Jetson AGX Orin. We compare their accuracy, segmentation quality, memory usage, and energy consumption. Our results show that methods based on NeRF and Gaussian Splatting achieve high semantic detail but demand substantial computing resources, limiting their use on embedded devices. In contrast, Semantic Geometric SLAM offers a more practical balance between computational cost and accuracy. The review highlights a need for SLAM algorithms that are better adapted to embedded environments, and it discusses key directions for improving their efficiency through algorithm-hardware co-design.
https://arxiv.org/abs/2505.12384
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.
https://arxiv.org/abs/2505.11709
Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate $k$-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
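The soft-assignment step can be illustrated as follows: a descriptor votes for its k nearest visual words with distance-decayed weights instead of a single hard word. The Gaussian weighting and toy vocabulary below are assumptions for illustration; a real system would use an approximate k-means (AKM) vocabulary with many thousands of words.

```python
import math

def soft_assign(desc, words, k=3, sigma=1.0):
    """Soft-assign a descriptor to its k nearest visual words with
    Gaussian distance weights normalized to sum to 1. In a real system
    `words` comes from an approximate k-means (AKM) vocabulary; the
    exp(-d^2 / (2 sigma^2)) weighting is one common choice, not
    necessarily the paper's."""
    d2 = [(i, sum((a - b) ** 2 for a, b in zip(desc, w)))
          for i, w in enumerate(words)]
    nearest = sorted(d2, key=lambda t: t[1])[:k]
    ws = [math.exp(-d / (2 * sigma ** 2)) for _, d in nearest]
    z = sum(ws)
    return [(i, w / z) for (i, _), w in zip(nearest, ws)]

vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
votes = soft_assign([0.1, 0.0], vocab)
```

Soft assignment reduces quantization error for descriptors near word boundaries, which is one reason it helps retrieval precision and recall.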
https://arxiv.org/abs/2505.11620
Contemporary Quranic Orthography (CQO) relies on a precise system of phonetic notation that can be traced back to the early stages of Islam, when the Quran was mainly oral in nature and the first written renderings of it served as memory aids for this oral tradition. The early systems of diacritical marks created on top of the Quranic Consonantal Text (QCT) motivated the creation and further development of a fine-grained system of phonetic notation that represented tajwid, the rules of recitation. We explored the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran, using a fully and accurately encoded digital edition of the Quranic text. For this purpose, we developed a Python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO. The interesting characteristic of these two sets of rules is that they address the complete Quranic text of the Cairo Quran, so they can be used as precise witnesses to study its phonetic and prosodic processes. From a computational point of view, the text of the Cairo Quran can be used as a linchpin to align and compare Quranic manuscripts, due to its richness and completeness. This will let us create a very powerful framework to work with the Arabic script, not just within an isolated text, but automatically exploring a specific textual phenomenon in other connected manuscripts. Having all the texts mapped among each other can serve as a powerful tool to study the nature of the notation systems of diacritics added to the consonantal skeleton.
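A minimal analogue of the described Python module, stripping the combining-mark (diacritic) layer to recover the consonantal skeleton, can be written with the standard `unicodedata` module. This is a coarse sketch: the paper's tool is far more fine-grained and can distinguish and re-add the tajwid layer specifically.

```python
import unicodedata

def strip_diacritic_layer(text):
    """Remove the combining-mark layer (harakat and other diacritics),
    keeping only the base consonantal skeleton. A coarse analogue of
    removing an orthographic layer; the paper's module distinguishes
    tajwid marks from basic vocalization, which this sketch does not."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "bismi" written with kasra/sukun marks reduces to bare consonants:
word = "\u0628\u0650\u0633\u0652\u0645\u0650"   # بِسْمِ
print(strip_diacritic_layer(word) == "\u0628\u0633\u0645")  # True
```

Re-adding a layer is the harder direction, since it requires an aligned source text carrying the marks, which is exactly what a fully encoded Cairo Quran edition provides.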
https://arxiv.org/abs/2505.11379
Simultaneous localization and mapping (SLAM) for mobile robots remains challenging in forest or arboreal fruit farming environments, where tree canopies obstruct Global Navigation Satellite System (GNSS) signals. Unlike indoor settings, these agricultural environments pose additional challenges due to outdoor variables such as foliage motion and illumination variability. This paper proposes a solution based on 2D lidar measurements, which requires less processing and storage, and is more cost-effective, than approaches that employ 3D lidars. Utilizing the modified Hausdorff distance (MHD) metric, the method solves scan matching robustly and with high accuracy without needing sophisticated feature extraction. The method's robustness was validated using public datasets and considering various metrics, facilitating meaningful comparisons for future research. Comparative evaluations against state-of-the-art algorithms, particularly A-LOAM, show that the proposed approach achieves lower positional and angular errors while maintaining higher accuracy and resilience in GNSS-denied settings. This work contributes to the advancement of precision agriculture by enabling reliable and autonomous navigation in challenging outdoor environments.
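The modified Hausdorff distance itself is compact enough to state directly. Below is its standard formulation (the max of the two directed mean nearest-neighbor distances) for 2D scan points; a scan matcher would minimize this score over candidate transforms of one scan against the reference.

```python
import math

def mhd(a, b):
    """Modified Hausdorff distance between two point sets: the max of
    the two directed mean nearest-neighbor distances. Averaging (rather
    than taking the worst point, as the classical Hausdorff distance
    does) makes the score far less sensitive to outlier points."""
    def directed(xs, ys):
        return sum(min(math.dist(p, q) for q in ys) for p in xs) / len(xs)
    return max(directed(a, b), directed(b, a))

# A scan matcher would evaluate mhd(transform(scan), reference) over
# candidate transforms and keep the minimizer.
print(mhd([(0.0, 0.0)], [(3.0, 4.0)]))  # 5.0
```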
https://arxiv.org/abs/2505.10847
We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments, includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantically segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 910 trajectories across 70 environments, resulting in 1.5 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase for data collection will be made publicly available upon acceptance. Webpage: this https URL
https://arxiv.org/abs/2505.10696
The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: this https URL.
https://arxiv.org/abs/2505.09915
We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human's mental expectation and a Large Language Model's (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long-form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, human content reviewers' expectations aligned fully with the AI's output 53.8% of the time, with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space, combining spatial volume (overall trait importance) with vertex alignment (individual trait relevance), enabled the LLMaaJ to optimize on human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work, deployed at the US Open 2024, has since been used across other live events within sports and entertainment.
https://arxiv.org/abs/2505.09024
Indoor localization faces persistent challenges in achieving high accuracy, particularly in GPS-deprived environments. This study unveils a cutting-edge handheld indoor localization system that integrates 2D LiDAR and IMU sensors, delivering enhanced high-velocity precision mapping, computational efficiency, and real-time adaptability. Unlike 3D LiDAR systems, it excels with rapid processing, low-cost scalability, and robust performance, setting new standards for emergency response, autonomous navigation, and industrial automation. Enhanced with a CNN-driven object detection framework and optimized through Cartographer SLAM (simultaneous localization and mapping) in ROS, the system significantly reduces Absolute Trajectory Error (ATE) by 21.03%, achieving exceptional precision compared to state-of-the-art approaches like SC-ALOAM, with a mean x-position error of -0.884 meters (1.976 meters). The integration of CNN-based object detection ensures robustness in mapping and localization, even in cluttered or dynamic environments, outperforming existing methods by 26.09%. These advancements establish the system as a reliable, scalable solution for high-precision localization in challenging indoor scenarios.
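For reference, ATE, the headline metric here, is conventionally the RMSE of translational differences between time-aligned estimated and ground-truth positions. A minimal sketch (omitting the trajectory-alignment step that a full ATE pipeline performs first):

```python
import math

def absolute_trajectory_error(est, gt):
    """RMSE of translational differences between time-aligned estimated
    and ground-truth positions. Real ATE pipelines first align the two
    trajectories (e.g., with a rigid-body Umeyama fit), omitted here."""
    assert len(est) == len(gt)
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg))
          for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))

print(absolute_trajectory_error([(3.0, 4.0)], [(0.0, 0.0)]))  # 5.0
```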
https://arxiv.org/abs/2505.08388
Distributed LiDAR SLAM is crucial for achieving efficient robot autonomy and improving the scalability of mapping. However, two issues need to be considered when applying it in field environments: one is resource limitation, and the other is inter/intra-robot association. The resource limitation issue arises when the data size exceeds the processing capacity of the network or memory, especially when utilizing communication systems or onboard computers in the field. The inter/intra-robot association issue occurs due to the narrow convergence region of ICP under large viewpoint differences, triggering many false positive loops and ultimately resulting in an inconsistent global map for multi-robot systems. To tackle these problems, we propose a distributed LiDAR SLAM framework designed for versatile field applications, called SKiD-SLAM. Extending our previous work that solely focused on lightweight place recognition and fast and robust global registration, we present a multi-robot mapping framework that focuses on robust and lightweight inter-robot loop closure in distributed LiDAR SLAM. Through various environmental experiments, we demonstrate that our method is more robust and lightweight compared to other state-of-the-art distributed SLAM approaches, overcoming resource limitation and inter/intra-robot association issues. Also, we validated the field applicability of our approach through mapping experiments in real-world planetary emulation terrain and cave environments, which are in-house datasets. Our code will be available at this https URL.
https://arxiv.org/abs/2505.08230
As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark, an evaluation setting that has recently been gaining popularity for 3D reconstruction across different altitudes.
https://arxiv.org/abs/2505.08013
Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new environment. In this paper, we introduce a continual learning framework for LPR via Knowledge Distillation and Fusion (KDF) to alleviate forgetting. Inspired by the ranking process of place recognition retrieval, we present a ranking-aware knowledge distillation loss that encourages the network to preserve the high-level place recognition knowledge. We also introduce a knowledge fusion module to integrate the knowledge of old and new models for LiDAR place recognition. Our extensive experiments demonstrate that KDF can be applied to different networks to overcome catastrophic forgetting, surpassing the state-of-the-art methods in terms of mean Recall@1 and forgetting score.
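A generic stand-in for a ranking-aware distillation loss is the KL divergence between teacher and student softmax distributions over retrieval scores, which penalizes the student for reordering the teacher's ranking of database candidates. The paper's exact loss may differ; this sketch only illustrates the idea of distilling ranking behavior rather than raw features.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def ranking_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL(teacher || student) over softmax-normalized retrieval scores.
    The student is penalized for deviating from the teacher's ranking of
    database candidates, preserving high-level retrieval knowledge."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student reproduces the teacher's score distribution exactly and grows as the implied ranking diverges, which is what counteracts forgetting of previously trained places.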
https://arxiv.org/abs/2505.07198
General-purpose large language models (LLMs), despite their broad capabilities accrued from open-world data, frequently exhibit suboptimal performance when confronted with the nuanced and specialized demands inherent in real-time telecommunications applications. This investigation addresses this critical limitation through the meticulous fine-tuning of TSLAM-Mini, a compact (3.8-billion-parameter) causal language model developed by NetoAI and architecturally derived from Phi-4 Mini Instruct 4B. The fine-tuning regimen leverages a bespoke dataset comprising 100,000 samples, strategically engineered to address 20 pivotal telecommunications use-cases, encompassing domains such as Network Fundamentals, IP Routing, MPLS, Network Security, Automation, OSS/BSS, RAN, Mobile Core, Satellite Communications, and Ethical AI. This dataset was curated utilizing NetoAI's DigiTwin platform, enriched with granular insights from veteran network Subject Matter Experts (SMEs) and authoritative RFC documents, thereby capturing high-fidelity representations of real-world network dynamics through simulations inspired by digital twin paradigms. Employing Quantized Low-Rank Adaptation (QLoRA), a state-of-the-art Parameter Efficient Fine-Tuning (PEFT) technique, we achieved substantial training efficiency and enabled prospective deployment on resource-constrained hardware. A novel evaluation framework, predicated on a high-capacity LLM (Qwen3-235B-A22B) functioning as an automated adjudicator, was instituted to rigorously assess instruction-following fidelity and response quality across the specified telecom use-cases. Empirical results unequivocally demonstrate TSLAM-Mini's superior aptitude in telecom-centric applications, underscoring the profound efficacy of domain-specific datasets and PEFT methodologies for advancing intelligent network management.
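The LoRA side of QLoRA adds a trainable low-rank update to a frozen weight matrix; QLoRA additionally keeps those frozen weights quantized (typically 4-bit). Below is a minimal dequantized sketch of the forward pass, with the common shape and `alpha` scaling conventions, not NetoAI's exact configuration.

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """y = x(W + alpha * B A): the frozen weight W plus a trainable
    low-rank update, the core of (Q)LoRA. Shapes (row-vector convention):
    x is 1 x d_in, W is d_in x d_out, B is d_in x r, A is r x d_out,
    with rank r << min(d_in, d_out), so only B and A are trained."""
    def vecmat(v, M):  # row vector times matrix
        return [sum(v[i] * M[i][j] for i in range(len(M)))
                for j in range(len(M[0]))]
    base = vecmat(x, W)
    update = vecmat(vecmat(x, B), A)   # go through the rank-r bottleneck
    return [b + alpha * u for b, u in zip(base, update)]

# Rank-1 update on a 2x2 identity weight:
print(lora_forward([1, 2], [[1, 0], [0, 1]], [[1, 0]], [[1], [1]]))
```

The parameter savings come from training only r(d_in + d_out) values per adapted matrix instead of d_in * d_out, which is what makes fine-tuning feasible on resource-constrained hardware.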
https://arxiv.org/abs/2505.07877
Robot autonomy in unknown, GPS-denied, and complex underground environments requires real-time, robust, and accurate onboard pose estimation and mapping for reliable operations. This becomes particularly challenging in perception-degraded subterranean conditions under harsh environmental factors, including darkness, dust, and geometrically self-similar structures. This paper details CompSLAM, a highly resilient and hierarchical multi-modal localization and mapping framework designed to address these challenges. Its flexible architecture achieves resilience through redundancy by leveraging the complementary nature of pose estimates derived from diverse sensor modalities. Developed during the DARPA Subterranean Challenge, CompSLAM was successfully deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. Furthermore, it has proven to be a reliable odometry and mapping solution in various subsequent projects, with extensions enabling multi-robot map sharing for marsupial robotic deployments and collaborative mapping. This paper also introduces a comprehensive dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course. This dataset evaluates CompSLAM's robustness to sensor degradations as the robot traverses 740 meters in an environment characterized by highly variable geometries and demanding lighting conditions. The CompSLAM code and the DARPA SubT Finals dataset are made publicly available for the benefit of the robotics community.
https://arxiv.org/abs/2505.06483
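CompSLAM's resilience-through-redundancy idea — prefer the most informative pose source, but fall back when it degrades — can be sketched as a priority cascade over complementary estimators. The estimator names, health flags, and pose tuples below are illustrative placeholders, not the framework's actual interfaces:

```python
def select_pose(estimates):
    """Pick the pose from the highest-priority healthy estimator.

    estimates: list of (name, pose, healthy) tuples ordered from most
    to least preferred, e.g. LiDAR scan matching first, then visual
    odometry, then kinematic/inertial dead reckoning as a last resort.
    """
    for name, pose, healthy in estimates:
        if healthy:
            return name, pose
    raise RuntimeError("all odometry sources degraded")

# Example: LiDAR degrades in a geometrically self-similar corridor,
# so the cascade falls through to visual odometry.
sources = [
    ("lidar",     (12.0, 3.1, 0.00), False),  # degenerate geometry
    ("visual",    (11.8, 3.0, 0.01), True),
    ("kinematic", (11.5, 2.8, 0.05), True),
]
name, pose = select_pose(sources)
```

The hierarchical structure means a failure of one modality (e.g. dust blinding the camera, or self-similar walls confusing scan matching) degrades accuracy gracefully rather than breaking the estimate outright.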
Accurate localization is crucial for water robotics, yet traditional onboard Global Navigation Satellite System (GNSS) approaches are difficult to apply or ineffective due to signal reflection at the water's surface and the high cost of aquatic GNSS receivers. Existing approaches, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic-based methods, face challenges like error accumulation and high computational complexity. Therefore, a more efficient and scalable solution remains necessary. This paper proposes an alternative approach that leverages an aerial drone equipped with GNSS localization to track and localize a marine robot once it is near the surface of the water. Our results show that this novel adaptation enables accurate single- and multi-robot marine localization.
https://arxiv.org/abs/2505.04095
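The geometric core of such a drone-assisted scheme is simple: once the robot's position relative to the hovering drone is known (e.g. from the drone's downward camera), the drone's GNSS fix can be shifted by that local offset. The flat-earth approximation and the frame conventions below are assumptions for illustration, not the paper's implementation:

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical model

def offset_latlon(drone_lat, drone_lon, east_m, north_m):
    """Shift a GNSS fix by a small local east/north offset in meters.

    Valid only for offsets of tens of meters, which matches a drone
    hovering near the surfaced marine robot.
    """
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(drone_lat))))
    return drone_lat + dlat, drone_lon + dlon

# Drone GNSS fix plus a camera-derived offset to the robot below:
# 5 m east and 3 m south of the drone (values are made up).
robot_lat, robot_lon = offset_latlon(41.3851, 2.1734,
                                     east_m=5.0, north_m=-3.0)
```

Because one GNSS-equipped drone can service several surface robots in turn, the per-robot hardware cost stays low, which is the scalability argument the abstract makes.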
Despite significant progress in autonomous navigation, a critical gap remains in ensuring reliable localization in hazardous environments such as tunnels, urban disaster zones, and underground structures. Tunnels present a uniquely difficult scenario: they are not only prone to GNSS signal loss, but also offer few features for visual localization due to their repetitive walls and poor lighting. These conditions degrade conventional vision-based and LiDAR-based systems, which rely on distinguishable environmental features. To address this, we propose a novel sensor fusion framework that integrates a thermal camera with a LiDAR to enable robust localization in tunnels and other perceptually degraded environments. The thermal camera provides resilience in low-light or smoke conditions, while the LiDAR delivers precise depth perception and structural awareness. By combining these sensors, our framework ensures continuous and accurate localization across diverse and dynamic environments. We use an Extended Kalman Filter (EKF) to fuse multi-sensor inputs, and leverage visual odometry and SLAM (Simultaneous Localization and Mapping) techniques to process the sensor data, enabling robust motion estimation and mapping even in GNSS-denied environments. This fusion of sensor modalities not only enhances system resilience but also provides a scalable solution for cyber-physical systems in connected and autonomous vehicles (CAVs). To validate the framework, we conduct tests in a tunnel environment, simulating sensor degradation and visibility challenges. The results demonstrate that our method sustains accurate localization where standard approaches deteriorate due to the tunnel's featureless geometry. The framework's versatility makes it a promising solution for autonomous vehicles, inspection robots, and other cyber-physical systems operating in constrained, perceptually poor environments.
https://arxiv.org/abs/2505.03565
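The EKF fusion described in the abstract follows the standard predict-update cycle: motion sources such as visual odometry drive the prediction step, and complementary measurements correct it. As a deliberately minimal scalar Kalman filter — the paper's filter is a full multi-state EKF, and the noise values here are invented for the example:

```python
class ScalarKalman:
    """1-D Kalman filter: a single position state with additive noise."""

    def __init__(self, x0, p0, q, r):
        self.x = x0  # state estimate
        self.p = p0  # estimate variance
        self.q = q   # process noise (odometry drift per step)
        self.r = r   # measurement noise

    def predict(self, u):
        """Propagate with an odometry increment u; uncertainty grows."""
        self.x += u
        self.p += self.q

    def update(self, z):
        """Fuse a direct position measurement z; uncertainty shrinks."""
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x

kf = ScalarKalman(x0=0.0, p0=1.0, q=0.1, r=0.5)
kf.predict(1.0)          # thermal/visual odometry: moved ~1 m
fused = kf.update(1.4)   # LiDAR-derived position correction at 1.4 m
```

The fused estimate lands between the odometry prediction and the measurement, weighted by their relative uncertainties — which is exactly why fusing a thermal camera with LiDAR keeps the estimate bounded when one modality degrades in a featureless tunnel.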