The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of old words, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated on two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
https://arxiv.org/abs/2601.16138
The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO pipeline. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
https://arxiv.org/abs/2601.16020
Autonomous navigation for nano-scale unmanned aerial vehicles (nano-UAVs) is governed by extreme Size, Weight, and Power (SWaP) constraints (weight under 50 g with a sub-100 mW onboard processor), distinguishing it fundamentally from standard robotic paradigms. This review synthesizes the state of the art in sensing, computing, and control architectures designed specifically for these sub-100 mW computational envelopes. We critically analyse the transition from classical geometry-based methods to emerging "Edge AI" paradigms, including quantized deep neural networks deployed on ultra-low-power System-on-Chips (SoCs) and neuromorphic event-based control. Beyond algorithms, we evaluate the hardware-software co-design requisite for autonomy, covering advancements in dense optical flow, optimized Simultaneous Localization and Mapping (SLAM), and learning-based flight control. While significant progress has been observed in visual navigation and relative pose estimation, our analysis reveals persistent gaps in long-term endurance, robust obstacle avoidance in dynamic environments, and the "Sim-to-Real" transfer of reinforcement learning policies. This survey provides a roadmap for bridging these gaps, advocating for hybrid architectures that fuse lightweight classical control with data-driven perception to enable fully autonomous, agile nano-UAVs in GPS-denied environments.
https://arxiv.org/abs/2601.13252
This paper proposes R-VoxelMap, a novel voxel mapping method that constructs accurate voxel maps using a geometry-driven recursive plane fitting strategy to enhance the localization accuracy of online LiDAR odometry. VoxelMap and its variants typically fit and check planes using all points in a voxel, which may lead to plane parameter deviation caused by outliers, over-segmentation of large planes, and incorrect merging across different physical planes. To address these issues, R-VoxelMap utilizes a geometry-driven recursive construction strategy based on an outlier detect-and-reuse pipeline. Specifically, for each voxel, accurate planes are first fitted while separating outliers using random sample consensus (RANSAC). The remaining outliers are then propagated to deeper octree levels for recursive processing, ensuring a detailed representation of the environment. In addition, a point-distribution-based validity check algorithm is devised to prevent erroneous plane merging. Extensive experiments on diverse open-source LiDAR(-inertial) simultaneous localization and mapping (SLAM) datasets validate that our method achieves higher accuracy than other state-of-the-art approaches, with comparable efficiency and memory usage. Code will be available on GitHub.
https://arxiv.org/abs/2601.12377
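The fit-then-propagate idea above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: it fits a RANSAC plane per point set and re-runs the fit on the leftover outliers, standing in for the spatial octree subdivision R-VoxelMap performs; the thresholds and helper names are invented for the example.

```python
import numpy as np

def fit_plane_ransac(points, dist_thresh=0.05, iters=100, rng=None):
    """Fit a plane (unit normal n, offset d with n.x + d = 0) via RANSAC.

    Returns (n, d, inlier_mask); outliers are whatever falls outside the band.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    best_mask, best_model = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) minimal sample
            continue
        n = n / norm
        d = -n @ p0
        mask = np.abs(points @ n + d) < dist_thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (n, d)
    return best_model[0], best_model[1], best_mask

def build_voxel(points, depth=0, max_depth=3, min_points=10):
    """Recursively fit planes; each fit's outliers are processed one level deeper."""
    if len(points) < min_points or depth > max_depth:
        return []
    n, d, inliers = fit_plane_ransac(points)
    planes = [(n, d, points[inliers])]
    planes += build_voxel(points[~inliers], depth + 1, max_depth, min_points)
    return planes
```

Because each recursion level only sees the previous level's outliers, a large dominant plane is captured once instead of being fragmented, while residual structure still gets its own (finer) fits.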
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving a 2.7x improvement in Chamfer distance over the state of the art.
https://arxiv.org/abs/2601.11514
Localization and mapping are core perceptual capabilities for underwater robots. Stereo cameras provide a low-cost means of directly estimating metric depth to support these tasks. However, despite recent advances in stereo depth estimation on land, computing depth from image pairs in underwater scenes remains challenging. In underwater environments, images are degraded by light attenuation, visual artifacts, and dynamic lighting conditions. Furthermore, real-world underwater scenes frequently lack rich texture useful for stereo depth estimation and 3D reconstruction. As a result, stereo estimation networks trained on in-air data cannot transfer directly to the underwater domain. In addition, there is a lack of real-world underwater stereo datasets for supervised training of neural networks. Poor underwater depth estimation is compounded in stereo-based Simultaneous Localization and Mapping (SLAM) algorithms, making it a fundamental challenge for underwater robot perception. To address these challenges, we propose a novel framework that enables sim-to-real training of underwater stereo disparity estimation networks using simulated data and self-supervised finetuning. We leverage our learned depth predictions to develop \algname, a novel framework for real-time underwater SLAM that fuses stereo cameras with IMU, barometric, and Doppler Velocity Log (DVL) measurements. Lastly, we collect a challenging real-world dataset of shipwreck surveys using an underwater robot. Our dataset features over 24,000 stereo pairs, along with high-quality, dense photogrammetry models and reference trajectories for evaluation. Through extensive experiments, we demonstrate the advantages of the proposed training approach on real-world data for improving stereo estimation in the underwater domain and for enabling accurate trajectory estimation and 3D reconstruction of complex shipwreck sites.
https://arxiv.org/abs/2601.10814
Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
https://arxiv.org/abs/2601.09665
In complex environments, autonomous robot navigation and environmental perception place higher demands on SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat-source features in the thermal channel to instantly identify high-temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications such as rapid disaster assessment and industrial preventive maintenance.
https://arxiv.org/abs/2601.09578
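The projection-and-tagging step described above (LiDAR points projected into a fused visible/thermal image, then thresholded on the thermal channel) might look roughly like the sketch below. The pinhole model, the channel layout, and the `hot_thresh` value are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def label_hot_points(points_cam, fused_img, K, hot_thresh=200):
    """Tag 3D points (already in the camera frame) that land on hot pixels.

    fused_img: H x W x 2 array (channel 0: visible intensity, channel 1: thermal).
    K: 3x3 pinhole intrinsics. Returns a boolean per-point "high temperature" label.
    """
    H, W = fused_img.shape[:2]
    labels = np.zeros(len(points_cam), dtype=bool)
    in_front = points_cam[:, 2] > 0                 # only points ahead of the camera
    uvw = (K @ points_cam.T).T                      # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    thermal = fused_img[v[visible], u[visible], 1]  # sample the thermal channel
    labels[visible] = thermal > hot_thresh
    return labels
```

In a full system the labels would be attached to the map points as a semantic layer; here they are simply returned for inspection.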
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend substantial effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and several of the associated techniques have been accepted for academic publication. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.
https://arxiv.org/abs/2601.09385
We present an efficient incremental SLAM back-end that achieves the accuracy of full batch optimization while substantially reducing computational cost. The proposed approach combines two complementary ideas: information-guided gating (IGG) and selective partial optimization (SPO). IGG employs an information-theoretic criterion based on the log-determinant of the information matrix to quantify the contribution of new measurements, triggering global optimization only when a significant information gain is observed. This avoids unnecessary relinearization and factorization when incoming data provide little additional information. SPO executes multi-iteration Gauss-Newton (GN) updates but restricts each iteration to the subset of variables most affected by the new measurements, dynamically refining this active set until convergence. Together, these mechanisms retain all measurements to preserve global consistency while focusing computation on parts of the graph where it yields the greatest benefit. We provide a theoretical analysis showing that the proposed approach maintains the convergence guarantees of full GN. Extensive experiments on benchmark SLAM datasets show that our approach consistently matches the estimation accuracy of batch solvers, while achieving significant computational savings compared to conventional incremental approaches. The results indicate that the proposed approach offers a principled balance between accuracy and efficiency, making it a robust and scalable solution for real-time operation in dynamic, data-rich environments.
https://arxiv.org/abs/2601.08110
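The IGG criterion lends itself to a compact sketch: accumulate an information matrix, and gate global optimization on the log-determinant gain that a new measurement's Jacobian would contribute. The gain threshold below is an arbitrary placeholder, and this toy gate is only a plausible reading of the criterion, not the paper's exact implementation.

```python
import numpy as np

def logdet(M):
    """Log-determinant of a positive-definite matrix, computed stably."""
    sign, ld = np.linalg.slogdet(M)
    assert sign > 0, "information matrix must be positive definite"
    return ld

def information_gain(Lambda, J):
    """Log-det gain from adding a measurement with Jacobian J (rows = residual dims).

    Lambda is the current information matrix (accumulated J^T J of past measurements).
    """
    return logdet(Lambda + J.T @ J) - logdet(Lambda)

def should_optimize(Lambda, J, gain_thresh=0.1):
    """IGG-style gate: trigger a global update only on significant information gain."""
    return information_gain(Lambda, J) > gain_thresh
```

A measurement that strongly constrains a variable yields a large log-det increase and opens the gate; a near-redundant measurement barely moves the determinant and is absorbed without triggering relinearization.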
LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed light on this aspect, we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of $\sim$6k atomic verses (ayat). Building on these resources, we develop an agentic Qur'an-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.
https://arxiv.org/abs/2601.07528
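The iterative evidence-seeking loop of an agentic RAG system can be outlined generically as below. The control flow is a plausible sketch with `retrieve` and `generate` left as caller-supplied stubs; it is not taken from the paper's implementation, which uses structured tool calls against the verse-level corpus.

```python
def agentic_answer(question, retrieve, generate, max_rounds=3):
    """Minimal agentic-RAG loop: iteratively seek evidence, then revise the answer.

    retrieve(query) -> list of evidence passages.
    generate(question, evidence) -> (answer, done): done=True ends the loop early.
    """
    evidence = []
    answer = None
    for _ in range(max_rounds):
        # Refine the retrieval query with the current draft answer, if any.
        query = question if answer is None else f"{question} | draft: {answer}"
        evidence += retrieve(query)
        answer, done = generate(question, evidence)
        if done:
            break
    return answer, evidence
```

The key difference from single-shot RAG is that retrieval re-runs after each draft, so the model can seek evidence targeted at its own tentative answer before committing.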
Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping (SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.
https://arxiv.org/abs/2601.11617
This paper presents InsSo3D, an accurate and efficient method for large-scale 3D Simultaneous Localisation and Mapping (SLAM) using a 3D Sonar and an Inertial Navigation System (INS). Unlike traditional sonar, which produces 2D images containing range and azimuth information but lacks elevation information, 3D Sonar produces a 3D point cloud, which therefore does not suffer from elevation ambiguity. We introduce a robust and modern SLAM framework adapted to the 3D Sonar data using INS as prior, detecting loop closure and performing pose graph optimisation. We evaluated InsSo3D performance inside a test tank with access to ground truth data and in an outdoor flooded quarry. Comparisons to reference trajectories and maps obtained from an underwater motion tracking system and visual Structure From Motion (SFM) demonstrate that InsSo3D efficiently corrects odometry drift. The average trajectory error is below 21 cm during a 50-minute-long mission, producing a map of 10 m by 20 m with a 9 cm average reconstruction error, enabling safe inspection of natural or artificial underwater structures even in murky water conditions.
https://arxiv.org/abs/2601.05805
We present a real-time tracking SLAM system that unifies efficient camera tracking with photorealistic feature-enriched mapping using 3D Gaussian Splatting (3DGS). Our main contribution is integrating dense feature rasterization into novel-view synthesis, aligned with a visual foundation model. This yields strong semantics, going beyond basic RGB-D input and aiding both tracking and mapping accuracy. Unlike previous semantic SLAM approaches, which embed pre-defined class labels, FeatureSLAM enables entirely new downstream tasks via free-viewpoint, open-set segmentation. Across standard benchmarks, our method achieves real-time tracking on par with state-of-the-art systems while improving tracking stability and map fidelity without prohibitive compute. Quantitatively, we obtain 9% lower pose error and 8% higher mapping accuracy compared to recent fixed-set SLAM baselines. Our results confirm that real-time feature-embedded SLAM is not only valuable for enabling new downstream applications; it also improves the performance of the underlying tracking and mapping subsystems, providing semantic and language masking results on par with offline 3DGS models, alongside state-of-the-art tracking, depth, and RGB rendering.
https://arxiv.org/abs/2601.05738
Simultaneous Localization and Mapping (SLAM) is an essential technology for the efficiency and reliability of unmanned robotic exploration missions. Because onboard computational capability and communication bandwidth are critically limited while the point cloud data handled by SLAM is large, data compression methods have attracted attention. To address this problem, we propose a new method for compressing point cloud maps by exploiting the Discrete Fourier Transform (DFT). The proposed technique converts the Digital Elevation Model (DEM) to a frequency-domain 2D image and omits its high-frequency components, focusing on the exploration of gradual terrains such as planets and deserts. Unlike terrains with detailed structures, such as artificial environments, gradual terrains derive little of their representation from high-frequency components. Thus, this method effectively compresses data size without significant degradation of the point cloud. We evaluated the method in terms of compression rate and accuracy using camera sequences of two terrains with different elevation profiles.
https://arxiv.org/abs/2601.04551
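A minimal sketch of the DFT-based idea, assuming the DEM is stored as a 2D height array: transform, keep only a central low-frequency block, and zero-pad before inverting on reconstruction. The `keep_frac` parameter and the square block layout are illustrative choices, not the paper's.

```python
import numpy as np

def compress_dem(dem, keep_frac=0.1):
    """Compress a DEM by retaining only the lowest-frequency DFT coefficients.

    Gradual terrain concentrates its energy at low frequencies, so the
    discarded high-frequency coefficients carry little elevation information.
    """
    F = np.fft.fftshift(np.fft.fft2(dem))          # DC moved to the array center
    h, w = dem.shape
    kh, kw = max(1, int(h * keep_frac)), max(1, int(w * keep_frac))
    cy, cx = h // 2, w // 2
    block = F[cy - kh // 2: cy + (kh + 1) // 2,
              cx - kw // 2: cx + (kw + 1) // 2]
    return block, dem.shape

def decompress_dem(block, shape):
    """Zero-pad the retained block back to full size and invert the DFT."""
    h, w = shape
    F = np.zeros((h, w), dtype=complex)
    kh, kw = block.shape
    cy, cx = h // 2, w // 2
    F[cy - kh // 2: cy + (kh + 1) // 2,
      cx - kw // 2: cx + (kw + 1) // 2] = block
    return np.fft.ifft2(np.fft.ifftshift(F)).real
```

With `keep_frac=0.1` the stored block is about 1% of the original coefficient count; for a smooth, low-frequency surface the round-trip error is negligible, while fine structure (as in artificial environments) would be lost.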
Previous on-manifold approaches to continuum robot state estimation have typically adopted simplified Cosserat rod models, which cannot directly account for actuation inputs or external loads. We introduce a general framework that incorporates uncertainty models for actuation (e.g., tendon tensions), applied forces and moments, process noise, boundary conditions, and arbitrary backbone measurements. By adding temporal priors across time steps, our method additionally performs joint estimation in both the spatial (arclength) and temporal domains, enabling full \textit{spacetime} state estimation. Discretizing the arclength domain yields a factor graph representation of the continuum robot model, which can be exploited for fast batch sparse nonlinear optimization in the style of SLAM. The framework is general and applies to a broad class of continuum robots; as illustrative cases, we show (i) tendon-driven robots in simulation, where we demonstrate real-time kinematics with uncertainty, tip force sensing from position feedback, and distributed load estimation from backbone strain, and (ii) a surgical concentric tube robot in experiment, where we validate accurate kinematics and tip force estimation, highlighting potential for surgical palpation.
https://arxiv.org/abs/2601.04493
Loop closure is crucial for maintaining the accuracy and consistency of visual SLAM. We propose a method to improve loop closure performance in DPV-SLAM. Our approach integrates AnyLoc, a learning-based visual place recognition technique, as a replacement for the classical Bag of Visual Words (BoVW) loop detection method. In contrast to BoVW, which relies on handcrafted features, AnyLoc utilizes deep feature representations, enabling more robust image retrieval across diverse viewpoints and lighting conditions. Furthermore, we propose an adaptive mechanism that dynamically adjusts the similarity threshold based on environmental conditions, removing the need for manual tuning. Experiments on both indoor and outdoor datasets demonstrate that our method significantly outperforms the original DPV-SLAM in terms of loop closure accuracy and robustness. The proposed method offers a practical and scalable solution for enhancing loop closure performance in modern SLAM systems.
https://arxiv.org/abs/2601.02723
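One plausible form of such an adaptive threshold, shown purely for illustration, is to accept a loop candidate only when its similarity score stands out from the recent score distribution. The sliding window and the k-sigma rule below are assumptions, not the paper's mechanism.

```python
from collections import deque

import numpy as np

class AdaptiveLoopGate:
    """Accept a loop-closure candidate when its similarity score is an outlier
    relative to recently observed scores, instead of using a fixed threshold."""

    def __init__(self, window=100, k_sigma=3.0, min_samples=10):
        self.scores = deque(maxlen=window)  # rolling history of candidate scores
        self.k_sigma = k_sigma
        self.min_samples = min_samples

    def is_loop(self, score):
        if len(self.scores) >= self.min_samples:
            mu = np.mean(self.scores)
            sigma = np.std(self.scores)
            accept = score > mu + self.k_sigma * max(sigma, 1e-6)
        else:
            accept = False  # not enough context to judge yet
        self.scores.append(score)
        return accept
```

Because the gate tracks the ambient score level, it tightens automatically in visually repetitive environments (where background similarity runs high) and loosens in texture-poor ones, which is the behavior a fixed threshold cannot provide.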
Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: this https URL
https://arxiv.org/abs/2601.02309
Accurate altitude estimation and reliable floor recognition are critical for mobile robot localization and navigation within complex multi-storey environments. In this paper, we present a robust, low-cost vertical estimation framework leveraging differential barometric sensing integrated within a fully ROS-compliant software package. Our system simultaneously publishes real-time altitude data from both a stationary base station and a mobile sensor, enabling precise and drift-free vertical localization. Empirical evaluations conducted in challenging scenarios, such as fully enclosed stairwells and elevators, demonstrate that our proposed barometric pipeline achieves sub-meter vertical accuracy (RMSE: 0.29 m) and perfect (100%) floor-level identification. In contrast, our results confirm that standalone height estimates, obtained solely from visual- or LiDAR-based SLAM odometry, are insufficient for reliable vertical localization. The proposed ROS-compatible barometric module thus provides a practical and cost-effective solution for robust vertical awareness in real-world robotic deployments. The implementation of our method is released as open source at this https URL.
https://arxiv.org/abs/2601.02184
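The differential idea can be illustrated with the standard International Standard Atmosphere barometric formula: evaluating the mobile sensor's pressure against the base station's concurrent reading cancels the weather-driven pressure drift common to both sensors. The formula is the textbook one; the package's actual pipeline may filter and calibrate beyond this sketch.

```python
def pressure_to_altitude(p_hpa, p0_hpa=1013.25):
    """ISA barometric formula: altitude (m) above the reference pressure p0.

    Valid in the troposphere (below roughly 11 km).
    """
    return 44330.0 * (1.0 - (p_hpa / p0_hpa) ** (1.0 / 5.255))

def relative_altitude(p_mobile_hpa, p_base_hpa):
    """Differential sensing: height of the mobile sensor above the base station.

    Using the base station's live reading as the reference pressure removes
    the slow atmospheric drift that would corrupt a single-sensor estimate.
    """
    return pressure_to_altitude(p_mobile_hpa, p0_hpa=p_base_hpa)
```

Near sea level the pressure gradient is roughly 0.12 hPa per meter, so a one-floor climb of about 3 m shows up as a ~0.36 hPa differential, comfortably above the noise floor of modern barometric sensors.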
Visual challenges in underwater environments significantly hinder the accuracy of vision-based localisation and high-fidelity dense reconstruction. In this paper, we propose VISO, a robust underwater SLAM system that fuses a stereo camera, an inertial measurement unit (IMU), and a 3D sonar to achieve accurate 6-DoF localisation and enable efficient dense 3D reconstruction with high photometric fidelity. We introduce a coarse-to-fine online calibration approach for extrinsic parameter estimation between the 3D sonar and the camera. Additionally, a photometric rendering strategy is proposed for the 3D sonar point cloud to enrich the sonar map with visual information. Extensive experiments in a laboratory tank and an open lake demonstrate that VISO surpasses current state-of-the-art underwater and visual-based SLAM algorithms in terms of localisation robustness and accuracy, while also exhibiting real-time dense 3D reconstruction performance comparable to the offline dense mapping method.
https://arxiv.org/abs/2601.01144