People can respond to feedback and guidance in different ways, and it is important for robots to personalize their interactions and utilize verbal and nonverbal communication cues. We aim to understand how older adults respond to different cadences of verbal and nonverbal feedback of a robot exercise coach. We conducted an online study of older adults, where participants evaluated videos of the robot giving feedback at different cadences for each modality. The results indicate that changing the cadence of one modality affects the perception of both it and the other modality. We can use the results from this study to better design the frequency of the robot coach's feedback during an exercise session with this population.
https://arxiv.org/abs/2601.08819
Localization is a fundamental capability for autonomous robots, enabling them to operate effectively in dynamic environments. In Robocon 2025, accurate and reliable localization is crucial for improving shooting precision, avoiding collisions with other robots, and navigating the competition field efficiently. In this paper, we propose a hybrid localization algorithm that integrates classical techniques with learning-based methods, relying solely on visual data from the court's floor to achieve self-localization on the basketball field.
https://arxiv.org/abs/2601.08713
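The abstract does not specify how the classical and learned estimates are combined; a minimal sketch of one plausible scheme, confidence-weighted fusion of two position estimates, is below. The function name and the weighting rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def fuse_poses(classical_xy, learned_xy, w_learned):
    """Confidence-weighted fusion of two 2D position estimates.

    classical_xy, learned_xy: (2,) arrays; w_learned in [0, 1] is the
    weight given to the learning-based estimate (hypothetical scheme).
    """
    classical_xy = np.asarray(classical_xy, dtype=float)
    learned_xy = np.asarray(learned_xy, dtype=float)
    return (1.0 - w_learned) * classical_xy + w_learned * learned_xy

# Example: trust the learned estimate 30%.
fused = fuse_poses([1.0, 2.0], [1.2, 2.2], 0.3)
```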
VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
https://arxiv.org/abs/2601.08665
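The adaptive chain-of-thought gating described above (fast intuitive execution vs. slow deliberate reasoning) can be sketched with a simple uncertainty trigger. Using policy entropy as the trigger and the 0.8 threshold are assumptions for illustration; the paper's actual gating criterion is learned.

```python
import math

def action_entropy(probs):
    """Shannon entropy of an action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_mode(probs, threshold=0.8):
    """Dual-process gating: invoke 'slow' explicit reasoning only when
    the fast policy is uncertain (entropy above threshold)."""
    return "slow" if action_entropy(probs) > threshold else "fast"

# Confident distribution -> fast path; near-uniform -> slow path.
mode_confident = select_mode([0.9, 0.05, 0.05])
mode_uncertain = select_mode([0.34, 0.33, 0.33])
```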
This paper introduces a novel modular architecture for ROS2 that decouples the logic required to acquire, validate, and interpolate references from the control laws that track them. The design includes a dedicated component, named Reference Generator, that receives references, in the form of either single points or trajectories, from external nodes (e.g., planners), and writes single-point references at the controller's sampling period via the existing ros2_control chaining mechanism to downstream controllers. This separation removes duplicated reference-handling code from controllers and improves reusability across robot platforms. We implement two reference generators: one for handling joint-space references and one for Cartesian references, along with a set of new controllers (PD with gravity compensation, Cartesian pose, and admittance controllers) and validate the approach on simulated and real Universal Robots and Franka Emika manipulators. Results show that (i) references are tracked reliably in all tested scenarios, (ii) reference generators reduce duplicated reference-handling code across chained controllers to favor the construction and reuse of complex controller pipelines, and (iii) controller implementations remain focused only on control laws.
https://arxiv.org/abs/2601.08514
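The core job of a reference generator, turning an incoming trajectory into single-point references at the controller's sampling period, can be sketched as below. Linear interpolation is an assumption; the paper does not fix the interpolation scheme in the abstract.

```python
import numpy as np

def sample_trajectory(times, points, dt):
    """Resample a joint-space trajectory onto the controller's sampling
    grid, as a reference generator would before writing single-point
    references to downstream chained controllers.

    times:  waypoint timestamps (s), monotonically increasing
    points: waypoint values (one joint, for simplicity)
    dt:     controller sampling period (s)
    """
    times = np.asarray(times, dtype=float)
    points = np.asarray(points, dtype=float)
    grid = np.arange(times[0], times[-1] + 1e-9, dt)
    return grid, np.interp(grid, times, points)

# A 1 s trajectory from 0 to 2 rad, sampled at 4 Hz.
t, q = sample_trajectory([0.0, 1.0], [0.0, 2.0], 0.25)
```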
Achieving agile and generalized legged locomotion across terrains requires tight integration of perception and control, especially under occlusions and sparse footholds. Existing methods have demonstrated agility on parkour courses but often rely on end-to-end sensorimotor models with limited generalization and interpretability. By contrast, methods targeting generalized locomotion typically exhibit limited agility and struggle with visual occlusions. We introduce AME-2, a unified reinforcement learning (RL) framework for agile and generalized locomotion that incorporates a novel attention-based map encoder in the control policy. This encoder extracts local and global mapping features and uses attention mechanisms to focus on salient regions, producing an interpretable and generalized embedding for RL-based control. We further propose a learning-based mapping pipeline that provides fast, uncertainty-aware terrain representations robust to noise and occlusions, serving as policy inputs. It uses neural networks to convert depth observations into local elevations with uncertainties, and fuses them with odometry. The pipeline also integrates with parallel simulation so that we can train controllers with online mapping, aiding sim-to-real transfer. We validate AME-2 with the proposed mapping pipeline on a quadruped and a biped robot, and the resulting controllers demonstrate strong agility and generalization to unseen terrains in simulation and in real-world experiments.
https://arxiv.org/abs/2601.08485
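The attention-based map encoder's key operation, weighting terrain-map patch features by learned salience, can be sketched as single-head attention pooling. Dimensions, the dot-product scoring, and the query vector are illustrative stand-ins for the paper's learned encoder.

```python
import numpy as np

def attention_pool(patch_feats, query):
    """Attention pooling over map patch features: scores = feats @ query,
    weights = softmax(scores), output = weighted sum of features."""
    scores = patch_feats @ query
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ patch_feats, weights

# Patch 0 aligns with the query, so it receives more attention than patch 1.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pooled, w = attention_pool(feats, np.array([2.0, 0.0]))
```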
Constructing an accurate simulation model of real-world environments requires reliable estimation of physical parameters such as mass, geometry, friction, and contact surfaces. Traditional real-to-simulation (Real2Sim) pipelines rely on manual measurements or fixed, pre-programmed exploration routines, which limit their adaptability to varying tasks and user intents. This paper presents a Real2Sim framework that autonomously generates and executes Behavior Trees for task-specific physical interactions to acquire only the parameters required for a given simulation objective, without relying on pre-defined task templates or expert-designed exploration routines. Given a high-level user request, an incomplete simulation description, and an RGB observation of the scene, a vision-language model performs multi-modal reasoning to identify relevant objects, infer required physical parameters, and generate a structured Behavior Tree composed of elementary robotic actions. The resulting behavior is executed on a torque-controlled Franka Emika Panda, enabling compliant, contact-rich interactions for parameter estimation. The acquired measurements are used to automatically construct a physics-aware simulation. Experimental results on the real manipulator demonstrate estimation of object mass, surface height, and friction-related quantities across multiple scenarios, including occluded objects and incomplete prior models. The proposed approach enables interpretable, intent-driven, and autonomous Real2Sim pipelines, bridging high-level reasoning with physically-grounded robotic interaction.
https://arxiv.org/abs/2601.08454
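A Behavior Tree composed of elementary actions, as generated by the framework above, can be sketched with a minimal sequence node over leaf actions. The push/estimate-mass actions and their sensor values are hypothetical examples of a parameter-estimation interaction, not the paper's actual skill set.

```python
class Action:
    """Leaf node wrapping an elementary robot skill; tick() returns
    True on success, False on failure."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def tick(self, blackboard):
        return self.fn(blackboard)

class Sequence:
    """Control node: ticks children in order, fails on first failure."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        return all(c.tick(blackboard) for c in self.children)

# Hypothetical tree: push the object, then estimate its mass from the
# measured contact force (stand-in sensor value of 4.9 N).
def push(bb):
    bb["force"] = 4.9
    return True

def estimate_mass(bb):
    bb["mass"] = bb["force"] / 9.81
    return True

tree = Sequence([Action("push", push), Action("estimate_mass", estimate_mass)])
bb = {}
ok = tree.tick(bb)
```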
In this work, we aim to enable legged robots to learn how to interpret human social cues and produce appropriate behaviors through physical human guidance. However, learning through physical engagement can place a heavy burden on users when the process requires large amounts of human-provided data. To address this, we propose a human-in-the-loop framework that enables robots to acquire navigational behaviors in a data-efficient manner and to be controlled via multimodal natural human inputs, specifically gestural and verbal commands. We reconstruct interaction scenes using a physics-based simulation and aggregate data to mitigate distributional shifts arising from limited demonstration data. Our progressive goal cueing strategy adaptively feeds appropriate commands and navigation goals during training, leading to more accurate navigation and stronger alignment between human input and robot behavior. We evaluate our framework across six real-world agile navigation scenarios, including jumping over or avoiding obstacles. Our experimental results show that our proposed method succeeds in almost all trials across these scenarios, achieving a 97.15% task success rate with less than 1 hour of demonstration data in total.
https://arxiv.org/abs/2601.08422
Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.
https://arxiv.org/abs/2601.08414
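The confidence-driven fusion idea, letting per-view depth reliability determine each view's contribution, can be sketched for a single point as a confidence-weighted average. The weighting rule is an illustrative assumption, not SPARK's exact formulation.

```python
import numpy as np

def fuse_depths(depths, confidences):
    """Confidence-weighted fusion of per-view depth estimates for one
    point: low-confidence views contribute less, suppressing noise and
    view-dependent outliers."""
    depths = np.asarray(depths, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return float(w @ depths)

# The third view is a 5 m outlier, but its low confidence (0.05)
# keeps the fused depth close to the two agreeing views.
d = fuse_depths([2.0, 2.1, 5.0], [1.0, 1.0, 0.05])
```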
Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
https://arxiv.org/abs/2601.08325
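The active view selection step, choosing viewpoints that balance relevance against diversity, can be sketched as a greedy heuristic: repeatedly pick the view maximizing relevance plus distance to already-chosen views. This scoring is a common submodular-style stand-in assumed here; ActiveVLA's actual criterion also penalizes occlusions.

```python
import numpy as np

def select_views(relevance, positions, k=2, div_weight=1.0):
    """Greedy view selection: score = relevance + div_weight * distance
    to the nearest already-selected view (diversity bonus)."""
    positions = np.asarray(positions, dtype=float)
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(relevance)):
            if i in chosen:
                continue
            div = min((np.linalg.norm(positions[i] - positions[j])
                       for j in chosen), default=0.0)
            score = relevance[i] + div_weight * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# View 1 is nearly as relevant as view 0, but view 2's diversity bonus
# wins the second slot despite its low relevance.
views = select_views([1.0, 0.9, 0.2], [[0, 0], [0.1, 0], [3, 0]], k=2)
```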
Low-cost inertial measurement units (IMUs) are widely utilized in mobile robot localization due to their affordability and ease of integration. However, their complex, nonlinear, and time-varying noise characteristics often lead to significant degradation in localization accuracy when applied directly for dead reckoning. To overcome this limitation, we propose a novel brain-inspired state estimation framework that combines a spiking neural network (SNN) with an invariant extended Kalman filter (InEKF). The SNN is designed to extract motion-related features from long sequences of IMU data affected by substantial random noise and is trained via a surrogate gradient descent algorithm to enable dynamic adaptation of the covariance noise parameter within the InEKF. By fusing the SNN output with raw IMU measurements, the proposed method enhances the robustness and accuracy of pose estimation. Extensive experiments conducted on the KITTI dataset and real-world data collected using a mobile robot equipped with a low-cost IMU demonstrate that the proposed approach outperforms state-of-the-art methods in localization accuracy and exhibits strong robustness to sensor noise, highlighting its potential for real-world mobile robot applications.
https://arxiv.org/abs/2601.08248
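The filter-side effect of the SNN, dynamically adapting the measurement noise covariance, can be illustrated with a scalar Kalman update where R is supplied externally rather than fixed. This is a textbook scalar update for illustration, not the paper's invariant EKF on matrix Lie groups.

```python
def kalman_update(x, P, z, R):
    """Scalar Kalman measurement update. The brain-inspired filter's key
    idea is that R (measurement noise covariance) is set online from
    learned features; here it is simply passed in.
    """
    K = P / (P + R)              # Kalman gain
    x_new = x + K * (z - x)
    P_new = (1.0 - K) * P
    return x_new, P_new

# With a large network-predicted R, a noisy measurement is discounted;
# with a small R, the same measurement dominates the estimate.
x_noisy, _ = kalman_update(0.0, 1.0, 10.0, R=9.0)    # K = 0.1
x_clean, _ = kalman_update(0.0, 1.0, 10.0, R=0.25)   # K = 0.8
```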
Dexterous grasp synthesis remains a central challenge: the high dimensionality and kinematic diversity of multi-fingered hands prevent direct transfer of algorithms developed for parallel-jaw grippers. Existing approaches typically depend on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials, hindering scalability as new dexterous hand designs emerge. To this end, we propose a data-efficient framework, which is designed to bypass robot grasp data collection by exploiting the rich, object-centric semantic priors latent in pretrained generative diffusion models. Temporally aligned and fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. A kinematics-aware retargeting module then maps these affordance representations to diverse dexterous hands without per-hand retraining. The resulting system produces stable, functionally appropriate multi-contact grasps that remain reliably successful across common objects and tools, while exhibiting strong generalization across previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision-language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
https://arxiv.org/abs/2601.08246
Low-cost inertial navigation systems (INS) are prone to sensor biases and measurement noise, which lead to rapid degradation of navigation accuracy during global positioning system (GPS) outages. To address this challenge and improve positioning continuity in GPS-denied environments, this paper proposes a brain-inspired GPS/INS fusion network (BGFN) based on spiking neural networks (SNNs). The BGFN architecture integrates a spiking Transformer with a spiking encoder to simultaneously extract spatial features from inertial measurement unit (IMU) signals and capture their temporal dynamics. By modeling the relationship between vehicle attitude, specific force, angular rate, and GPS-derived position increments, the network leverages both current and historical IMU data to estimate vehicle motion. The effectiveness of the proposed method is evaluated through real-world field tests and experiments on public datasets. Compared to conventional deep learning approaches, the results demonstrate that BGFN achieves higher accuracy and enhanced reliability in navigation performance, particularly under prolonged GPS outages.
https://arxiv.org/abs/2601.08244
Autonomous experimentation holds the potential to accelerate materials development by combining artificial intelligence (AI) with modular robotic platforms to explore extensive combinatorial chemical and processing spaces. Such self-driving laboratories can not only increase the throughput of repetitive experiments, but also incorporate human domain expertise to drive the search towards user-defined objectives, including improved materials performance metrics. We present an autonomous materials synthesis extension to SARA, the Scientific Autonomous Reasoning Agent, utilizing phase information provided by an automated probabilistic phase labeling algorithm to expedite the search for targeted phase regions. By incorporating human input into an expanded SARA-H (SARA with human-in-the-loop) framework, we enhance the efficiency of the underlying reasoning process. Using synthetic benchmarks, we demonstrate the efficiency of our AI implementation and show that the human input can contribute to significant improvement in sampling efficiency. We conduct experimental active learning campaigns using robotic processing of thin-film samples of several oxide material systems, including Bi$_2$O$_3$, SnO$_x$, and Bi-Ti-O, using lateral-gradient laser spike annealing to synthesize and kinetically trap metastable phases. We showcase the utility of human-in-the-loop autonomous experimentation for the Bi-Ti-O system, where we identify extensive processing domains that stabilize $\delta$-Bi$_2$O$_3$ and Bi$_2$Ti$_2$O$_7$, explore dwell-dependent ternary oxide phase behavior, and provide evidence confirming predictions that cationic substitutional doping of TiO$_2$ with Bi inhibits the unfavorable transformation of the metastable anatase to the ground-state rutile phase. The autonomous methods we have developed enable the discovery of new materials and new understanding of materials synthesis and properties.
https://arxiv.org/abs/2601.08185
This paper presents a gripper capable of grasping and recognizing terrain shapes for mobile robots in extreme environments. Multi-limbed climbing robots with grippers are effective on rough terrains, such as cliffs and cave walls. However, such robots may fall over by misgrasping the surface or get stuck owing to the loss of graspable points in unknown natural environments. To overcome these issues, we need a gripper that can adapt its grasp to irregular terrains, serving not only to grasp but also to measure the shape of the terrain surface accurately. We developed a gripper that can grasp both convex and concave terrains and simultaneously measure the terrain shape by introducing a pin-array structure. We demonstrated the mechanism of the gripper and evaluated its grasping and terrain recognition performance using a prototype. Moreover, the proposed pin-array design works well for 3D terrain mapping as well as adaptive grasping for irregular terrains.
https://arxiv.org/abs/2601.08143
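How a pin-array structure yields a terrain shape measurement can be sketched as follows: a pin that retracts less on contact touched a higher surface, so the measured pin travel maps directly to a local heightmap. The geometry, units, and 2x2 layout here are illustrative assumptions.

```python
import numpy as np

def pins_to_heightmap(pin_travel, max_travel, rows, cols):
    """Convert measured pin insertion depths (how far each pin retracted
    when pressed against terrain) into a local heightmap: height = full
    pin travel minus measured travel."""
    travel = np.asarray(pin_travel, dtype=float).reshape(rows, cols)
    return max_travel - travel

# Front pins retract 5 mm, rear pins 2 mm -> the rear surface is higher.
hm = pins_to_heightmap([5, 5, 2, 2], max_travel=10.0, rows=2, cols=2)
```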
Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to dissect algorithms for behavioral control. This is enabled by leveraging neuromechanical digital twins: computational models that embed artificial neural controllers within realistic body models in simulated environments. Here we review advances in the creation and use of neuromechanical digital twins while also highlighting emerging opportunities for the future. First, we illustrate how neuromechanical models allow researchers to infer hidden biophysical variables that may be difficult to measure experimentally. Additionally, by perturbing these models, one can generate new experimentally testable hypotheses. Next, we explore how neuromechanical twins have been used to foster a deeper exchange between neuroscience, robotics, and machine learning. Finally, we show how neuromechanical twins can advance healthcare. We envision that coupling studies on animals with active probing of their neuromechanical twins will greatly accelerate neuroscientific discovery.
https://arxiv.org/abs/2601.08056
We introduce Fiducial Exoskeletons, an image-based reformulation of 3D robot state estimation that replaces cumbersome procedures and motor-centric pipelines with single-image inference. Traditional approaches - especially robot-camera extrinsic estimation - often rely on high-precision actuators and require time-consuming routines such as hand-eye calibration. In contrast, modern learning-based robot control is increasingly trained and deployed from RGB observations on lower-cost hardware. Our key insight is twofold. First, we cast robot state estimation as 6D pose estimation of each link from a single RGB image: the robot-camera base transform is obtained directly as the estimated base-link pose, and the joint state is recovered via a lightweight global optimization that enforces kinematic consistency with the observed link poses (optionally warm-started with encoder readings). Second, we make per-link 6D pose estimation robust and simple - even without learning - by introducing the fiducial exoskeleton: a lightweight 3D-printed mount with a fiducial marker on each link and known marker-link geometry. This design yields robust camera-robot extrinsics, per-link SE(3) poses, and joint-angle state from a single image, enabling robust state estimation even on unplugged robots. Demonstrated on a low-cost robot arm, fiducial exoskeletons substantially simplify setup while improving calibration, state accuracy, and downstream 3D control performance. We release code and printable hardware designs to enable further algorithm-hardware co-design.
https://arxiv.org/abs/2601.08034
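The kinematic-consistency step, recovering joint state from estimated per-link poses, can be sketched for one revolute joint: the relative rotation between consecutive link orientations determines the joint angle. Assuming a z-axis joint with no fixed offset keeps the recovery closed-form; the paper's general case is a global optimization over all links.

```python
import numpy as np

def rotz(a):
    """Rotation matrix about z by angle a (rad)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def joint_angle(R_parent, R_child):
    """Recover a revolute joint angle (axis = parent z) from the two
    estimated link orientations via their relative rotation."""
    R_rel = R_parent.T @ R_child
    return float(np.arctan2(R_rel[1, 0], R_rel[0, 0]))

# Parent link at 0.3 rad, child at 1.0 rad -> joint angle 0.7 rad.
theta = joint_angle(rotz(0.3), rotz(1.0))
```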
Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.
https://arxiv.org/abs/2601.07823
Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a new paradigm minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training. Videos and code are available at this https URL.
https://arxiv.org/abs/2601.07821
The teleoperation of robotic hands is limited by the high cost of depth cameras and sensor gloves, which are commonly used to estimate relative hand joint positions (XYZ). We present THETA, a novel, cost-effective approach that uses three webcams for triangulation-based tracking to approximate the relative joint angles (theta) of human fingers. We also introduce a modified DexHand, a low-cost robotic hand from TheRobotStudio, to demonstrate THETA's real-time application. Data collection involved 40 distinct hand gestures captured by three 640x480p webcams arranged at 120-degree intervals, generating over 48,000 RGB images. Joint angles were determined manually by measuring the midpoints of the MCP, PIP, and DIP finger joints. Captured RGB frames were processed with a DeepLabV3 segmentation model with a ResNet-50 backbone for multi-scale hand segmentation. The segmented images were then HSV-filtered and fed into THETA's architecture, consisting of a MobileNetV2-based CNN classifier optimized for hierarchical spatial feature extraction and a 9-channel input tensor encoding multi-perspective hand representations. The classification model maps segmented hand views to discrete joint angles, achieving 97.18% accuracy, 98.72% recall, an F1 score of 0.9274, and a precision of 0.8906. In real-time inference, THETA captures simultaneous frames, segments the hand regions, filters them, and compiles a 9-channel tensor for classification. Joint-angle predictions are relayed via serial to an Arduino, enabling the DexHand to replicate hand movements. Future research will increase dataset diversity, integrate wrist tracking, and apply computer vision techniques such as OpenAI-Vision. THETA offers the potential for cost-effective, user-friendly teleoperation in medical, linguistic, and manufacturing applications.
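The abstract describes a 9-channel input tensor encoding the three camera views but does not give the exact channel layout. A minimal sketch assuming the three segmented RGB views are simply concatenated along the channel axis (the function name and ordering are illustrative assumptions):

```python
import numpy as np

def build_input_tensor(view_a, view_b, view_c):
    """Stack three segmented RGB views (H, W, 3) from the 120-degree
    webcam ring into a single (H, W, 9) tensor for the classifier."""
    for v in (view_b, view_c):
        # All views must share the same spatial size and have 3 channels.
        assert v.shape == view_a.shape and v.shape[-1] == 3
    return np.concatenate([view_a, view_b, view_c], axis=-1)

# Three hypothetical 480x640 segmented frames, one per webcam.
frames = [np.zeros((480, 640, 3), dtype=np.float32) for _ in range(3)]
tensor = build_input_tensor(*frames)  # shape (480, 640, 9)
```

Channel-wise stacking like this lets a standard CNN (here, the MobileNetV2-based classifier) see all three perspectives jointly in its first convolution, rather than processing each view with a separate branch.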
https://arxiv.org/abs/2601.07768
Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state-estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; in particular, previous works using virtual obstacles are implemented case by case. In this work, we present \textit{Hiking in the Wild}, a scalable, end-to-end perceptive parkour framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable \textit{Terrain Edge Detection} with \textit{Foot Volume Points} to prevent catastrophic slippage on edges, and a \textit{Flat Patch Sampling} strategy that mitigates reward hacking by generating feasible navigation targets. Our approach uses a single-stage reinforcement learning scheme that maps raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
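The abstract does not detail how Flat Patch Sampling selects feasible targets. A minimal sketch of one plausible reading: rejection-sample candidate cells on a heightmap and keep only those whose local patch is flat (low height deviation), so the policy is never rewarded for reaching an infeasible spot. The threshold, patch size, and terrain below are illustrative assumptions, not values from the paper:

```python
import numpy as np

FLATNESS_THRESHOLD = 0.02  # hypothetical max height std-dev (meters)
PATCH_RADIUS = 1           # half-width of the patch window, in cells

def is_flat(heightmap, row, col):
    """Check whether the patch centred at (row, col) is flat enough
    to serve as a feasible navigation target."""
    patch = heightmap[row - PATCH_RADIUS: row + PATCH_RADIUS + 1,
                      col - PATCH_RADIUS: col + PATCH_RADIUS + 1]
    return float(patch.std()) < FLATNESS_THRESHOLD

def sample_flat_patches(heightmap, rng, n_candidates=100):
    """Rejection-sample target cells, keeping only flat patches."""
    h, w = heightmap.shape
    targets = []
    for _ in range(n_candidates):
        r = rng.integers(PATCH_RADIUS, h - PATCH_RADIUS)
        c = rng.integers(PATCH_RADIUS, w - PATCH_RADIUS)
        if is_flat(heightmap, r, c):
            targets.append((int(r), int(c)))
    return targets

# Hypothetical terrain: flat ground with a 0.5 m ledge on the right half.
terrain = np.zeros((20, 20))
terrain[:, 10:] = 0.5
rng = np.random.default_rng(0)
targets = sample_flat_patches(terrain, rng)
# Every kept target lies on one of the two flat plateaus; patches
# straddling the ledge edge (columns 9-10) are rejected.
```

Filtering targets this way removes the reward-hacking incentive the abstract mentions: if goals could land on edges or sheer faces, the policy could earn reward by exploiting unreachable or unsafe placements.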
https://arxiv.org/abs/2601.07718