To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code and data are available at this https URL.
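The routing step can be illustrated with a small sketch: one lightweight autoencoder per task scores how well it reconstructs the current feature vector, and the best-scoring task's adapters are activated. Everything below (the linear PCA-style autoencoders, the 4-D toy features, the function names) is a hypothetical stand-in for illustration, not CLARE's actual implementation:

```python
import numpy as np

def fit_routers(task_features, code_dim=2):
    """Per task, fit a linear 'autoencoder' (mean + top principal
    directions); reconstruction error then scores how familiar a
    feature vector is to that task."""
    routers = []
    for X in task_features:
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        routers.append((mu, Vt[:code_dim]))
    return routers

def route(routers, x):
    """Activate the adapters of the task whose autoencoder reconstructs
    the current feature vector x with the lowest error (no task label)."""
    errors = []
    for mu, V in routers:
        recon = mu + (x - mu) @ V.T @ V
        errors.append(np.linalg.norm(x - recon))
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
# Two hypothetical tasks whose features cluster around different directions.
task0 = rng.normal(size=(50, 4)) * 0.1 + np.array([1.0, 0.0, 0.0, 0.0])
task1 = rng.normal(size=(50, 4)) * 0.1 + np.array([0.0, 0.0, 1.0, 0.0])
routers = fit_routers([task0, task1])
print(route(routers, np.array([1.0, 0.0, 0.0, 0.0])))  # 0
print(route(routers, np.array([0.0, 0.0, 1.0, 0.0])))  # 1
```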
https://arxiv.org/abs/2601.09512
Generalization of imitation-learned navigation policies to environments unseen in training remains a major challenge. We address this by conducting the first large-scale study of how data quantity and data diversity affect real-world generalization in end-to-end, map-free visual navigation. Using a curated 4,565-hour crowd-sourced dataset collected across 161 locations in 35 countries, we train policies for point goal navigation and evaluate their closed-loop control performance on sidewalk robots operating in four countries, covering 125 km of autonomous driving. Our results show that large-scale training data enables zero-shot navigation in unknown environments, approaching the performance of policies trained with environment-specific demonstrations. Critically, we find that data diversity is far more important than data quantity. Doubling the number of geographical locations in a training set decreases navigation errors by ~15%, while the performance benefit of adding more data from existing locations saturates after very little data. We also observe that, with noisy crowd-sourced data, simple regression-based models outperform generative and sequence-based architectures. We release our policies, evaluation setup and example videos on the project page.
https://arxiv.org/abs/2601.09444
Generating safe and reliable trajectories for autonomous vehicles in long-tail scenarios remains a significant challenge, particularly for high-lateral-acceleration maneuvers such as sharp turns, which represent critical safety situations. Existing trajectory planners exhibit systematic failures in these scenarios due to data imbalance. This results in insufficient modelling of vehicle dynamics, road geometry, and environmental constraints in high-risk situations, leading to suboptimal or unsafe trajectory prediction when vehicles operate near their physical limits. In this paper, we introduce ReflexDiffusion, a novel inference-stage framework that enhances diffusion-based trajectory planners through reflective adjustment. Our method introduces a gradient-based adjustment mechanism during the iterative denoising process: after each standard trajectory update, we compute the gradient between the conditional and unconditional noise predictions to explicitly amplify critical conditioning signals, including road curvature and lateral vehicle dynamics. This amplification enforces strict adherence to physical constraints, particularly improving stability during high-lateral-acceleration maneuvers where precise vehicle-road interaction is paramount. Evaluated on the nuPlan Test14-hard benchmark, ReflexDiffusion achieves a 14.1% improvement in driving score for high-lateral-acceleration scenarios over state-of-the-art (SOTA) methods. This demonstrates that inference-time trajectory optimization can effectively compensate for training data sparsity by dynamically reinforcing safety-critical constraints near handling limits. The framework's architecture-agnostic design enables direct deployment to existing diffusion-based planners, offering a practical solution for improving autonomous vehicle safety in challenging driving conditions.
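The reflective adjustment resembles a classifier-free-guidance-style extrapolation between conditional and unconditional noise predictions. A minimal sketch of that update rule follows; the weight and the toy numbers are illustrative assumptions, and the paper's exact gradient computation may differ:

```python
import numpy as np

def reflective_update(eps_cond, eps_uncond, weight=2.0):
    """Extrapolate along the difference between conditional and
    unconditional noise predictions; weight > 1 amplifies the
    conditioning signal (classifier-free-guidance-style rule; the
    paper's exact adjustment may differ)."""
    return eps_uncond + weight * (eps_cond - eps_uncond)

# Toy 1-D "noise predictions" over a trajectory of four waypoints.
eps_uncond = np.zeros(4)
eps_cond = np.array([0.1, 0.2, 0.3, 0.4])
print(reflective_update(eps_cond, eps_uncond))  # [0.2 0.4 0.6 0.8]
```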
https://arxiv.org/abs/2601.09377
This technical report presents the construction and analysis of polynomial navigation functions for motion planning in 3-D workspaces populated by spherical and cylindrical obstacles. The workspace is modeled as a bounded spherical region, and obstacles are encoded using smooth polynomial implicit functions. We establish conditions under which the proposed navigation functions admit a unique non-degenerate minimum at the target while avoiding local minima, including in the presence of pairwise intersecting obstacles. Gradient and Hessian analyses are provided, and the theoretical results are validated through numerical simulations in obstacle-rich 3-D environments.
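For intuition, a classical Koditschek-Rimon sphere-world navigation function (the family such polynomial constructions build on) can be written in a few lines: it is 0 only at the target, 1 on workspace and obstacle boundaries, and smooth in between. The 2-D workspace, obstacle placement, and tuning parameter k below are illustrative choices, not the report's construction:

```python
import math

def navfn(q, goal, r0, obstacles, k=2):
    """Koditschek-Rimon-style navigation function on a spherical workspace
    of radius r0 centered at the origin, with spherical obstacles given as
    ((cx, cy), radius) pairs. Shown in 2-D for brevity."""
    d2 = (q[0] - goal[0]) ** 2 + (q[1] - goal[1]) ** 2   # squared goal distance
    beta = r0 ** 2 - (q[0] ** 2 + q[1] ** 2)             # workspace boundary term
    for (cx, cy), r in obstacles:
        beta *= (q[0] - cx) ** 2 + (q[1] - cy) ** 2 - r ** 2
    return d2 / (d2 ** k + beta) ** (1.0 / k)

goal, r0 = (3.0, 0.0), 5.0
obstacles = [((0.0, 1.0), 0.5)]
print(navfn(goal, goal, r0, obstacles))        # 0.0 at the target
print(navfn((0.0, 0.5), goal, r0, obstacles))  # 1.0 on an obstacle boundary
print(round(navfn((-3.0, 0.0), goal, r0, obstacles), 3))  # strictly inside (0, 1)
```

Gradient descent on such a function reaches the target from almost every initial condition once k is chosen large enough, which is the property the report establishes for its polynomial variants.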
https://arxiv.org/abs/2601.09318
An emerging class of trajectory optimization methods enforces collision avoidance by jointly optimizing the robot's configuration and a separating hyperplane. However, as linear separators only apply to convex sets, these methods require convex approximations of both the robot and obstacles, which becomes an overly conservative assumption in cluttered and narrow environments. In this work, we unequivocally remove this limitation by introducing nonlinear separating hypersurfaces parameterized by polynomial functions. We first generalize the classical separating hyperplane theorem and prove that any two disjoint bounded closed sets in Euclidean space can be separated by a polynomial hypersurface, serving as the theoretical foundation for nonlinear separation of arbitrary geometries. Building on this result, we formulate a nonlinear programming (NLP) problem that jointly optimizes the robot's trajectory and the coefficients of the separating polynomials, enabling geometry-aware collision avoidance without conservative convex simplifications. The optimization remains efficiently solvable using standard NLP solvers. Simulation and real-world experiments with nonconvex robots demonstrate that our method achieves smooth, collision-free, and agile maneuvers in environments where convex-approximation baselines fail.
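The key idea, separating two nonconvex point sets with a polynomial hypersurface whose coefficients are decision variables, can be sketched as a small least-squares fit. In the full method the coefficients are optimized jointly with the trajectory inside an NLP; this standalone toy (degree-2 monomials in 2-D) only demonstrates that a polynomial separator exists where no hyperplane does:

```python
import numpy as np

def poly_features(pts, degree=2):
    # Monomial features up to degree 2 for 2-D points: [1, x, y, x^2, xy, y^2].
    x, y = pts[:, 0], pts[:, 1]
    return np.stack([np.ones_like(x), x, y, x**2, x * y, y**2], axis=1)

def fit_separating_polynomial(A, B):
    """Least-squares fit of coefficients c with p(A) ~ +1 and p(B) ~ -1;
    the zero level set of p is then a candidate separating hypersurface."""
    Phi = np.vstack([poly_features(A), poly_features(B)])
    target = np.concatenate([np.ones(len(A)), -np.ones(len(B))])
    c, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    return c

# Sets no hyperplane can separate: an inner circle surrounded by a ring.
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
A = 0.3 * np.stack([np.cos(theta), np.sin(theta)], axis=1)  # robot samples
B = 2.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)  # obstacle samples
c = fit_separating_polynomial(A, B)
sep = (poly_features(A) @ c > 0).all() and (poly_features(B) @ c < 0).all()
print(sep)  # True: a degree-2 polynomial separates the two sets
```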
https://arxiv.org/abs/2601.09231
Agile control of robotic systems often requires anticipating how the environment affects system behavior. For example, a driver must perceive the road ahead to anticipate available friction and plan actions accordingly. Achieving such proactive adaptation within autonomous frameworks remains a challenge, particularly under rapidly changing conditions. Traditional modeling approaches often struggle to capture abrupt variations in system behavior, while adaptive methods are inherently reactive and may adapt too late to ensure safety. We propose a vision-conditioned variational Bayesian last-layer dynamics model that leverages visual context to anticipate changes in the environment. The model first learns nominal vehicle dynamics and is then fine-tuned with feature-wise affine transformations of latent features, enabling context-aware dynamics prediction. The resulting model is integrated into an optimal controller for vehicle racing. We validate our method on a Lexus LC500 racing through water puddles. With vision-conditioning, the system completed all 12 attempted laps under varying conditions. In contrast, all baselines without visual context consistently lost control, demonstrating the importance of proactive dynamics adaptation in high-performance applications.
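The "feature-wise affine transformations of latent features" correspond to FiLM-style modulation: a context network outputs per-channel scale and shift parameters that are applied to the dynamics model's latent features. A minimal sketch, with made-up numbers standing in for the vision-derived context:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise affine (FiLM-style) modulation: scale and shift each
    latent channel with context-derived parameters, so the same dynamics
    backbone produces context-aware predictions."""
    return gamma * features + beta

# Latent dynamics features for one timestep (toy values).
h = np.array([0.5, -1.0, 2.0])
# Hypothetical context-network output for "low-friction surface ahead".
gamma, beta = np.array([0.2, 1.0, 1.0]), np.array([0.0, 0.0, -0.5])
print(film(h, gamma, beta))  # [ 0.1 -1.   1.5]
```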
https://arxiv.org/abs/2601.09178
Robotic foundation models trained on large-scale manipulation datasets have shown promise in learning generalist policies, but they often overfit to specific viewpoints, robot arms, and especially parallel-jaw grippers due to dataset biases. To address this limitation, we propose Cross-Embodiment Interface (CEI), a framework for cross-embodiment learning that enables the transfer of demonstrations across different robot arm and end-effector morphologies. CEI introduces the concept of "functional similarity", which is quantified using Directional Chamfer Distance. Then it aligns robot trajectories through gradient-based optimization, followed by synthesizing observations and actions for unseen robot arms and end-effectors. In experiments, CEI transfers data and policies from a Franka Panda robot to 16 different embodiments across 3 tasks in simulation, and supports bidirectional transfer between a UR5+AG95 gripper robot and a UR5+Xhand robot across 6 real-world tasks, achieving an average transfer ratio of 82.4%. Finally, we demonstrate that CEI can also be extended with spatial generalization and multimodal motion generation capabilities using our proposed techniques. Project website: this https URL
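The asymmetric (one-sided) chamfer distance underlying the functional-similarity metric can be sketched as follows. The point clouds are toy data, and the paper's Directional Chamfer Distance may include additional terms beyond this plain nearest-neighbor form:

```python
import numpy as np

def directional_chamfer(A, B):
    """One-sided chamfer distance from point set A to point set B:
    mean over points in A of the distance to their nearest neighbor
    in B. Note the asymmetry: d(A, B) != d(B, A) in general."""
    diff = A[:, None, :] - B[None, :, :]   # pairwise difference vectors
    dists = np.linalg.norm(diff, axis=-1)  # |A| x |B| distance matrix
    return dists.min(axis=1).mean()

# Hypothetical gripper-tip point clouds for two end-effectors.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.1], [1.0, 0.1], [2.0, 0.0]])
print(directional_chamfer(A, B))  # 0.1: every point of A is 0.1 from B
print(directional_chamfer(B, A))  # 0.4: includes the far point (2, 0)
```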
https://arxiv.org/abs/2601.09163
This paper presents a design methodology for a hydraulically-driven soft robotic gripper for grasping large, heavy objects of approximately 10-20 kg and 20-30 cm diameter. Most existing soft grippers are pneumatically actuated at several hundred kPa of pressure and cannot generate output force sufficient for such a large, heavy object. Instead of pneumatic actuation, hydraulic actuation has the potential to generate much larger power at several MPa of pressure. In this study, we develop a hydraulically-driven soft gripper whose basic design parameters are determined based on a mathematical model that represents the relationship among the driving pressure, bending angle, object mass and grasping force. Moreover, we selected materials suitable for grasping a heavier object, based on the finite element analysis result of the detailed design. We report experimental results on grasping a 20 kg object and closed-loop control of the finger bending angle.
https://arxiv.org/abs/2601.09104
Humanoid robot manipulation is a crucial research area for executing diverse human-level tasks, involving high-level semantic reasoning and low-level action generation. However, precise scene understanding and sample-efficient learning from human demonstrations remain critical challenges, severely hindering the applicability and generalizability of existing frameworks. This paper presents RGMP-S, a novel Recurrent Geometric-prior Multimodal Policy with Spiking features, facilitating both high-level skill reasoning and data-efficient motion synthesis. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases to enable precise 3D scene understanding within the vision-language model. Specifically, we construct a Long-horizon Geometric Prior Skill Selector that effectively aligns the semantic instructions with spatial constraints, ultimately achieving robust generalization in unseen environments. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network. We parameterize robot-object interactions via recursive spiking for spatiotemporal consistency, fully distilling long-horizon dynamic features while mitigating the overfitting issue in sparse demonstration scenarios. Extensive experiments are conducted across the Maniskill simulation benchmark and three heterogeneous real-world robotic systems, encompassing a custom-developed humanoid, a desktop manipulator, and a commercial robotic platform. Empirical results substantiate the superiority of our method over state-of-the-art baselines and validate the efficacy of the proposed modules in diverse generalization scenarios. To facilitate reproducibility, the source code and video demonstrations are publicly available at this https URL.
https://arxiv.org/abs/2601.09031
Complex decision-making by autonomous machines and algorithms could underpin the foundations of future society. Generative AI is emerging as a powerful engine for such transitions. However, we show that Generative AI-driven developments pose a critical pitfall: fairness concerns. In robotic applications, although intuitions about fairness are common, a precise and implementable definition that captures user utility and inherent data randomness is missing. Here we provide a utility-aware fairness metric for robotic decision making and analyze fairness jointly with user-data privacy, deriving conditions under which privacy budgets govern fairness metrics. This yields a unified framework that formalizes and quantifies fairness and its interplay with privacy, which is tested in a robot navigation task. Given that most robotic systems will enforce user privacy under legal requirements, the approach surprisingly shows that such privacy budgets can be jointly used to meet fairness targets. Addressing fairness concerns jointly with privacy in this way is a step towards the ethical use of AI and strengthens trust in autonomous robots deployed in everyday environments.
https://arxiv.org/abs/2601.08953
People can respond to feedback and guidance in different ways, and it is important for robots to personalize their interactions and utilize verbal and nonverbal communication cues. We aim to understand how older adults respond to different cadences of verbal and nonverbal feedback of a robot exercise coach. We conducted an online study of older adults, where participants evaluated videos of the robot giving feedback at different cadences for each modality. The results indicate that changing the cadence of one modality affects the perception of both it and the other modality. We can use the results from this study to better design the frequency of the robot coach's feedback during an exercise session with this population.
https://arxiv.org/abs/2601.08819
The incorporation of advanced control algorithms into prosthetic hands significantly enhances their ability to replicate the intricate motions of a human hand. This work introduces a model-based controller that combines an Artificial Neural Network (ANN) approach with a Sliding Mode Controller (SMC) designed for a tendon-driven soft continuum wrist integrated into a prosthetic hand known as "PRISMA HAND II". Our research focuses on developing a controller that provides a fast dynamic response with reduced computational effort during wrist motions. The proposed controller consists of an ANN for computing bending angles together with an SMC to regulate tendon forces. Kinematic and dynamic models of the wrist are formulated using the Piece-wise Constant Curvature (PCC) hypothesis. The performance of the proposed controller is compared with other control strategies developed for the same wrist. Simulation studies and experimental validations of the fabricated wrist using the controller are included in the paper.
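The PCC hypothesis reduces each continuum segment to an arc of constant curvature, so the planar tip position follows in closed form from the curvature and arc length. A minimal sketch of that forward kinematics (the paper's full model additionally covers dynamics and tendon forces):

```python
import math

def pcc_tip_position(kappa, length):
    """Planar tip position of a single constant-curvature segment of the
    given arc length and curvature kappa, per the standard PCC model:
    bending angle theta = kappa * length, with the straight-segment
    limit handled separately."""
    if abs(kappa) < 1e-9:          # straight segment: tip lies on the axis
        return 0.0, length
    theta = kappa * length         # bending angle of the segment
    x = (1.0 - math.cos(theta)) / kappa
    z = math.sin(theta) / kappa
    return x, z

# A 0.1 m segment bent into a quarter circle (theta = pi/2).
kappa = (math.pi / 2) / 0.1
x, z = pcc_tip_position(kappa, 0.1)
print(round(x, 4), round(z, 4))  # both equal the arc radius 1/kappa
```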
https://arxiv.org/abs/2601.08711
This paper presents a constraint-aware control framework for underactuated aerial manipulators, enabling accurate end-effector trajectory tracking while explicitly accounting for safety and feasibility constraints. The control problem is formulated as a quadratic program that computes dynamically consistent generalized accelerations subject to underactuation, actuator bounds, and system constraints. To enhance robustness against disturbances, modeling uncertainties, and steady-state errors, a passivity-based integral action is incorporated at the torque level without compromising feasibility. The effectiveness of the proposed approach is demonstrated through high-fidelity physics-based simulations, which include parameter perturbations, viscous joint friction, and realistic sensing and state-estimation effects. This demonstrates accurate tracking, smooth control inputs, and reliable constraint satisfaction under realistic operating conditions.
https://arxiv.org/abs/2601.08523
This paper introduces a novel modular architecture for ROS2 that decouples the logic required to acquire, validate, and interpolate references from the control laws that track them. The design includes a dedicated component, named Reference Generator, that receives references, in the form of either single points or trajectories, from external nodes (e.g., planners), and writes single-point references at the controller's sampling period via the existing ros2_control chaining mechanism to downstream controllers. This separation removes duplicated reference-handling code from controllers and improves reusability across robot platforms. We implement two reference generators: one for handling joint-space references and one for Cartesian references, along with a set of new controllers (PD with gravity compensation, Cartesian pose, and admittance controllers) and validate the approach on simulated and real Universal Robots and Franka Emika manipulators. Results show that (i) references are tracked reliably in all tested scenarios, (ii) reference generators reduce duplicated reference-handling code across chained controllers to favor the construction and reuse of complex controller pipelines, and (iii) controller implementations remain focused only on control laws.
https://arxiv.org/abs/2601.08514
Internet of underwater things (IoUT) is increasingly gathering attention with the aim of monitoring sea life and deep ocean environment, underwater surveillance as well as maintenance of underwater installations. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To tackle the time-sensitivity, we adopt the age of information (AoI) and Jain's fairness index. We develop two deep-reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results elucidate that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.
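Both metrics have standard closed forms: Jain's index is J(x) = (sum x)^2 / (n * sum x^2), and the AoI is the time average of an age curve that grows linearly and resets on each update delivery. A minimal sketch under a simplified zero-service-time model (illustrative, not the paper's system model):

```python
def jain_index(x):
    """Jain's fairness index: 1.0 for perfectly equal allocations,
    approaching 1/n when a single device dominates."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

def average_aoi(delivery_times, horizon):
    """Time-average age of information for one source, assuming a fresh
    update at t = 0 and zero service time: age resets to 0 at each
    delivery and grows linearly in between."""
    area, t_prev = 0.0, 0.0
    for t in list(delivery_times) + [horizon]:
        dt = t - t_prev
        area += 0.5 * dt * dt   # triangular area under the age curve
        t_prev = t
    return area / horizon

print(jain_index([1.0, 1.0, 1.0, 1.0]))  # 1.0: perfectly fair
print(jain_index([4.0, 0.0, 0.0, 0.0]))  # 0.25: one device hoards everything
print(average_aoi([2.0, 4.0, 6.0, 8.0], 10.0))  # 1.0: updates every 2 s
```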
https://arxiv.org/abs/2601.08491
Achieving agile and generalized legged locomotion across terrains requires tight integration of perception and control, especially under occlusions and sparse footholds. Existing methods have demonstrated agility on parkour courses but often rely on end-to-end sensorimotor models with limited generalization and interpretability. By contrast, methods targeting generalized locomotion typically exhibit limited agility and struggle with visual occlusions. We introduce AME-2, a unified reinforcement learning (RL) framework for agile and generalized locomotion that incorporates a novel attention-based map encoder in the control policy. This encoder extracts local and global mapping features and uses attention mechanisms to focus on salient regions, producing an interpretable and generalized embedding for RL-based control. We further propose a learning-based mapping pipeline that provides fast, uncertainty-aware terrain representations robust to noise and occlusions, serving as policy inputs. It uses neural networks to convert depth observations into local elevations with uncertainties, and fuses them with odometry. The pipeline also integrates with parallel simulation so that we can train controllers with online mapping, aiding sim-to-real transfer. We validate AME-2 with the proposed mapping pipeline on a quadruped and a biped robot, and the resulting controllers demonstrate strong agility and generalization to unseen terrains in simulation and in real-world experiments.
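The attention mechanism at the heart of the map encoder can be sketched as single-query scaled dot-product attention over terrain-patch features: salient patches receive high weights and dominate the pooled embedding. The patch features and query below are toy values, not the trained encoder's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def attention_pool(query, patch_feats):
    """Single-query scaled dot-product attention over map-patch features,
    returning the attention weights and the pooled embedding."""
    d = query.shape[-1]
    scores = patch_feats @ query / np.sqrt(d)
    w = softmax(scores)
    return w, w @ patch_feats

# Three hypothetical patch features; the second matches the query best.
patches = np.array([[0.0, 0.0], [2.0, 2.0], [0.5, 0.0]])
query = np.array([1.0, 1.0])
w, embedding = attention_pool(query, patches)
print(w.argmax())  # 1: the salient patch dominates the embedding
```

The attention weights also offer a direct interpretability hook: they show which map regions the policy is attending to at each step.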
https://arxiv.org/abs/2601.08485
Constructing an accurate simulation model of real-world environments requires reliable estimation of physical parameters such as mass, geometry, friction, and contact surfaces. Traditional real-to-simulation (Real2Sim) pipelines rely on manual measurements or fixed, pre-programmed exploration routines, which limit their adaptability to varying tasks and user intents. This paper presents a Real2Sim framework that autonomously generates and executes Behavior Trees for task-specific physical interactions to acquire only the parameters required for a given simulation objective, without relying on pre-defined task templates or expert-designed exploration routines. Given a high-level user request, an incomplete simulation description, and an RGB observation of the scene, a vision-language model performs multi-modal reasoning to identify relevant objects, infer required physical parameters, and generate a structured Behavior Tree composed of elementary robotic actions. The resulting behavior is executed on a torque-controlled Franka Emika Panda, enabling compliant, contact-rich interactions for parameter estimation. The acquired measurements are used to automatically construct a physics-aware simulation. Experimental results on the real manipulator demonstrate estimation of object mass, surface height, and friction-related quantities across multiple scenarios, including occluded objects and incomplete prior models. The proposed approach enables interpretable, intent-driven, and autonomous Real2Sim pipelines, bridging high-level reasoning with physically-grounded robotic interaction.
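The generated Behavior Trees compose elementary actions with standard control-flow nodes; a minimal sketch of the execution semantics is below. The action names, the blackboard keys, and the placeholder measurement are all hypothetical, not the framework's actual skill set:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Action:
    """Leaf node wrapping an elementary robot skill (hypothetical names)."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def tick(self, blackboard):
        return self.fn(blackboard)

class Sequence:
    """Standard BT sequence node: runs children in order and fails as
    soon as one child fails."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            if child.tick(blackboard) == FAILURE:
                return FAILURE
        return SUCCESS

# Hypothetical exploration routine: approach the object, then push it
# while reading a force sensor to estimate a mass-related parameter.
def approach(bb):
    bb["pose_reached"] = True
    return SUCCESS

def push_and_measure(bb):
    if not bb.get("pose_reached"):
        return FAILURE          # precondition: must be at the object first
    bb["estimated_mass"] = 1.2  # placeholder measurement, not real data
    return SUCCESS

tree = Sequence([Action("approach", approach),
                 Action("push_and_measure", push_and_measure)])
bb = {}
print(tree.tick(bb), bb["estimated_mass"])  # SUCCESS 1.2
```

Because each leaf is an interchangeable skill, a vision-language model can emit such trees as structured output and have them executed without task-specific templates.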
https://arxiv.org/abs/2601.08454
In this work, we aim to enable legged robots to learn how to interpret human social cues and produce appropriate behaviors through physical human guidance. However, learning through physical engagement can place a heavy burden on users when the process requires large amounts of human-provided data. To address this, we propose a human-in-the-loop framework that enables robots to acquire navigational behaviors in a data-efficient manner and to be controlled via multimodal natural human inputs, specifically gestural and verbal commands. We reconstruct interaction scenes using a physics-based simulation and aggregate data to mitigate distributional shifts arising from limited demonstration data. Our progressive goal cueing strategy adaptively feeds appropriate commands and navigation goals during training, leading to more accurate navigation and stronger alignment between human input and robot behavior. We evaluate our framework across six real-world agile navigation scenarios, including jumping over or avoiding obstacles. Our experimental results show that our proposed method succeeds in almost all trials across these scenarios, achieving a 97.15% task success rate with less than 1 hour of demonstration data in total.
https://arxiv.org/abs/2601.08422
Benefiting from the rapid advancements in large language models (LLMs), human-drone interaction has reached unprecedented opportunities. In this paper, we propose a method that integrates a fine-tuned CodeT5 model with the Unreal Engine-based AirSim drone simulator to efficiently execute multi-task operations using natural language commands. This approach enables users to interact with simulated drones through prompts or command descriptions, allowing them to easily access and control the drone's status, significantly lowering the operational threshold. In the AirSim simulator, we can flexibly construct visually realistic dynamic environments to simulate drone applications in complex scenarios. By combining a large dataset of (natural language, program code) command-execution pairs generated by ChatGPT with developer-written drone code as training data, we fine-tune the CodeT5 to achieve automated translation from natural language to executable code for drone tasks. Experimental results demonstrate that the proposed method exhibits superior task execution efficiency and command understanding capabilities in simulated environments. In the future, we plan to extend the model functionality in a modular manner, enhancing its adaptability to complex scenarios and driving the application of drone technologies in real-world environments.
https://arxiv.org/abs/2601.08405
Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
https://arxiv.org/abs/2601.08325