Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. We therefore present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO yields Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference optimization and online intra-stage interaction, respectively. Building further on supervised fine-tuning as initialization, we propose Imitation -> Preference -> Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, with state-of-the-art success rates of 98.0% on SimplerEnv and 96.4% on ManiSkill3 tasks.
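Vision-Language-Action (VLA) models, driven by large language models and reinforcement-learning-based fine-tuning, have recently made remarkable progress in robotic manipulation. Existing methods typically treat long-horizon action sequences as linguistic sequences and apply trajectory-level optimization such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), which leads to coarse credit assignment and unstable training. Unlike language, however, whose overall meaning is preserved even when sentence order varies, action trajectories unfold through causally linked stages that differ in learning difficulty, which makes progressive stage-wise optimization necessary. To this end, we propose Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, stage-aligned reinforcement signals. Integrating STARE into TPO and PPO gives Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO), used for offline stage-wise preference optimization and online intra-stage interaction, respectively. Building on supervised fine-tuning as the initialization, we propose Imitation-Preference-Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 show substantial gains, with task success rates of 98.0% and 96.4%, respectively, exceeding the best existing methods.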
https://arxiv.org/abs/2512.05107
This thesis presents a unified modeling and simulation framework for analyzing sidewinding and tumbling locomotion of the COBRA snake robot across rigid, compliant, and granular terrains. A contact-implicit formulation is used to model distributed frictional interactions during sidewinding, and validated through MATLAB Simscape simulations and physical experiments on rigid ground and loose sand. To capture terrain deformation effects, Project Chrono's Soil Contact Model (SCM) is integrated with the articulated multibody dynamics, enabling prediction of slip, sinkage, and load redistribution that reduce stride efficiency on deformable substrates. For high-energy rolling locomotion on steep slopes, the Chrono DEM Engine is used to simulate particle-resolved granular interactions, revealing soil failure, intermittent lift-off, and energy dissipation mechanisms not captured by rigid models. Together, these methods span real-time control-oriented simulation and high-fidelity granular physics. Results demonstrate that rigid-ground models provide accurate short-horizon motion prediction, while continuum and particle-based terrain modeling becomes necessary for reliable mobility analysis in soft and highly dynamic environments. This work establishes a hierarchical simulation pipeline that advances robust, terrain-aware locomotion for robots operating in challenging unstructured settings.
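This thesis presents a unified modeling and simulation framework for analyzing the sidewinding and tumbling locomotion of the COBRA snake robot on rigid, compliant, and granular terrain. A contact-implicit formulation models the distributed frictional interactions during sidewinding and is validated through MATLAB Simscape simulations and physical experiments on rigid ground and loose sand. To capture the effects of terrain deformation, Project Chrono's Soil Contact Model (SCM) is integrated with the articulated multibody dynamics, which makes it possible to predict the slip, sinkage, and load redistribution that reduce stride efficiency on soft substrates. For high-energy rolling locomotion on steep slopes, the Chrono DEM Engine is used to simulate particle-level granular interactions, revealing soil failure, intermittent lift-off, and energy dissipation mechanisms that rigid models cannot capture. Together, these methods span real-time control-oriented simulation and high-fidelity granular physics. The results show that rigid-ground models give accurate motion predictions over short horizons, whereas continuum and particle-based terrain models become necessary for reliable mobility analysis in soft and highly dynamic environments. This work establishes a hierarchical simulation pipeline that advances robust, terrain-aware locomotion for robots in challenging unstructured environments.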
https://arxiv.org/abs/2512.05008
Current upper limb prostheses aim to enhance user independence in daily activities by incorporating basic motor functions. However, they fall short of replicating the natural movement and interaction capabilities of the human arm. In contrast, human limbs leverage intrinsic compliance and actively modulate joint stiffness, enabling adaptive responses to varying tasks, impact absorption, and efficient energy transfer during dynamic actions. Inspired by this adaptability, we developed a transhumeral prosthesis with Variable Stiffness Actuators (VSAs) to replicate the controllable compliance found in biological joints. The proposed prosthesis features a modular design, allowing customization for different residual limb shapes and accommodating a range of independent control signals derived from users' biological cues. Integrated elastic elements passively support more natural movements, facilitate safe interactions with the environment, and adapt to diverse task requirements. This paper presents a comprehensive overview of the platform and its functionalities, highlighting its potential applications in the field of prosthetics.
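Current upper limb prostheses aim to increase users' independence in daily activities by incorporating basic motor functions, but they cannot replicate the natural movement and interaction capabilities of the human arm. Human limbs, by contrast, exploit intrinsic compliance and actively modulate joint stiffness, which allows them to respond adaptively to different tasks, absorb impacts, and transfer energy efficiently during dynamic actions. Inspired by this adaptability, we developed a transhumeral prosthesis that uses Variable Stiffness Actuators (VSAs) to replicate the controllable compliance of biological joints. The prosthesis has a modular design that can be customized for different residual limb shapes and can accommodate a range of independent control signals derived from the user's biological signals. Integrated elastic elements passively support more natural movements, facilitate safe interaction with the environment, and adapt to diverse task requirements. This paper gives a comprehensive overview of the platform and its functions and highlights its potential applications in prosthetics.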
https://arxiv.org/abs/2512.04998
Variable Stiffness Actuators prove invaluable for robotics applications in unstructured environments, fostering safe interactions and enhancing task adaptability. Nevertheless, their mechanical design inevitably results in larger and heavier structures compared to classical rigid actuators. This paper introduces a novel 3 Degrees of Freedom (DoFs) parallel wrist that achieves variable stiffness through redundant elastic actuation. Leveraging its parallel architecture, the device employs only four motors, rendering it compact and lightweight. This characteristic makes it particularly well-suited for applications in prosthetics or humanoid robotics. The manuscript delves into the theoretical model of the device and proposes a sophisticated control strategy for independent regulation of joint position and stiffness. Furthermore, it validates the proposed controller through simulation, utilizing a comprehensive analysis of the system dynamics. The reported results affirm the ability of the device to achieve high accuracy and disturbance rejection in rigid configurations while minimizing interaction forces with its compliant behavior.
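Variable Stiffness Actuators have proven invaluable for robotic applications in unstructured environments, where they promote safe interaction and improve task adaptability. Their mechanical design, however, inevitably makes them larger and heavier than conventional rigid actuators. This paper introduces a novel parallel wrist with 3 Degrees of Freedom (DoFs) that achieves variable stiffness through redundant elastic actuation. Thanks to its parallel architecture, the device uses only four motors, which keeps it compact and lightweight and makes it particularly well suited to prosthetics and humanoid robotics. The paper examines the theoretical model of the device and proposes a sophisticated control strategy for regulating joint position and stiffness independently. The proposed controller is validated in simulation, supported by a comprehensive analysis of the system dynamics. The reported results confirm that the device achieves high accuracy and disturbance rejection in rigid configurations, while its compliant behavior minimizes interaction forces.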
https://arxiv.org/abs/2512.04973
Although visuomotor policies obtained via imitation learning perform well in complex manipulation tasks, they usually struggle to achieve the same accuracy and speed as traditional control-based methods. In this work, we introduce Hybrid-Diffusion models that combine open-loop routines with visuomotor diffusion policies. We develop Teleoperation Augmentation Primitives (TAPs) that allow the operator to seamlessly perform predefined routines during demonstrations, such as locking specific axes, moving to perching waypoints, or triggering task-specific routines. Our Hybrid-Diffusion method learns to trigger such TAPs during inference. We validate the method on challenging real-world tasks: Vial Aspiration, Open-Container Liquid Transfer, and Container Unscrewing. All experimental videos are available on the project's website: this https URL
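Although visuomotor policies obtained through imitation learning perform well in complex manipulation tasks, they usually struggle to match the accuracy and speed of traditional control methods. In this work, we introduce Hybrid-Diffusion models, which combine open-loop routines with visuomotor diffusion policies. We develop Teleoperation Augmentation Primitives (TAPs), which let the operator seamlessly execute predefined routines during demonstrations, such as locking specific axes, moving to perching waypoints, or triggering task-specific routines. Our Hybrid-Diffusion method learns to trigger these TAPs during inference. We validate the method on challenging real-world tasks: Vial Aspiration, Open-Container Liquid Transfer, and container unscrewing. All experimental videos are available on the project website: this https URL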
https://arxiv.org/abs/2512.04960
This work investigates how disturbance-aware, robustness-embedded reference trajectories translate into driving performance when executed by professional drivers in a dynamic simulator. Three planned reference trajectories are compared against a free-driving baseline (NOREF) to assess trade-offs between lap time (LT) and steering effort (SE): NOM, the nominal time-optimal trajectory; TLC, a track-limit-robust trajectory obtained by tightening margins to the track edges; and FLC, a friction-limit-robust trajectory obtained by tightening against axle and tire saturation. All trajectories share the same minimum lap-time objective with a small steering-smoothness regularizer and are evaluated by two professional drivers using a high-performance car on a virtual track. The trajectories derive from a disturbance-aware minimum-lap-time framework recently proposed by the authors, where worst-case disturbance growth is propagated over a finite horizon and used to tighten tire-friction and track-limit constraints, preserving performance while providing probabilistic safety margins. LT and SE are used as performance indicators, while RMS lateral deviation, speed error, and drift angle characterize driving style. Results show a Pareto-like LT-SE trade-off: NOM yields the shortest LT but highest SE; TLC minimizes SE at the cost of longer LT; FLC lies near the efficient frontier, substantially reducing SE relative to NOM with only a small LT increase. Removing trajectory guidance (NOREF) increases both LT and SE, confirming that reference trajectories improve pace and control efficiency. Overall, the findings highlight reference-based and disturbance-aware planning, especially FLC, as effective tools for training and for achieving fast yet stable trajectories.
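This study investigates how disturbance-aware, robustness-embedded reference trajectories translate into driving performance when professional drivers execute them in a dynamic simulator. Three planned reference trajectories are compared against a free-driving baseline (NOREF) to assess the trade-off between lap time (LT) and steering effort (SE): NOM, the nominal time-optimal trajectory; TLC, a track-limit-robust trajectory obtained by tightening the margins to the track edges; and FLC, a friction-limit-robust trajectory obtained by tightening against axle and tire saturation. All trajectories share the same minimum lap-time objective with a small steering-smoothness regularizer and are evaluated by two professional drivers using a high-performance car on a virtual track. The trajectories come from a disturbance-aware minimum-lap-time framework recently proposed by the authors, in which worst-case disturbance growth is propagated over a finite horizon and used to tighten the tire-friction and track-limit constraints, preserving performance while providing probabilistic safety margins. LT and SE serve as performance indicators, while RMS lateral deviation, speed error, and drift angle characterize driving style. The results show a Pareto-like LT-SE trade-off: NOM gives the shortest LT but the highest SE; TLC minimizes SE at the cost of a longer LT; FLC lies near the efficient frontier, substantially reducing SE relative to NOM with only a small LT increase. Removing trajectory guidance (NOREF) increases both LT and SE, confirming that reference trajectories improve pace and control efficiency. Overall, the findings highlight reference-based and disturbance-aware planning, especially FLC, as effective tools for training and for achieving fast yet stable trajectories.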
https://arxiv.org/abs/2512.04917
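The Pareto-like LT-SE trade-off described in the entry above can be illustrated with a small non-dominated filter. This is only a sketch: the abstract reports no numbers, so the lap times and steering efforts below are made-up placeholders chosen to match the qualitative ordering it describes, and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Return the names of points not dominated by any other point.

    points maps a name to (lap_time, steering_effort); both are minimized.
    """
    front = []
    for name, (lt, se) in points.items():
        dominated = any(
            o_lt <= lt and o_se <= se and (o_lt < lt or o_se < se)
            for other, (o_lt, o_se) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only, NOT results from the paper.
runs = {
    "NOM": (90.0, 1.00),    # fastest lap, highest steering effort
    "TLC": (91.5, 0.70),    # lowest steering effort, slower lap
    "FLC": (90.3, 0.78),    # close to NOM in lap time, much lower effort
    "NOREF": (92.0, 1.10),  # free driving: worse on both axes
}
front = pareto_front(runs)
print(front)
```

With these placeholder values the three reference trajectories form the front and the free-driving baseline is dominated, which mirrors the qualitative claim in the abstract.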
We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments: (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiment provides synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers both to evaluate how well methods transfer between human and robotic viewpoints and to investigate underexplored modalities such as force sensing and prediction.
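We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen, what is done, and what is felt during real human interaction. The dataset contains 3048 sequences covering 381 articulated objects in 38 environments. Each object is operated under four embodiments: (i) human hand; (ii) human hand with a wrist-mounted camera; (iii) handheld UMI gripper; and (iv) a custom Hoi! gripper, where the tool embodiment provides synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, allowing researchers to evaluate how well methods transfer between human and robotic viewpoints and to study underexplored modalities such as force sensing and prediction.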
https://arxiv.org/abs/2512.04884
Imitation learning has shown immense promise for robotic manipulation, yet its practical deployment is fundamentally constrained by data scarcity. Despite prior work on collecting large-scale datasets, a significant gap to robust spatial generalization remains. We identify a key limitation: individual trajectories, regardless of their length, are typically collected from a single, static spatial configuration of the environment. This includes fixed object and target positions as well as unchanging camera viewpoints, which significantly restricts the diversity of spatial information available for learning. To address this critical bottleneck in data efficiency, we propose MOtion-Based Variability Enhancement (MOVE), a simple yet effective data collection paradigm that enables the acquisition of richer spatial information from dynamic demonstrations. Our core contribution is an augmentation strategy that injects motion into any movable objects within the environment for each demonstration. This process implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We conduct extensive experiments in both simulation and real-world environments to validate our approach. For example, in simulation tasks requiring strong spatial generalization, MOVE achieves an average success rate of 39.1%, a 76.1% relative improvement over the static data collection paradigm (22.2%), and yields up to 2-5x gains in data efficiency on certain tasks. Our code is available at this https URL.
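Imitation learning has shown great promise for robotic manipulation, but its practical deployment is fundamentally limited by data scarcity. Although earlier work has made progress in collecting large-scale datasets, a significant gap to robust spatial generalization remains. We identify a key limitation: regardless of its length, an individual trajectory is usually collected from a single, static spatial configuration of the environment, with fixed object and target positions and a fixed camera viewpoint, which greatly restricts the diversity of spatial information available for learning. To overcome this data-efficiency bottleneck, we propose MOtion-Based Variability Enhancement (MOVE), a simple yet effective data collection paradigm that obtains richer spatial information by injecting motion into the movable objects in the environment during each demonstration. This strategy implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We validate the approach with extensive experiments in simulation and in real-world environments. For example, in simulation tasks that require strong spatial generalization, MOVE reaches an average success rate of 39.1%, a 76.1% relative improvement over the static data collection paradigm (22.2%), and yields data-efficiency gains of up to 2-5x on certain tasks. Our code is available at this https URL.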
https://arxiv.org/abs/2512.04813
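As a quick arithmetic check of the relative-improvement figure quoted in the MOVE entry above, the two success rates from the abstract can be plugged into the usual definition (new minus baseline, divided by baseline). The helper name `relative_improvement` is ours, not the authors'.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


move_rate = 39.1    # average success rate with MOVE, from the abstract (%)
static_rate = 22.2  # average success rate with static collection, from the abstract (%)

gain = relative_improvement(move_rate, static_rate)
print(f"{gain:.1f}% relative improvement")
```

The absolute gain is 16.9 percentage points; dividing by the 22.2% baseline gives the 76.1% relative figure stated in the abstract.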
We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.
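We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built on a Gemini foundation model, SIMA 2 is a significant step toward active, goal-directed interaction within embodied environments. Unlike prior work (e.g., SIMA 1), which was limited to simple language commands, SIMA 2 acts as an interactive partner that can reason about high-level goals, converse with the user, and handle complex instructions given through language and images. Across a diverse set of games, SIMA 2 substantially narrows the gap with human performance and generalizes robustly to previously unseen environments, while retaining the base model's core reasoning capabilities. We also demonstrate a capacity for open-ended self-improvement: by using Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward versatile, continuously learning agents for virtual worlds and, eventually, the physical world.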
https://arxiv.org/abs/2512.04797
Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving to the next point. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time flying back. To make this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can improve the worst-case mission time by up to 4.1x. Also, the achieved median mission time is very close, at most 2.7% higher, to that of a method with perfect knowledge of the current underlying event probability at each point of interest.
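Drones are becoming indispensable in many application domains. In data-driven missions, besides collecting sensor data, the drone must process that data at runtime to decide whether further action is needed on the spot before it moves to the next point of interest. If processing reveals no event or situation that calls for such an action, the drone has wasted time waiting when it could have moved on. If, however, the drone sets off for the next point and only then finds that a follow-up action is needed at the previous one, it must spend extra time flying back. To make this decision, we propose several machine-learning methods based on branch prediction and reinforcement learning, and we evaluate them across a wide range of scenarios in which the probability of an event changes over time. Our results show that the proposed methods consistently outperform the regression-based method from the literature and can improve the worst-case mission time by up to 4.1x. In addition, the median mission time achieved is very close, at most 2.7% higher, to that of a hypothetical method that knows the current event probability at every point of interest.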
https://arxiv.org/abs/2512.04773
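The wait-or-move decision in the drone entry above resembles a branch prediction problem. The sketch below uses a classic 2-bit saturating counter as the predictor; the abstract does not say which branch predictor the authors use, and the cost model (`t_process` for waiting, `t_flyback` for a wrong early departure) and all names are simplifying assumptions made here for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict 'no event', 2-3 predict 'event'."""

    def __init__(self, state: int = 1) -> None:
        self.state = state

    def predict_event(self) -> bool:
        return self.state >= 2

    def update(self, event_occurred: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if event_occurred:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mission_overhead(events, predictor, t_process, t_flyback):
    """Extra time spent waiting or flying back over a sequence of points of interest."""
    overhead = 0.0
    for event in events:
        if predictor.predict_event():
            overhead += t_process   # wait on the spot until processing finishes
        elif event:
            overhead += t_flyback   # left early, must return for the follow-up action
        predictor.update(event)
    return overhead


# Hypothetical mission: no events at the first five points, events at the last five.
events = [False] * 5 + [True] * 5
overhead = mission_overhead(events, TwoBitPredictor(), t_process=2.0, t_flyback=5.0)
print(overhead)
```

Under this toy cost model, always waiting would cost 20.0 and never waiting 25.0, so a predictor that adapts after two mispredictions lands below both.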
In recent years, precision agriculture has been introducing groundbreaking innovations in the field, with a strong focus on automation. However, research studies in robotics and autonomous navigation often rely on controlled simulations or isolated field trials. The absence of a realistic common benchmark represents a significant limitation for the diffusion of robust autonomous systems under real complex agricultural conditions. Vineyards pose significant challenges due to their dynamic nature, and they are increasingly drawing attention from both academic and industrial stakeholders interested in automation. In this context, we introduce the TEMPO-VINE dataset, a large-scale multi-temporal dataset specifically designed for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques within operational vineyard environments. TEMPO-VINE is the first multi-modal public dataset that brings together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows exceeding 100 m in length. In this work, we address a critical gap in the landscape of agricultural datasets by providing researchers with a comprehensive data collection and ground truth trajectories in different seasons, vegetation growth stages, terrain and weather conditions. The sequence paths with multiple runs and revisits will foster the development of sensor fusion, localization, mapping and place recognition solutions for agricultural fields. The dataset, the processing tools and the benchmarking results will be available at the dedicated webpage upon acceptance.
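In recent years, precision agriculture has introduced many groundbreaking innovations, with a strong focus on automation. However, research in robotics and autonomous navigation often relies on controlled simulations or isolated field trials. The lack of a realistic common benchmark is a significant obstacle to the spread of robust autonomous systems under complex real agricultural conditions. Vineyards pose particular challenges because of their dynamic nature, and they are attracting growing attention from academic and industrial stakeholders interested in automation. In this context, we introduce TEMPO-VINE, a large-scale multi-temporal dataset designed specifically for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques in operational vineyard environments. TEMPO-VINE is the first public multi-modal dataset to bring together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows longer than 100 m. With this work, we fill a critical gap among agricultural datasets by giving researchers a comprehensive data collection and ground truth trajectories across different seasons, vegetation growth stages, terrain, and weather conditions. The sequence paths, with multiple runs and revisits, will foster the development of sensor fusion, localization, mapping, and place recognition solutions for agricultural fields. The dataset, the processing tools, and the benchmarking results will be released on a dedicated webpage upon acceptance.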
https://arxiv.org/abs/2512.04772
Brain-body co-evolution enables animals to develop complex behaviors in their environments. Inspired by this biological synergy, embodied co-design (ECD) has emerged as a transformative paradigm for creating intelligent agents, from virtual creatures to physical robots, by jointly optimizing their morphologies and controllers rather than treating control in isolation. This integrated approach facilitates richer environmental interactions and robust task performance. In this survey, we provide a systematic overview of recent advances in ECD. We first formalize the concept of ECD and position it within related fields. We then introduce a hierarchical taxonomy: a lower layer that breaks down agent design into three fundamental components (controlling brain, body morphology, and task environment) and an upper layer that integrates these components into four major ECD frameworks: bi-level, single-level, generative, and open-ended. This taxonomy allows us to synthesize insights from more than one hundred recent studies. We further review notable benchmarks, datasets, and applications in both simulated and real-world scenarios. Finally, we identify significant challenges and offer insights into promising future research directions. A project associated with this survey has been created at this https URL.
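Brain-body co-evolution allows animals to develop complex behaviors in their environments. Inspired by this biological synergy, embodied co-design (ECD) has emerged as a transformative paradigm for creating intelligent agents, from virtual creatures to physical robots, by optimizing their morphologies and controllers jointly rather than treating control in isolation. This integrated approach enables richer interaction with the environment and robust task performance. In this survey, we give a systematic overview of recent advances in ECD. We first formalize the concept of ECD and position it among related fields. We then introduce a hierarchical taxonomy: a lower layer that breaks agent design into three fundamental components (controlling brain, body morphology, and task environment) and an upper layer that integrates these components into four major ECD frameworks: bi-level, single-level, generative, and open-ended. This taxonomy lets us synthesize insights from more than one hundred recent studies. We also review notable benchmarks, datasets, and applications in simulated and real-world scenarios. Finally, we identify major challenges and offer insights into promising directions for future research. A project associated with this survey is available at this https URL.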
https://arxiv.org/abs/2512.04770
Cross-domain transfer in robotic manipulation remains a longstanding challenge due to the significant domain gap between simulated and real-world environments. Existing methods such as domain randomization, adaptation, and sim-real calibration often require extensive tuning or fail to generalize to unseen scenarios. To address this issue, we observe that if domain-invariant features are utilized during policy training in simulation, and the same features can be extracted and provided as the input to policy during real-world deployment, the domain gap can be effectively bridged, leading to significantly improved policy generalization. Accordingly, we propose Semantic 2D Gaussian Splatting (S2GS), a novel representation method that extracts object-centric, domain-invariant spatial features. S2GS constructs multi-view 2D semantic fields and projects them into a unified 3D space via feature-level Gaussian splatting. A semantic filtering mechanism removes irrelevant background content, ensuring clean and consistent inputs for policy learning. To evaluate the effectiveness of S2GS, we adopt Diffusion Policy as the downstream learning algorithm and conduct experiments in the ManiSkill simulation environment, followed by real-world deployment. Results demonstrate that S2GS significantly improves sim-to-real transferability, maintaining high and stable task performance in real-world scenarios.
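Cross-domain transfer in robotic manipulation remains a longstanding challenge, mainly because of the significant domain gap between simulated and real-world environments. Existing methods such as domain randomization, adaptation, and sim-real calibration often require extensive tuning or fail to generalize to unseen scenarios. To address this, we observe that if domain-invariant features are used when training the policy in simulation, and the same features can be extracted and fed to the policy during real-world deployment, the domain gap can be bridged effectively, which significantly improves policy generalization. Based on this observation, we propose Semantic 2D Gaussian Splatting (S2GS), a new representation method that extracts object-centric, domain-invariant spatial features. S2GS builds multi-view 2D semantic fields and projects them into a unified 3D space via feature-level Gaussian splatting. A semantic filtering mechanism removes irrelevant background content, giving policy learning clean and consistent inputs. To evaluate S2GS, we adopt Diffusion Policy as the downstream learning algorithm and run experiments in the ManiSkill simulation environment, followed by real-world deployment. The results show that S2GS significantly improves sim-to-real transferability and maintains high, stable task performance in real-world scenarios.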
https://arxiv.org/abs/2512.04731
Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently attracted attention, particularly in the robotics and reinforcement learning communities. MPPI has been widely applied as a GPU-accelerated random search method to deterministic direct single-shooting optimal control problems arising in model predictive control (MPC) formulations. MPPI offers several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. However, its performance can deteriorate in high-dimensional settings since the optimal control problem is solved via Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that incorporates a Jacobian reconstruction technique and the second-order Generalized Gauss-Newton method. This novel approach is called Gauss-Newton accelerated MPPI. The numerical results show that the Gauss-Newton accelerated MPPI approach substantially improves MPPI scalability and computational efficiency while preserving the key benefits of the classical MPPI framework, making it a promising approach even for high-dimensional problems.
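Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently drawn wide attention in the robotics and reinforcement learning communities. As a GPU-accelerated random search method, MPPI has been widely applied to the deterministic direct single-shooting optimal control problems that arise in model predictive control (MPC) formulations. MPPI has several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. Its performance can degrade in high-dimensional settings, however, because the optimal control problem is solved by Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that combines a Jacobian reconstruction technique with the second-order Generalized Gauss-Newton method. The new method is called Gauss-Newton accelerated MPPI. Numerical results show that Gauss-Newton accelerated MPPI substantially improves the scalability and computational efficiency of MPPI while keeping the key benefits of the classical MPPI framework, which makes it a promising approach for high-dimensional problems as well.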
https://arxiv.org/abs/2512.04579
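For readers unfamiliar with the classical MPPI baseline that the entry above builds on, here is a minimal pure-Python sketch of its sampling-and-reweighting update on a toy one-dimensional problem. It is a simplified textbook-style form and does not include the paper's Jacobian reconstruction or Gauss-Newton step; the toy dynamics, cost, and all parameter values (`sigma`, `lam`, `n_samples`) are assumptions chosen for illustration.

```python
import math
import random


def rollout_cost(controls, target=1.0):
    """Cost of a control sequence for the toy system x[t+1] = x[t] + u[t], x[0] = 0."""
    x, cost = 0.0, 0.0
    for u in controls:
        x += u
        cost += (x - target) ** 2 + 0.01 * u ** 2
    return cost


def mppi_step(nominal, cost_fn, rng, n_samples=512, sigma=0.3, lam=1.0):
    """One MPPI update: perturb the nominal controls, roll out, reweight the noise."""
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(n_samples)]
    costs = [cost_fn([u + e for u, e in zip(nominal, eps)]) for eps in noises]
    c_min = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    # New nominal control = old control + importance-weighted average of the noise.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, u in enumerate(nominal)
    ]


rng = random.Random(0)        # fixed seed so the run is repeatable
controls = [0.0] * 10         # horizon of 10 steps, start from zero controls
initial_cost = rollout_cost(controls)
for _ in range(50):
    controls = mppi_step(controls, rollout_cost, rng)
final_cost = rollout_cost(controls)
print(f"cost: {initial_cost:.2f} -> {final_cost:.2f}")
```

Every sample requires a full rollout, and the number of samples needed to cover the control space grows with its dimension, which is the scaling issue the abstract says the Gauss-Newton acceleration targets.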
Ensemble control aims to steer a population of dynamical systems using a shared control input. This paper introduces a constrained ensemble control framework for parameterized, heterogeneous robotic systems operating under state and environmental constraints, such as obstacle avoidance. We develop a moment kernel transform that maps the parameterized ensemble dynamics to the moment system in a kernel space, enabling the characterization of population-level behavior. The state-space constraints, such as polyhedral waypoints to be visited and obstacles to be avoided, are also transformed into the moment space, leading to a unified formulation for safe, large-scale ensemble control. Expressive signal temporal logic specifications are employed to encode complex visit-avoid tasks, which are achieved through a single shared controller synthesized from our constrained ensemble control formulation. Simulation and hardware experiments demonstrate the effectiveness of the proposed approach in safely and efficiently controlling robotic ensembles within constrained environments.
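Ensemble control aims to steer a population of dynamical systems with a shared control input. This paper introduces a constrained ensemble control framework for parameterized, heterogeneous robotic systems that operate under state and environmental constraints such as obstacle avoidance. We develop a moment kernel transform that maps the parameterized ensemble dynamics to a moment system in a kernel space, which makes it possible to characterize population-level behavior. State-space constraints, such as polyhedral waypoints to be visited and obstacles to be avoided, are likewise transformed into the moment space, giving a unified formulation for safe, large-scale ensemble control. Expressive signal temporal logic specifications encode complex visit-avoid tasks, which are accomplished by a single shared controller synthesized from our constrained ensemble control formulation. Simulation and hardware experiments demonstrate that the approach controls robotic ensembles safely and efficiently in constrained environments.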
https://arxiv.org/abs/2512.04502
We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX's value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning -- particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: this https URL
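We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO in the Robotic Warehouse (RWARE) environment and in a custom Unity 3D simulation. Our experiments show that QMIX's value decomposition significantly outperforms independent learning approaches, reaching a mean return of 3.25 versus 0.38 for advanced IPPO. QMIX, however, requires extensive hyperparameter tuning, in particular extended epsilon annealing (more than 5M steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: this https URL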
https://arxiv.org/abs/2512.04463
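The "extended epsilon annealing" mentioned in the entry above can be pictured as a slow exploration schedule. The sketch below assumes a linear decay; only the 5M-step scale comes from the abstract, while the start and end values (1.0 and 0.05), the linear shape, and the function name are assumptions for illustration.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, anneal_steps=5_000_000):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return eps_start + frac * (eps_end - eps_start)


for step in (0, 1_000_000, 2_500_000, 5_000_000, 6_000_000):
    print(step, round(linear_epsilon(step), 3))
```

With a schedule this long, the agents still explore with probability above one half at 2.5M steps, which gives them more chances to stumble on a sparse delivery reward before exploitation takes over.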
To collaborate with humans, robots must infer goals that are often ambiguous, difficult to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend exclusively on explicit instructions, making them brittle in real-world interactions. We present BALI (Bidirectional Action-Language Inference) for goal prediction, a method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human, asks clarifying questions only when the expected information gain from the answer outweighs the cost of interruption, and selects supportive actions that align with inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared to baselines, BALI yields more stable goal predictions and significantly fewer mistakes.
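To collaborate with humans, robots must infer goals that are often ambiguous, hard to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend entirely on explicit instructions, which makes them brittle in real-world interaction. We present BALI (Bidirectional Action-Language Inference), a goal prediction method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human and asks a clarifying question only when the expected information gain from the answer outweighs the cost of interruption. It also selects supportive actions that are consistent with the inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared with baselines, BALI produces more stable goal predictions and significantly fewer mistakes.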
https://arxiv.org/abs/2512.04453
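The ask-only-when-worthwhile rule in the BALI entry above can be sketched as a comparison between expected information gain and an interruption cost. The abstract does not specify how either quantity is computed, so the entropy-based gain, the cost expressed in bits, and every name below are assumptions for illustration rather than the authors' formulation.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihoods):
    """Expected entropy reduction over goals from hearing the answer.

    prior[g] is the current belief in goal g;
    likelihoods[a][g] is P(answer a | goal g).
    """
    gain = entropy(prior)
    for lik in likelihoods:
        p_answer = sum(l * p for l, p in zip(lik, prior))
        if p_answer == 0:
            continue
        posterior = [l * p / p_answer for l, p in zip(lik, prior)]
        gain -= p_answer * entropy(posterior)
    return gain


def should_ask(prior, likelihoods, interruption_cost):
    """Ask only if the expected gain (bits) exceeds the assumed cost (bits)."""
    return expected_information_gain(prior, likelihoods) > interruption_cost


# A yes/no question that perfectly separates two candidate goals.
perfect_question = [[1.0, 0.0], [0.0, 1.0]]
print(should_ask([0.5, 0.5], perfect_question, interruption_cost=0.5))
print(should_ask([0.95, 0.05], perfect_question, interruption_cost=0.5))
```

When the robot is undecided between the two goals, the question is worth a full bit and gets asked; once the belief is already 95/5, the same question is worth far less than the assumed cost and the robot stays quiet.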
Automating disassembly of critical components from end-of-life (EoL) desktops, such as high-value items like RAM modules and CPUs, as well as sensitive parts like hard disk drives, remains challenging due to the inherent variability and uncertainty of these products. Moreover, their disassembly requires sequential, precise, and dexterous operations, further increasing the complexity of automation. Current robotic disassembly processes are typically divided into several stages: perception, sequence planning, task planning, motion planning, and manipulation. Each stage requires explicit modeling, which limits generalization to unfamiliar scenarios. Recent development of vision-language-action (VLA) models has presented an end-to-end approach for general robotic manipulation tasks. Although VLAs have demonstrated promising performance on simple tasks, the feasibility of applying such models to complex disassembly remains largely unexplored. In this paper, we collected a customized dataset for robotic RAM and CPU disassembly and used it to fine-tune two well-established VLA approaches, OpenVLA and OpenVLA-OFT, as a case study. We divided the whole disassembly task into several small steps, and our preliminary experimental results indicate that the fine-tuned VLA models can faithfully complete multiple early steps but struggle with certain critical subtasks, leading to task failure. However, we observed that a simple hybrid strategy that combines VLA with a rule-based controller can successfully perform the entire disassembly operation. These findings highlight the current limitations of VLA models in handling the dexterity and precision required for robotic EoL product disassembly. By offering a detailed analysis of the observed results, this study provides insights that may inform future research to address current challenges and advance end-to-end robotic automated disassembly.
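Automating the disassembly of critical components from end-of-life (EoL) desktops, such as high-value RAM modules and CPUs and sensitive parts like hard disk drives, remains challenging, mainly because of the inherent variability and uncertainty of these products. Their disassembly also requires sequential, precise, and dexterous operations, which further increases the complexity of automation. Current robotic disassembly processes are usually divided into several stages: perception, sequence planning, task planning, motion planning, and manipulation. Each stage requires explicit modeling, which limits generalization to unfamiliar scenarios. The recent development of vision-language-action (VLA) models offers an end-to-end approach to general robotic manipulation tasks. Although VLA models have shown promising performance on simple tasks, whether they can be applied to complex disassembly remains largely unexplored. In this paper, we collect a customized dataset for robotic RAM and CPU disassembly and use it to fine-tune two well-established VLA approaches, OpenVLA and OpenVLA-OFT, as a case study. We divide the whole disassembly task into several small steps, and our preliminary experimental results show that the fine-tuned VLA models can faithfully complete several early steps but struggle with certain critical subtasks, which leads to task failure. We observe, however, that a simple hybrid strategy that combines a VLA with a rule-based controller can perform the entire disassembly operation successfully. These findings highlight the current limitations of VLA models in handling the dexterity and precision required for robotic disassembly of EoL products. Through a detailed analysis of the observed results, this study provides insights that may inform future research on addressing current challenges and advancing end-to-end robotic automated disassembly.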
https://arxiv.org/abs/2512.04446
Physical feasibility in 3D bin packing is a key requirement in modern industrial logistics and robotic automation. With the growing adoption of industrial automation, online bin packing has gained increasing attention. However, inconsistencies in problem settings, test datasets, and evaluation metrics have hindered progress in the field, and there is a lack of a comprehensive benchmarking system. Direct testing on real hardware is costly, and building a realistic simulation environment is also challenging. To address these limitations, we introduce RoboBPP, a benchmarking system designed for robotic online bin packing. RoboBPP integrates a physics-based simulator to assess physical feasibility. In our simulation environment, we introduce a robotic arm and boxes at real-world scales to replicate real industrial packing workflows. By simulating conditions that arise in real industrial applications, we ensure that evaluated algorithms are practically deployable. In addition, prior studies often rely on synthetic datasets whose distributions differ from real-world industrial data. To address this issue, we collect three datasets from real industrial workflows, including assembly-line production, logistics packing, and furniture manufacturing. The benchmark comprises three carefully designed test settings and extends existing evaluation metrics with new metrics for structural stability and operational safety. We design a scoring system and derive a range of insights from the evaluation results. RoboBPP is fully open-source and is equipped with visualization tools and an online leaderboard, providing a reproducible and extensible foundation for future research and industrial applications (this https URL).
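Physical feasibility in 3D bin packing is a key requirement in modern industrial logistics and robotic automation. With the growing adoption of industrial automation, online bin packing has received increasing attention. However, inconsistencies in problem settings, test datasets, and evaluation metrics have hindered progress in the field, and a comprehensive benchmarking system is lacking. Testing directly on real hardware is costly, and building a realistic simulation environment is also challenging. To address these problems, we introduce RoboBPP, a benchmarking system designed for robotic online bin packing. RoboBPP integrates a physics-based simulator to assess physical feasibility. In our simulation environment, we introduce a robotic arm and boxes at real-world scale to replicate real industrial packing workflows. By simulating the conditions that arise in real industrial applications, we ensure that the evaluated algorithms are practically deployable. In addition, prior studies often rely on synthetic datasets whose distributions differ from real industrial data. To address this, we collect three datasets from real industrial workflows, covering assembly-line production, logistics packing, and furniture manufacturing. The benchmark comprises three carefully designed test settings and extends existing evaluation metrics with new metrics for structural stability and operational safety. We also design a scoring system and draw a range of insights from the evaluation results. RoboBPP is fully open-source and comes with visualization tools and an online leaderboard, providing a reproducible and extensible foundation for future research and industrial applications (this https URL).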
https://arxiv.org/abs/2512.04415
This paper proposes an Interactive Inference Behavior Tree (IIBT) framework that integrates behavior trees (BTs) with active inference under the free energy principle for distributed multi-robot decision-making. The proposed IIBT node extends conventional BTs with probabilistic reasoning, enabling online joint planning and execution across multiple robots. It remains fully compatible with standard BT architectures, allowing seamless integration into existing multi-robot control systems. Within this framework, multi-robot cooperation is formulated as a free-energy minimization process, where each robot dynamically updates its preference matrix based on perceptual inputs and peer intentions, thereby achieving adaptive coordination in partially observable and dynamic environments. The proposed approach is validated through both simulation and real-world experiments, including a multi-robot maze navigation task and a collaborative manipulation task, and is compared against traditional BTs (this https URL). Experimental results demonstrate that the IIBT framework reduces BT node complexity by over 70%, while maintaining robust, interpretable, and adaptive cooperative behavior under environmental uncertainty.
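This paper proposes an Interactive Inference Behavior Tree (IIBT) framework that combines behavior trees (BTs) with active inference under the free energy principle for distributed multi-robot decision-making. The proposed IIBT node extends conventional BTs with probabilistic reasoning, enabling online joint planning and execution across multiple robots. It is fully compatible with standard BT architectures and can be integrated seamlessly into existing multi-robot control systems. Within this framework, multi-robot cooperation is formulated as a free-energy minimization process in which each robot dynamically updates its preference matrix based on perceptual inputs and the intentions of its peers, achieving adaptive coordination in partially observable and dynamic environments. The approach is validated in simulation and in real-world experiments, including multi-robot maze navigation and a collaborative manipulation task, and is compared against traditional BTs (this https URL). Experimental results show that the IIBT framework reduces BT node complexity by more than 70% while maintaining robust, interpretable, and adaptive cooperative behavior under environmental uncertainty.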
https://arxiv.org/abs/2512.04404