Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k training examples. Project website: this https URL
https://arxiv.org/abs/2512.13678
Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements, an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: this https URL
https://arxiv.org/abs/2512.13670
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
https://arxiv.org/abs/2512.13660
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
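The combined objective described above can be sketched as a weighted sum of a latent-feature prediction loss and the auxiliary hand-consistency term. The function name, the mean-squared-error form, and the weight `lam` below are illustrative assumptions, not DexWM's published implementation:

```python
import numpy as np

def world_model_loss(pred_latent, target_latent, pred_hand, target_hand, lam=0.1):
    """Total training loss: next-state latent prediction error plus an
    auxiliary hand-consistency term that penalizes inaccurate hand
    configurations (illustrative weighting `lam`)."""
    visual = np.mean((pred_latent - target_latent) ** 2)  # visual-feature loss
    hand = np.mean((pred_hand - target_hand) ** 2)        # hand-configuration loss
    return visual + lam * hand
```

In practice the hand term would compare predicted and ground-truth joint angles or keypoints; the weight trades off visual fidelity against the fine-grained dexterity the abstract argues visual features alone cannot capture.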
https://arxiv.org/abs/2512.13644
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
https://arxiv.org/abs/2512.13609
Near-field perception is essential for the safe operation of autonomous mobile robots (AMRs) in manufacturing environments. Conventional ranging sensors such as light detection and ranging (LiDAR) and ultrasonic devices provide broad situational awareness but often fail to detect small objects near the robot base. To address this limitation, this paper presents a three-tier near-field perception framework. The first approach employs light-discontinuity detection, which projects a laser stripe across the near-field zone and identifies interruptions in the stripe to perform fast, binary cutoff sensing for obstacle presence. The second approach utilizes light-displacement measurement to estimate object height by analyzing the geometric displacement of a projected stripe in the camera image, which provides quantitative obstacle height information with minimal computational overhead. The third approach employs a computer vision-based object detection model on embedded AI hardware to classify objects, enabling semantic perception and context-aware safety decisions. All methods are implemented on a Raspberry Pi 5 system, achieving real-time performance at 25 or 50 frames per second. Experimental evaluation and comparative analysis demonstrate that the proposed hierarchy balances precision, computation, and cost, thereby providing a scalable perception solution for enabling safe operations of AMRs in manufacturing environments.
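The second, light-displacement approach reduces to laser triangulation: a stripe projected at a known angle lands short of its ground position when it hits the top of an object, and the observed lateral shift is proportional to object height. A minimal geometric sketch, where the camera scale `mm_per_pixel` and the projection angle are assumed calibration values, not the paper's actual setup:

```python
import math

def object_height(pixel_disp, mm_per_pixel, laser_angle_deg):
    """Estimate object height from the lateral displacement of a projected
    laser stripe. A stripe projected at `laser_angle_deg` above horizontal
    shifts by h / tan(angle) on the ground plane when it lands on an object
    of height h, so h = shift * tan(angle). Simple triangulation sketch."""
    ground_shift = pixel_disp * mm_per_pixel  # observed shift, mm on the ground plane
    return ground_shift * math.tan(math.radians(laser_angle_deg))
```

For example, with a 45° projection angle every millimeter of observed stripe shift corresponds to one millimeter of obstacle height, which is why the method yields quantitative height with almost no computation.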
https://arxiv.org/abs/2512.13561
Autonomous free-flyers play a critical role in intravehicular tasks aboard the International Space Station (ISS), where their precise docking under sensing noise, small actuation mismatches, and environmental variability remains a nontrivial challenge. This work presents a reinforcement learning (RL) framework for six-degree-of-freedom (6-DoF) docking of JAXA's Int-Ball2 robot inside a high-fidelity Isaac Sim model of the Japanese Experiment Module (JEM). Using Proximal Policy Optimization (PPO), we train and evaluate controllers under domain-randomized dynamics and bounded observation noise, while explicitly modeling propeller drag-torque effects and polarity structure. This enables a controlled study of how Int-Ball2's propulsion physics influence RL-based docking performance in constrained microgravity interiors. The learned policy achieves stable and reliable docking across varied conditions and lays the groundwork for future extensions pertaining to Int-Ball2 in collision-aware navigation, safe RL, propulsion-accurate sim-to-real transfer, and vision-based end-to-end docking.
https://arxiv.org/abs/2512.13514
To address the issues that arise due to the manual navigation of guidewires in endovascular interventions, research in medical robotics has taken a strong interest in developing robotically steerable guidewires, which offer the possibility of enhanced maneuverability and navigation, as the tip of the guidewire can be actively steered. The COaxially Aligned STeerable (COAST) guidewire robot has the ability to generate a wide variety of motions including bending motion with different bending lengths, follow-the-leader motion, and feedforward motion. In our past studies, we have explored different designs of the COAST guidewire robot and developed modeling, control, and sensing strategies for the COAST guidewire robot. In this study, the performance of a modified COAST guidewire robot is evaluated by conducting navigation experiments in an anatomical phantom model with pulsatile flow. The modified COAST guidewire robot is a simplified version of the COAST guidewire robot and consists of two tubes as opposed to three tubes. Through this study, we demonstrate the effectiveness of the modified COAST guidewire robot in navigating the tortuous phantom vasculature.
https://arxiv.org/abs/2512.13477
This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
https://arxiv.org/abs/2512.13411
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
https://arxiv.org/abs/2512.13399
This study proposes a step adaptation framework for running through spring-mass trajectories and deadbeat control gain libraries. It includes four main parts: (1) Automatic spring-mass trajectory library generation; (2) Deadbeat control gain library generation through an actively controlled template model that resembles the whole-body dynamics well; (3) Trajectory selection policy development for step adaptation; (4) Mapping spring-mass trajectories to a humanoid model through a whole-body control (WBC) framework also accounting for closed-kinematic chain systems, self collisions, and reactive limb swinging. We show the inclusiveness and the robustness of the proposed framework through various challenging and agile behaviors such as running through randomly generated stepping stones, jumping over random obstacles, performing slalom motions, changing the running direction suddenly with a random leg, and rejecting significant disturbances and uncertainties through the MuJoCo physics simulator. We also perform additional simulations under a comprehensive set of uncertainties and noise to better justify the proposed method's robustness to real-world challenges, including signal noise, imprecision, modeling errors, and delays. All the aforementioned behaviors are performed with a single library and the same set of WBC control parameters without additional tuning. The spring-mass and the deadbeat control gain library are automatically computed in 4.5 seconds in total for 315 different trajectories.
https://arxiv.org/abs/2512.13304
This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm that introduces intrinsic-motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: this https URL.
https://arxiv.org/abs/2512.13293
Cable-driven continuum robots (CDCRs) require accurate, real-time dynamic models for high-speed dynamics prediction or model-based control, making such capability an urgent need. In this paper, we propose the Lightweight Actuation-Space Energy Modeling (LASEM) framework for CDCRs, which formulates actuation potential energy directly in actuation space to enable lightweight yet accurate dynamic modeling. Through a unified variational derivation, the governing dynamics reduce to a single partial differential equation (PDE), requiring only the Euler moment balance while implicitly incorporating the Newton force balance. By also avoiding explicit computation of cable-backbone contact forces, the formulation simplifies the model structure and improves computational efficiency while preserving geometric accuracy and physical consistency. Importantly, the proposed framework for dynamic modeling natively supports both force-input and displacement-input actuation modes, a capability seldom achieved in existing dynamic formulations. Leveraging this lightweight structure, a Galerkin space-time modal discretization with analytical time-domain derivatives of the reduced state further enables an average 62.3% computational speedup over state-of-the-art real-time dynamic modeling approaches.
https://arxiv.org/abs/2512.13271
Autonomous Mobile Robots (AMRs) have become indispensable in industrial applications due to their operational flexibility and efficiency. Navigation serves as a crucial technical foundation for accomplishing complex tasks. However, navigating AMRs in dense, cluttered, and semi-structured environments remains challenging, primarily due to nonholonomic vehicle dynamics, interactions with mixed static/dynamic obstacles, and the non-convex constrained nature of such operational spaces. To solve these problems, this paper proposes an Improved Sequential Model Predictive Control (ISMPC) navigation framework that systematically reformulates navigation tasks as sequential switched optimal control problems. The framework addresses the aforementioned challenges through two key innovations: 1) Implementation of a Multi-Directional Safety Rectangular Corridor (MDSRC) algorithm, which encodes the free space through rectangular convex regions to avoid collision with static obstacles, eliminating redundant computational burdens and accelerating solver convergence; 2) A sequential MPC navigation framework that integrates corridor constraints with barrier function constraints is proposed to achieve static and dynamic obstacle avoidance. The ISMPC navigation framework enables direct velocity generation for AMRs, simplifying traditional navigation algorithm architectures. Comparative experiments demonstrate the framework's superiority in free-space utilization (an increase of 41.05% in the average corridor area) while maintaining real-time computational performance (average corridor generation latency of 3 ms).
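The idea of encoding free space as rectangular convex regions can be illustrated by greedily growing an axis-aligned rectangle of free cells around the robot on an occupancy grid. This is a generic convex-corridor sketch; the paper's exact multi-directional MDSRC construction is not reproduced here, and the grid representation is an assumption:

```python
import numpy as np

def grow_rectangle(free, seed, max_iter=1000):
    """Grow an axis-aligned rectangle of free cells around `seed` (which is
    assumed to lie in free space) on a boolean occupancy grid, expanding each
    side one cell at a time while the new row/column stays obstacle-free.
    Returns inclusive bounds (x0, x1, y0, y1) of the corridor rectangle."""
    x0 = x1 = seed[0]
    y0 = y1 = seed[1]
    for _ in range(max_iter):
        grew = False
        if x0 > 0 and free[x0 - 1, y0:y1 + 1].all():
            x0 -= 1; grew = True
        if x1 < free.shape[0] - 1 and free[x1 + 1, y0:y1 + 1].all():
            x1 += 1; grew = True
        if y0 > 0 and free[x0:x1 + 1, y0 - 1].all():
            y0 -= 1; grew = True
        if y1 < free.shape[1] - 1 and free[x0:x1 + 1, y1 + 1].all():
            y1 += 1; grew = True
        if not grew:
            break
    return x0, x1, y0, y1
```

The resulting rectangle bounds translate directly into four linear (box) constraints for an MPC stage, which is what makes convex corridors cheap for solvers compared with general non-convex free-space descriptions.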
https://arxiv.org/abs/2512.13215
As battery technologies advance toward higher stability and energy density, the need for extensive cell-level testing across various component configurations becomes critical. To evaluate performance and understand the operating principles of batteries at laboratory scale, fabrication and evaluation of coin cells are essential processes. However, the conventional coin-cell assembly and testing processes require significant time and labor from researchers, posing challenges to high-throughput screening research. In this study, we introduce an Automated Li-ion BAttery Testing RObot SyStem (ALBATROSS), an automated system capable of electrolyte formulation, coin-cell assembly, and electrochemical evaluation. The system, integrated within an argon-filled glovebox, enables fully automated assembly and testing of up to 48 cells without researcher intervention. By incorporating a custom-designed robot gripper and 3D-printed structures optimized for precise cell handling, ALBATROSS achieved high assembly reliability, yielding a relative standard deviation (RSD) of less than 1.2% in discharge capacity and a standard deviation of less than 3 Ω in EIS measurements for NCM811||Li half cells. Owing to its high reliability and automation capability, ALBATROSS allows for the acquisition of high-quality coin-cell datasets, which are expected to accelerate the development of next-generation electrolytes.
https://arxiv.org/abs/2512.13198
Most path following and trajectory tracking algorithms in mobile robotics require the desired path or trajectory to be defined by at least twice continuously differentiable functions to guarantee key properties such as global convergence, especially for nonholonomic robots like unicycles with speed constraints. Consequently, these algorithms typically exclude continuous but non-differentiable paths, such as piecewise functions. Despite this exclusion, such paths provide convenient high-level inputs for describing robot missions or behavior. While techniques such as spline interpolation or optimization-based methods are commonly used to smooth non-differentiable paths or create feasible ones from sequences of waypoints, they either produce unnecessarily complex trajectories or are computationally expensive. In this work, we present a method to regularize non-differentiable functions and generate feasible paths through mollification. Specifically, we approximate an arbitrary path with a differentiable function that can converge to it with arbitrary precision. Additionally, we provide a systematic method for bounding the curvature of generated paths, which we demonstrate by applying it to paths resulting from linking a sequence of waypoints with segments. The proposed approach is computationally efficient, enabling real-time implementation on microcontrollers and compatibility with standard trajectory tracking and path following algorithms.
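Mollification replaces a non-differentiable function with its convolution against a smooth, normalized kernel; as the kernel width shrinks toward zero, the smoothed function converges to the original with arbitrary precision. A minimal numerical sketch on a sampled V-shaped corner, using a truncated Gaussian kernel (the kernel choice, discretization, and edge handling are illustrative, not the paper's construction):

```python
import numpy as np

def mollify(path, eps, ds):
    """Smooth a uniformly sampled path by convolving it with a normalized,
    truncated Gaussian mollifier of width `eps` (`ds` = sample spacing).
    Away from kinks the linear pieces are reproduced exactly; at a kink the
    corner is rounded, with error shrinking as eps -> 0."""
    half = int(3 * eps / ds)                      # truncate kernel at 3*eps
    t = np.arange(-half, half + 1) * ds
    kernel = np.exp(-((t / eps) ** 2))
    kernel /= kernel.sum()                        # normalize: preserves constants
    padded = np.pad(path, half, mode="edge")      # keep output length == input
    return np.convolve(padded, kernel, mode="valid")

# V-shaped corner |s|: continuous but non-differentiable at s = 0
s = np.linspace(-1.0, 1.0, 401)
corner = np.abs(s)
smooth = mollify(corner, eps=0.05, ds=s[1] - s[0])
```

The smoothed curve matches the original on the straight segments and rounds the apex by an amount on the order of `eps`, which is what allows the curvature of the regularized path to be bounded by the choice of kernel width.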
https://arxiv.org/abs/2512.13183
Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by the BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.
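The core loop can be sketched as follows: build an empirical sensitivity matrix of the KPI vector with respect to the weight parameters by finite differences across repetitions, then step the weights along its pseudo-inverse, in the spirit of a norm-optimal ILC update. The function names, the finite-difference scheme, the step size `alpha`, and the toy quadratic KPI below are illustrative assumptions, not the paper's controller:

```python
import numpy as np

def empirical_sensitivity(evaluate, w, delta=1e-2):
    """Finite-difference sensitivity of the KPI vector w.r.t. the weights.
    `evaluate(w)` runs one task repetition with weights w and returns the
    KPIs (tracking error, control effort, ...); no analytic derivatives
    through the NMPC solver are needed."""
    J0 = evaluate(w)
    S = np.zeros((len(J0), len(w)))
    for i in range(len(w)):
        wp = w.copy()
        wp[i] += delta
        S[:, i] = (evaluate(wp) - J0) / delta
    return J0, S

def ilc_weight_update(evaluate, w, alpha=0.5):
    """One norm-optimal-ILC-style iteration: step the weights along the
    pseudo-inverse of the empirical sensitivity to drive the KPIs down."""
    J0, S = empirical_sensitivity(evaluate, w)
    return w - alpha * np.linalg.pinv(S) @ J0

# Toy quadratic "plant": the KPI vector vanishes at w* = [1.0, 2.0]
target = np.array([1.0, 2.0])
kpis = lambda w: w - target
w = np.array([3.0, -1.0])
for _ in range(8):
    w = ilc_weight_update(kpis, w)
```

On this toy problem the sensitivity is the identity and the weights converge geometrically to `target`; in the real setting each `evaluate` call is a full task repetition, which is why the method's value lies in needing only a handful of online repetitions.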
https://arxiv.org/abs/2512.13170
Traversing terrains with sparse footholds, as legged animals do, presents a promising yet challenging task for quadruped robots, as it requires precise environmental perception and agile control to secure safe foot placement while maintaining dynamic stability. Model-based hierarchical controllers excel in laboratory settings, but suffer from limited generalization and overly conservative behaviors. End-to-end learning-based approaches unlock greater flexibility and adaptability, but existing state-of-the-art methods either rely on heightmaps that introduce noise and complex, costly pipelines, or implicitly infer terrain features from egocentric depth images, often missing accurate critical geometric cues and leading to inefficient learning and rigid gaits. To overcome these limitations, we propose START, a single-stage learning framework that enables agile, stable locomotion on highly sparse and randomized footholds. START leverages only low-cost onboard vision and proprioception to accurately reconstruct a local terrain heightmap, providing an explicit intermediate representation to convey essential features relevant to sparse foothold regions. This supports comprehensive environmental understanding and precise terrain assessment, reducing exploration cost and accelerating skill acquisition. Experimental results demonstrate that START achieves zero-shot transfer across diverse real-world scenarios, showcasing superior adaptability, precise foothold placement, and robust locomotion.
https://arxiv.org/abs/2512.13153
Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and for generating data adapted to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR, which restores images degraded by weather effects, and Point Cloud Restoration (PCR), which compensates for corrupted LiDAR data using image object cues. To tackle misalignment between the two modalities, we develop the Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
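The adaptive-fusion idea behind BAFAM, weighting each modality per BEV cell by its estimated reliability, can be sketched in a few lines. This is an illustrative toy (the quality scores, softmax gating, and cell-wise blending are assumptions, not the module's actual architecture): each BEV cell takes a convex combination of the camera and LiDAR features, with weights given by a softmax over per-cell reliability estimates.

```python
import numpy as np

def adaptive_fuse(cam_bev, lidar_bev, cam_quality, lidar_quality):
    """Blend two BEV feature maps cell-wise by estimated reliability.

    cam_bev, lidar_bev: (H, W) feature maps in a shared BEV frame.
    cam_quality, lidar_quality: (H, W) per-cell reliability logits,
    e.g. low for rain-degraded image regions or sparse LiDAR returns.
    """
    logits = np.stack([cam_quality, lidar_quality])
    # Numerically stable softmax over the two modalities per cell.
    weights = np.exp(logits - logits.max(axis=0))
    weights /= weights.sum(axis=0)
    return weights[0] * cam_bev + weights[1] * lidar_bev
```

With equal reliability the output is a plain average; when one modality is heavily degraded (very low logits), the fusion collapses toward the other, which is the behavior an adverse-weather fusion module needs.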
https://arxiv.org/abs/2512.13107
Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments -- robot arm and gripper combinations -- across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85\% of its real data, which risks overfitting to robot--scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that state-of-the-art generalist policies such as OpenVLA and $\pi_0$ benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot--gripper combinations across four real-world manipulation tasks. Project website: this https URL.
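One practical consequence of the imbalance described above (four robot types covering over 85% of OXE's real data) is that training pipelines often reweight samples so that rare embodiments are not drowned out. As a hedged illustration (inverse-frequency weighting is a standard remedy, not necessarily the scheme used with OXE-AugE), the sketch below assigns each trajectory a sampling weight so every embodiment contributes equally in expectation:

```python
from collections import Counter

def embodiment_sampling_weights(embodiment_labels):
    """Per-sample weights that equalize contribution across embodiments.

    embodiment_labels: one label per trajectory, e.g. "franka", "ur5".
    Returns a dict mapping each embodiment to a per-sample weight such
    that all weights sum to 1 and each embodiment's total share is equal.
    """
    counts = Counter(embodiment_labels)
    n_types = len(counts)
    # Inverse-frequency: an embodiment with many trajectories gets a
    # proportionally smaller weight per trajectory.
    return {emb: 1.0 / (n_types * c) for emb, c in counts.items()}
```

Under such weighting, an 85%-dominant embodiment no longer dominates the gradient signal, which is the same failure mode the augmented OXE-AugE dataset addresses by adding data for underrepresented arms and grippers.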
https://arxiv.org/abs/2512.13100