Recent advances in robot learning have generated significant interest in capable platforms that may eventually approach human-level competence. This interest, combined with the commoditization of actuators, has propelled growth in low-cost robotic platforms. However, the optimal form factor for mobile manipulation, especially on a budget, remains an open question. We introduce YOR, an open-source, low-cost mobile manipulator that integrates an omnidirectional base, a telescopic vertical lift, and two arms with grippers to achieve whole-body mobility and manipulation. Our design emphasizes modularity, ease of assembly using off-the-shelf components, and affordability, with a bill-of-materials cost under 10,000 USD. We demonstrate YOR's capability by completing tasks that require coordinated whole-body control, bimanual manipulation, and autonomous navigation. Overall, YOR offers competitive functionality for mobile manipulation research at a fraction of the cost of existing platforms. Project website: this https URL
https://arxiv.org/abs/2602.11150
Humanoid locomotion has advanced rapidly with deep reinforcement learning (DRL), enabling robust legged traversal over uneven terrain. Yet platforms taller than leg length remain largely out of reach, because current RL training paradigms often converge to jumping-like solutions that are high-impact, torque-limited, and unsafe for real-world deployment. To address this gap, we propose APEX, a system for perceptive, climbing-based high-platform traversal that composes terrain-conditioned behaviors: climb-up and climb-down at vertical edges, walking or crawling on the platform, and stand-up and lie-down for posture reconfiguration. Central to our approach is a generalized ratchet progress reward for learning contact-rich, goal-reaching maneuvers. It tracks the best-so-far task progress and penalizes non-improving steps, providing dense yet velocity-free supervision that enables efficient exploration under strong safety regularization. Based on this formulation, we train LiDAR-based full-body maneuver policies and reduce the sim-to-real perception gap through a dual strategy: modeling mapping artifacts during training and applying filtering and inpainting to elevation maps during deployment. Finally, we distill all six skills into a single policy that autonomously selects behaviors and transitions based on local geometry and commands. Experiments on a 29-DoF Unitree G1 humanoid demonstrate zero-shot sim-to-real traversal of 0.8-meter platforms (approximately 114% of leg length), with robust adaptation to platform height and initial pose, as well as smooth and stable multi-skill transitions.
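The ratchet mechanism described above (track the best-so-far progress, reward only improvement, penalize non-improving steps) can be sketched in a few lines. The class name, the scalar progress signal `phi`, and the constant penalty below are illustrative assumptions, not the paper's exact formulation:

```python
class RatchetProgressReward:
    """Sketch of a ratchet-style progress reward: dense but velocity-free."""

    def __init__(self, penalty: float = 0.01):
        self.penalty = penalty     # cost of a non-improving step (assumed constant)
        self.best = float("-inf")  # best-so-far task progress

    def reset(self, phi0: float = 0.0) -> None:
        self.best = phi0

    def __call__(self, phi: float) -> float:
        # Reward only improvement over the best progress seen so far
        # (the "ratchet"); otherwise apply a small penalty.
        gain = phi - self.best
        if gain > 0.0:
            self.best = phi
            return gain
        return -self.penalty
```

Because the reward depends on cumulative best progress rather than instantaneous velocity, the policy is free to pause or reposition (useful for contact-rich climbing) without being pushed toward high-speed, high-impact solutions.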
https://arxiv.org/abs/2602.11143
We study the \textit{min-sum uniform coverage} problem for a swarm of $n$ mobile robots on a given finite line segment and on a circle of finite positive radius, where the circle is given as input. The robots must coordinate their movements to reach a uniformly spaced configuration that minimizes the total distance traveled by all robots. The robots are autonomous, anonymous, identical, and homogeneous, and operate under the \textit{Look-Compute-Move} (LCM) model with \textit{non-rigid} motion controlled by a fair asynchronous scheduler. They are oblivious and silent, possessing neither persistent memory nor a means of explicit communication. In the \textbf{line-segment setting}, the robots must be placed at uniformly spaced points along the segment so as to minimize the total distance traveled. In the \textbf{circle setting}, the robots have to arrange themselves uniformly around the given circle to form a regular $n$-gon; there is no fixed orientation or designated starting vertex, and the goal is again to minimize the total distance traveled. We present a deterministic distributed algorithm that achieves uniform coverage in the line-segment setting with minimum total movement cost. For the circle setting, we characterize all initial configurations for which the \textit{min-sum uniform coverage} problem is deterministically unsolvable under the considered robot model. For all remaining configurations, we provide a deterministic distributed algorithm that achieves uniform coverage while minimizing the total distance traveled. These results characterize the deterministic solvability of min-sum coverage for oblivious robots and achieve optimal cost whenever solvable.
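For the line-segment setting, a minimal centralized sketch illustrates why the min-sum target is computable in closed form: on a line, the order-preserving matching (sorted robots to sorted targets) minimizes total movement. The choice of endpoints-included spacing is one possible reading of "uniformly spaced", and this omniscient sketch is far less constrained than the paper's distributed, oblivious LCM algorithm:

```python
def min_sum_uniform_coverage(positions, length):
    """Assign robots on [0, length] to uniformly spaced points, minimizing
    total distance traveled (centralized sketch, not the LCM algorithm)."""
    n = len(positions)
    if n == 1:
        targets = [length / 2.0]  # arbitrary convention for a single robot
    else:
        targets = [length * i / (n - 1) for i in range(n)]
    # On a line, matching sorted robot positions to sorted targets is optimal
    # for the sum of travel distances (crossing assignments never help).
    order = sorted(range(n), key=lambda i: positions[i])
    assignment = [0.0] * n
    cost = 0.0
    for rank, i in enumerate(order):
        assignment[i] = targets[rank]
        cost += abs(positions[i] - targets[rank])
    return assignment, cost
```

For example, robots at 0.9, 0.1, and 0.5 on a unit segment map to targets 1.0, 0.0, and 0.5 respectively, for a total cost of 0.2.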
https://arxiv.org/abs/2602.11125
Accurate localization of maritime targets by unmanned aerial vehicles (UAVs) remains challenging in GPS-denied environments. UAVs equipped with gimballed electro-optical sensors are typically used to localize targets; however, reliance on these sensors increases mechanical complexity, cost, and susceptibility to single-point failures, limiting scalability and robustness in multi-UAV operations. This work presents a new trajectory optimization framework that enables cooperative target localization using UAVs with fixed, non-gimballed cameras operating in coordination with a surface vessel. This estimation-aware optimization generates dynamically feasible trajectories that explicitly account for mission constraints, platform dynamics, and out-of-frame events. Estimation-aware trajectories outperform heuristic paths by reducing localization error by more than a factor of two, motivating their use in cooperative operations. Results further demonstrate that coordinated UAVs with fixed, non-gimballed cameras achieve localization accuracy that meets or exceeds that of single gimballed systems, while substantially lowering system complexity and cost, enabling scalability, and enhancing mission resilience.
https://arxiv.org/abs/2602.11116
We present a novel receding-horizon multi-contact motion planner for legged robots in challenging scenarios, able to plan motions such as chimney climbing, navigating very narrow passages, or crossing large gaps. Our approach adds new capabilities to the state of the art, including the ability to reactively re-plan in response to new information, and to plan contact locations and whole-body trajectories simultaneously, simplifying the implementation and removing the need for post-processing or complex multi-stage approaches. Our method is more resistant to local-minima problems than other potential-field-based approaches, and our quadratic-program-based posture generator returns nodes more quickly than those of existing algorithms. Rigorous statistical analysis shows that, with short planning horizons (e.g., one step ahead), our planner is faster than the state-of-the-art across all scenarios tested (between 45% and 98% faster on average, depending on the scenario), while planning less efficient motions (requiring between 5% fewer and 700% more stance changes on average). In all but one scenario (Chimney Walking), longer planning horizons (e.g., four steps ahead) increased average planning times (between 73% faster and 400% slower than the state-of-the-art) but resulted in higher quality motion plans (between 8% more and 47% fewer stance changes than the state-of-the-art).
https://arxiv.org/abs/2602.11113
Characterization of fragmented rock piles is a fundamental task in the mining and quarrying industries, where rock is fragmented by blasting, transported using wheel loaders, and then sent for further processing. This field report studies a novel method for estimating the relative particle size of fragmented rock piles from only proprioceptive data collected while digging with a wheel loader. Rather than employ exteroceptive sensors (e.g., cameras or LiDAR sensors) to estimate rock particle sizes, the studied method infers rock fragmentation from an excavator's inertial response during excavation. This paper expands on research that postulated the use of wavelet analysis to construct a unique feature that is proportional to the level of rock fragmentation. We demonstrate through extensive field experiments that the ratio of wavelet features, constructed from data obtained by excavating in different rock piles with different size distributions, approximates the ratio of the mean particle size of the two rock piles. Full-scale excavation experiments were performed with a battery electric, 18-tonne capacity, load-haul-dump (LHD) machine in representative conditions in an operating quarry. The relative particle size estimates generated with the proposed sensing methodology are compared with those obtained from both a vision-based fragmentation analysis tool and from sieving of sampled materials.
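One way to make the wavelet-feature idea concrete is a detail-energy feature computed from a proprioceptive dig signal. The Haar basis, the number of decomposition levels, and the energy definition below are illustrative assumptions rather than the paper's exact feature construction:

```python
import numpy as np

def haar_detail_energy(signal, levels=3):
    """Summed detail-coefficient energy of a Haar wavelet decomposition
    (illustrative proprioceptive feature; the paper's feature may differ)."""
    x = np.asarray(signal, dtype=float)
    energy = 0.0
    for _ in range(levels):
        if x.size < 2:
            break
        if x.size % 2:
            x = x[:-1]  # drop odd tail sample
        approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        energy += float(np.sum(detail ** 2))
        x = approx
    return energy

def feature_ratio(dig_a, dig_b, levels=3):
    """Ratio of wavelet features from two digs; per the paper's central claim,
    such a ratio approximates the ratio of the piles' mean particle sizes."""
    return haar_detail_energy(dig_a, levels) / haar_detail_energy(dig_b, levels)
```

A rougher inertial response (larger high-frequency excursions while digging coarse material) yields a larger detail energy, so the ratio of features across two piles tracks their relative roughness.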
https://arxiv.org/abs/2602.11082
Despite sustained scaling of model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment resets. To bridge this gap, we present RISE, a scalable framework for robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi-view futures via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for policy improvement. This compositional design allows state and value to be handled by distinct, best-suited architectures and objectives. These components are integrated into a closed-loop self-improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imagination space without costly physical interaction. Across three challenging real-world tasks, RISE yields significant improvements over prior art: more than +35% absolute performance in dynamic brick sorting, +45% in backpack packing, and +35% in box closing.
https://arxiv.org/abs/2602.11075
Ensuring safe robot operation in cluttered and dynamic environments remains a fundamental challenge. While control barrier functions provide an effective framework for real-time safety filtering, their performance critically depends on the underlying geometric representation, which is often simplified, leading to either overly conservative behavior or insufficient collision coverage. Superquadrics (SQs) offer an expressive way to model complex shapes using a few primitives and are increasingly used for robot safety. To integrate this representation into collision avoidance, most existing approaches directly use their implicit functions as barrier candidates. However, we identify a critical but overlooked issue in this practice: the gradients of the implicit SQ function can become severely ill-conditioned, potentially rendering the optimization infeasible and undermining reliable real-time safety filtering. To address this issue, we formulate an SQ-based safety filtering framework that uses signed distance functions (SDFs) as barrier candidates. Since analytical SDFs are unavailable for general SQs, we compute distances using the efficient Gilbert-Johnson-Keerthi (GJK) algorithm and obtain gradients via randomized smoothing. Extensive simulation and real-world experiments demonstrate consistent collision-free manipulation in cluttered and unstructured scenes, showing robustness to challenging geometries, sensing noise, and dynamic disturbances, while improving task efficiency in teleoperation tasks. These results highlight a pathway toward safety filters that remain precise and reliable under the geometric complexity of real-world environments.
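The randomized-smoothing step can be sketched independently of the GJK distance query: given any black-box distance function, a zeroth-order Monte-Carlo estimator recovers the gradient of a Gaussian-smoothed surrogate. The estimator form, smoothing scale, and sample count below are assumptions of this sketch, not the paper's tuned values:

```python
import numpy as np

def smoothed_gradient(dist_fn, x, sigma=0.01, n_samples=2000, seed=0):
    """Zeroth-order gradient estimate of a (possibly non-smooth) distance
    function via Gaussian randomized smoothing (sketch)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    d0 = dist_fn(x)  # baseline value, used as a variance-reduction control variate
    vals = np.array([dist_fn(x + e) for e in eps])
    # Monte-Carlo estimate of grad_x E[d(x + eps)] ~ E[(d(x+eps) - d(x)) * eps] / sigma^2
    return ((vals - d0)[:, None] * eps).mean(axis=0) / sigma ** 2
```

For a Euclidean point-to-origin distance, the estimate recovers the unit direction away from the origin; the same call works for a GJK distance between convex bodies, where analytic gradients are unavailable or ill-conditioned.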
https://arxiv.org/abs/2602.11049
Flight control for autonomous micro aerial vehicles (MAVs) is evolving from steady flight near equilibrium points toward more aggressive aerobatic maneuvers, such as flips, rolls, and power loops. Although reinforcement learning (RL) has shown great potential in these tasks, conventional RL methods often suffer from low data efficiency and limited generalization. This challenge becomes more pronounced in multi-task scenarios where a single policy is required to master multiple maneuvers. In this paper, we propose a novel end-to-end multi-task reinforcement learning framework, called GEAR (Geometric Equivariant Aerobatics Reinforcement), which fully exploits the inherent SO(2) rotational symmetry in MAV dynamics and explicitly incorporates this property into the policy network architecture. By integrating an equivariant actor network, FiLM-based task modulation, and a multi-head critic, GEAR achieves both efficiency and flexibility in learning diverse aerobatic maneuvers, enabling a data-efficient, robust, and unified framework for aerobatic control. GEAR attains a 98.85\% success rate across various aerobatic tasks, significantly outperforming baseline methods. In real-world experiments, GEAR demonstrates stable execution of multiple maneuvers and the capability to combine basic motion primitives to complete complex aerobatics.
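The SO(2) yaw symmetry that GEAR builds into its actor can be illustrated with a tiny wrapper: express planar quantities in the body frame, apply any policy, and rotate the result back to the world frame. This wrapper is a hypothetical stand-in for GEAR's actual equivariant network; what it demonstrates is the equivariance property itself, namely that rotating the observation and yaw together rotates the action by the same angle:

```python
import numpy as np

def rot2(theta):
    """2x2 rotation matrix for a planar (yaw) rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariant_policy(obs_xy, yaw, policy_fn):
    """SO(2)-equivariant wrapper (sketch): body frame in, world frame out."""
    R = rot2(yaw)
    return R @ policy_fn(R.T @ obs_xy)
```

Because the inner policy only ever sees body-frame inputs, it cannot depend on absolute heading, so a maneuver learned at one yaw transfers to all yaws for free.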
https://arxiv.org/abs/2602.10997
Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve this generalization bottleneck, we introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. The framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in massive out-of-distribution scenarios; with the guidance generated by the world model, the performance of the same-structured VLA in novel scenarios improves from 14% to 69%. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: this https URL
https://arxiv.org/abs/2602.10983
VLA models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial--physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with the expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark toward reliable and generalizable real-world evaluation of VLA models.
https://arxiv.org/abs/2602.10980
In this paper, we derive the continuous space-time equations of motion of a three-dimensional geometrically exact rod, or the Cosserat rod, incorporating planar cross-sectional deformation. We then adopt the Lie group variational integrator technique to obtain a discrete model of the rod incorporating both rotational motion and cross-sectional deformation. The resulting discrete model possesses several desirable features: it ensures volume conservation of the discrete elements by accounting for cross-sectional deformation through a local dilatation factor, and it exhibits the beneficial properties associated with variational integrators, such as preservation of the rotational configuration and energy conservation with bounded error. An exhaustive set of numerical results under various initial conditions of the rod demonstrates the efficacy of the model in replicating the physics of the system.
https://arxiv.org/abs/2602.10963
Standard geometric control relies on force-moment decoupling, an assumption that breaks down in many aerial platforms due to spurious forces naturally induced by control moments. While strategies for such coupled systems have been validated experimentally, a rigorous theoretical certification of their stability is currently missing. This work fills this gap by providing the first formal stability analysis for a generic class of floating rigid bodies subject to spurious forces. We introduce a canonical model and construct a Lyapunov-based proof establishing local exponential stability of the hovering equilibrium. Crucially, the analysis explicitly addresses the structural challenges, specifically the induced non-minimum-phase behavior, that prevent the application of standard cascade arguments.
https://arxiv.org/abs/2602.10961
During multi-party interactions, gaze direction is a key indicator of interest and intent, making it essential for social robots to direct their attention appropriately. Understanding the social context is crucial for robots to engage effectively, predict human intentions, and navigate interactions smoothly. This study aims to develop an empirical motion-time pattern for human gaze behavior in various social situations (e.g., entering, leaving, waving, talking, and pointing) using deep neural networks trained on participants' data. We created two video clips, one for a computer screen and another for a virtual reality headset, depicting different social scenarios. Data were collected from 30 participants: 15 using an eye tracker and 15 using an Oculus Quest 1 headset. Deep learning models, specifically Long Short-Term Memory (LSTM) networks and Transformers, were used to analyze and predict gaze patterns. Our models achieved 60% accuracy in predicting gaze direction in a 2D animation and 65% accuracy in a 3D animation. The best model was then implemented on the Nao robot, and 36 new participants evaluated its performance. The feedback indicated overall satisfaction, with those experienced in robotics rating the models more favorably.
https://arxiv.org/abs/2602.10946
This study centers around the design and implementation of the Maya Robot, a portable elephant-shaped social robot, intended to engage with children undergoing cancer treatment. Initial efforts were devoted to enhancing the robot's facial expression recognition accuracy, achieving a 98% accuracy through deep neural networks. Two subsequent preliminary exploratory experiments were designed to advance the study's objectives. The first experiment aimed to compare pain levels experienced by children during the injection process, with and without the presence of the Maya robot. Twenty-five children, aged 4 to 9, undergoing cancer treatment participated in this counterbalanced study. The paired t-test results revealed a significant reduction in perceived pain when the robot was actively present in the injection room. The second experiment sought to assess perspectives of hospitalized children and their mothers during engagement with Maya through a game. Forty participants, including 20 children aged 4 to 9 and their mothers, were involved. Following these human-Maya interactions, UTAUT questionnaire results indicated that children experienced significantly less anxiety than their parents during the interaction and game play. Notably, children exhibited higher trust levels in both the robot and the games, presenting a statistically significant difference in trust levels compared to their parents (P-value < 0.05). This preliminary exploratory study highlights the positive impact of utilizing Maya as an assistant for therapy/education in a clinical setting, particularly benefiting children undergoing cancer treatment. The findings underscore the potential of social robots in pediatric healthcare contexts, emphasizing improved pain management and emotional well-being among young patients.
https://arxiv.org/abs/2602.10942
Autonomous mobile robots offer promising solutions for labor shortages and increased operational efficiency. However, navigating safely and effectively in dynamic environments, particularly crowded areas, remains challenging. This paper proposes a novel framework that integrates Vision-Language Models (VLM) and Gaussian Process Regression (GPR) to generate dynamic crowd-density maps ("Abstraction Maps") for autonomous robot navigation. Our approach utilizes VLM's capability to recognize abstract environmental concepts, such as crowd densities, and represents them probabilistically via GPR. Experimental results from real-world trials on a university campus demonstrated that robots successfully generated routes avoiding both static obstacles and dynamic crowds, enhancing navigation safety and adaptability.
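A minimal numpy sketch of the GPR step: interpolate sparse crowd-density observations (which in the paper's pipeline would come from the VLM) into a smooth map with uncertainty. The RBF kernel, unit signal variance, and the hypothetical length-scale and noise values are assumptions of this sketch:

```python
import numpy as np

def gp_predict(X_train, y_train, X_query, length_scale=2.0, noise=1e-2):
    """GP regression with an RBF kernel over 2-D map coordinates (sketch).
    Returns the predictive mean and variance at the query points."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))  # noisy Gram matrix
    Ks = rbf(X_query, X_train)                                # cross-covariance
    mean = Ks @ np.linalg.solve(K, y_train)
    # Predictive variance under a unit-variance RBF prior.
    var = 1.0 - np.einsum("ij,ij->i", Ks, np.linalg.solve(K, Ks.T).T)
    return mean, var
```

Evaluating `gp_predict` on a dense grid of map coordinates yields the continuous crowd-density field that a planner can then treat as a traversal cost, with the variance marking unobserved regions.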
https://arxiv.org/abs/2602.10910
This study presents the development and experimental verification of a biomimetic manta ray robot for underwater autonomous exploration. Inspired by manta rays, the robot uses flapping motion for propulsion to minimize seabed disturbance and enhance efficiency compared to traditional screw propulsion. The robot features pectoral fins driven by servo motors and a streamlined control box to reduce fluid resistance. The control system, powered by a Raspberry Pi 3B, includes an IMU and pressure sensor for real-time monitoring and control. Experiments in a pool assessed the robot's swimming and diving capabilities. Results show stable swimming and diving motions with PD control. The robot is suitable for applications in environments like aquariums and fish nurseries, requiring minimal disturbance and efficient maneuverability. Our findings demonstrate the potential of bio-inspired robotic designs to improve ecological monitoring and underwater exploration.
https://arxiv.org/abs/2602.10904
Cross-domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where the collection of expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data, but often unstable. We introduce the Semi-Supervised CDIL (SS-CDIL) setting and propose the first algorithm for SS-CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross-domain loss function for learning inter-domain state-action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data-efficient policy learning with minimal supervision. Our code is available at this https URL.
https://arxiv.org/abs/2602.10793
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.
https://arxiv.org/abs/2602.10717
Operating drones in urban environments often means they need to land on rooftops, which can have different geometries and surface irregularities. Accurately detecting roof inclination using conventional sensing methods, such as vision-based or acoustic techniques, can be unreliable, as measurement quality is strongly influenced by external factors including weather conditions and surface materials. To overcome these challenges, we propose a novel unmanned aerial manipulator (UAM) morphology featuring a dual-arm aerial manipulator with an omnidirectional 3D workspace and extended reach. Building on this design, we develop a proprioceptive contact detection and contact localization strategy based on a momentum-based torque observer. This enables the UAM to infer the inclination of slanted surfaces blindly, through physical interaction, prior to touchdown. We validate the approach in flight experiments, demonstrating robust landings on surfaces with inclinations of up to 30.5 degrees and achieving an average surface inclination estimation error of 2.87 degrees over 9 experiments at different incline angles.
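The momentum-based observer at the heart of the contact-detection strategy can be sketched as a first-order residual filter. The discrete update, the single-DoF scalar form, and the assumption that known internal terms (actuation minus gravity/Coriolis effects) are lumped into `tau_int` are simplifications of this sketch; the paper's exact multi-DoF formulation may differ:

```python
def momentum_observer_step(tau_hat, p, p_prev, tau_int, dt, gain):
    """One step of a first-order momentum-based external-torque observer
    (sketch). `p` is the generalized momentum, so p_dot = tau_int + tau_ext;
    the residual between the measured momentum rate and the explained torque
    is filtered into an estimate of the external (contact) torque."""
    p_dot = (p - p_prev) / dt              # finite-difference momentum rate
    residual = p_dot - tau_int - tau_hat   # torque not yet explained
    return tau_hat + gain * residual * dt  # first-order filter toward tau_ext
```

Thresholding the converged estimate flags contact, and comparing estimates across the two arms' contact points is what allows the surface inclination to be inferred without vision.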
https://arxiv.org/abs/2602.10703