We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. The spatial conditioning scheme is adaptive and customizable by design: different conditional inputs can be weighted differently at different spatial locations. This enables highly controllable world generation and supports various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
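As a concrete picture of the adaptive spatial conditioning described above, here is a minimal NumPy sketch (an illustration, not the released Cosmos-Transfer code) of blending several control branches with per-pixel weights; the branch features and weight maps are hypothetical placeholders, and in the actual model such weights would condition a diffusion backbone.

```python
# Sketch: fuse multiple spatial control branches with per-pixel weights.
import numpy as np

def fuse_controls(branch_feats: dict, weight_maps: dict) -> np.ndarray:
    """Blend control-branch features (H, W, C) with per-pixel weights (H, W)."""
    keys = list(branch_feats)
    # Normalize weights across modalities at every spatial location.
    w = np.stack([weight_maps[k] for k in keys])          # (M, H, W)
    w = w / np.clip(w.sum(axis=0, keepdims=True), 1e-8, None)
    f = np.stack([branch_feats[k] for k in keys])         # (M, H, W, C)
    return (w[..., None] * f).sum(axis=0)                 # (H, W, C)

# Example: emphasize depth on the left half, segmentation on the right.
H, W, C = 4, 6, 8
feats = {"depth": np.random.rand(H, W, C), "seg": np.random.rand(H, W, C)}
weights = {"depth": np.zeros((H, W)), "seg": np.zeros((H, W))}
weights["depth"][:, : W // 2] = 1.0
weights["seg"][:, W // 2 :] = 1.0
fused = fuse_controls(feats, weights)
print(fused.shape)  # (4, 6, 8)
```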
https://arxiv.org/abs/2503.14492
We investigated the performance of existing semi- and fully autonomous methods for controlling flipper-based skid-steer robots. Our study reimplements these methods for a fair comparison and introduces a novel semi-autonomous control policy that offers a compelling trade-off among current state-of-the-art approaches. We also propose new metrics for assessing cognitive load and traversal quality and offer a benchmarking interface for generating Quality-Load graphs from recorded data. Our results, presented in a 2D Quality-Load space, demonstrate that the new control policy effectively bridges the gap between autonomous and manual control methods. Additionally, we reveal the surprising fact that fully manual, continuous control of all six degrees of freedom remains highly effective when performed by an experienced operator on a well-designed analog controller from a third-person view.
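To make the Quality-Load graph concrete, the sketch below plots one (quality, load) point per recorded run. The metric definitions (quality as traversal quality per run, load as fraction of time with operator input) and all numbers are illustrative assumptions, not the paper's exact formulas.

```python
# Sketch: a Quality-Load graph from recorded runs (placeholder data).
import matplotlib.pyplot as plt

runs = {
    "manual":     [(0.82, 0.95), (0.78, 0.90)],
    "semi-auto":  [(0.80, 0.45), (0.75, 0.50)],
    "autonomous": [(0.60, 0.05), (0.55, 0.02)],
}  # (quality, load) per recorded run; made-up numbers

for policy, points in runs.items():
    qualities, loads = zip(*points)
    plt.scatter(loads, qualities, label=policy)
plt.xlabel("cognitive load (fraction of time with operator input)")
plt.ylabel("traversal quality")
plt.legend()
plt.savefig("quality_load.png")
```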
https://arxiv.org/abs/2503.14389
Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulated and real scenarios.
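The abstract does not spell out the HPO objective, so the sketch below uses a DPO-style pairwise loss over grasp-pose log-probabilities as one plausible stand-in for aligning with positive and negative feedback.

```python
# Sketch of a preference-alignment step over grasp poses: a DPO-style loss,
# -log sigmoid(beta * (logp_preferred - logp_rejected)), as an assumed
# stand-in for HPO's (unstated) objective.
import torch
import torch.nn.functional as F

def preference_loss(logp_pos: torch.Tensor,   # log-prob of preferred grasps
                    logp_neg: torch.Tensor,   # log-prob of rejected grasps
                    beta: float = 0.1) -> torch.Tensor:
    return -F.logsigmoid(beta * (logp_pos - logp_neg)).mean()

logp_pos = torch.randn(32, requires_grad=True)
logp_neg = torch.randn(32, requires_grad=True)
loss = preference_loss(logp_pos, logp_neg)
loss.backward()   # gradients flow back into the grasp generator
print(float(loss))
```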
https://arxiv.org/abs/2503.14329
We address prehensile pushing, the problem of manipulating a grasped object by pushing against the environment. Our solution is an efficient nonlinear trajectory optimization problem, relaxed from an exact mixed-integer nonlinear trajectory optimization formulation. The critical insight is recasting the external pushers (the environment) as a discrete probability distribution instead of binary variables and minimizing the entropy of that distribution. The probabilistic reformulation allows all pushers to be used simultaneously, but at the optimum the probability mass concentrates onto one due to the entropy minimization. We numerically compare our method against a state-of-the-art sampling-based baseline on a prehensile pushing task. The results demonstrate that our method finds trajectories 8 times faster and at a 20 times lower cost than the baseline. Finally, we demonstrate that a simulated and a real Franka Panda robot can successfully manipulate different objects following the trajectories proposed by our method. Supplementary materials are available at this https URL.
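A toy version of the key relaxation: replace the binary pusher-selection variables with a softmax distribution and penalize its entropy, so the optimizer can blend all pushers during the search yet concentrates the mass onto one at the optimum. The costs below are made up and the full trajectory optimization is omitted.

```python
# Toy illustration (not the paper's full optimizer): entropy-penalized
# relaxation of a binary pusher choice into a probability distribution.
import numpy as np
from scipy.optimize import minimize

costs = np.array([3.0, 1.5, 2.2])   # hypothetical per-pusher contact costs
lam = 0.5                           # entropy penalty weight

def objective(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return p @ costs + lam * entropy

res = minimize(objective, x0=np.zeros(3), method="Nelder-Mead")
p = np.exp(res.x - res.x.max()); p /= p.sum()
print(np.round(p, 3))   # mass concentrates on the cheapest pusher (index 1)
```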
https://arxiv.org/abs/2503.14268
Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. Instead, we propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, which further improves the results.
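A minimal sketch of what a quantization-free, GIVT-style policy head can look like: the transformer's hidden state parameterizes a Gaussian mixture over continuous actions, trained by negative log-likelihood. Sizes and the backbone are assumptions; the mixture's temperature and component truncation are the kinds of sampling knobs such a policy exposes.

```python
# Sketch: continuous GMM action head replacing discrete action tokens.
import torch
from torch import nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GMMActionHead(nn.Module):
    def __init__(self, d_model: int, action_dim: int, n_comp: int = 8):
        super().__init__()
        self.n_comp, self.action_dim = n_comp, action_dim
        self.proj = nn.Linear(d_model, n_comp * (1 + 2 * action_dim))

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        out = self.proj(h).view(*h.shape[:-1], self.n_comp,
                                1 + 2 * self.action_dim)
        logits, mu, log_sigma = torch.split(
            out, [1, self.action_dim, self.action_dim], dim=-1)
        comp = Independent(Normal(mu, log_sigma.exp()), 1)
        return MixtureSameFamily(Categorical(logits=logits.squeeze(-1)), comp)

head = GMMActionHead(d_model=256, action_dim=7)
h = torch.randn(4, 256)              # hidden state from an AR transformer
dist = head(h)
a = dist.sample()                    # (4, 7) continuous actions
nll = -dist.log_prob(a).mean()       # imitation-learning loss term
print(a.shape, float(nll))
```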
https://arxiv.org/abs/2503.14259
This paper introduces a chain-driven, sandwich-legged, mid-size quadruped robot designed as an accessible research platform. The design prioritizes enhanced locomotion capabilities, improved reliability and safety of the actuation system, and simplified, cost-effective manufacturing processes. Locomotion performance is optimized through a sandwiched leg design and a dual-motor configuration, reducing leg inertia for agile movements. Reliability and safety are achieved by integrating robust cable strain reliefs, efficient heat sinks for motor thermal management, and mechanical limits to restrict leg motion. Simplified design considerations include a quasi-direct drive (QDD) actuator and the adoption of low-cost fabrication techniques, such as laser cutting and 3D printing, to minimize cost and ensure rapid prototyping. The robot weighs approximately 25 kg and is developed at a cost under $8,000, making it a scalable and affordable solution for robotics research. Experimental validations demonstrate the platform's capability to execute trot and crawl gaits on flat terrain and slopes, highlighting its potential as a versatile and reliable quadruped research platform.
https://arxiv.org/abs/2503.14255
With the increasing demand for efficient and flexible robotic exploration solutions, Reinforcement Learning (RL) is becoming a promising approach in the field of autonomous robotic exploration. However, current RL-based exploration algorithms often face limited environmental reasoning capabilities, slow convergence rates, and substantial challenges in Sim-To-Real (S2R) transfer. To address these issues, we propose a Curriculum Learning-based Transformer Reinforcement Learning Algorithm (CTSAC) aimed at improving both exploration efficiency and transfer performance. To enhance the robot's reasoning ability, a Transformer is integrated into the perception network of the Soft Actor-Critic (SAC) framework, leveraging historical information to improve the farsightedness of the strategy. A periodic-review-based curriculum learning strategy is proposed, which enhances training efficiency while mitigating catastrophic forgetting during curriculum transitions. Training is conducted on the ROS-Gazebo continuous robotic simulation platform, with LiDAR clustering optimization to further reduce the S2R gap. Experimental results demonstrate that the CTSAC algorithm outperforms state-of-the-art non-learning and learning-based algorithms in terms of success rate and success-rate-weighted exploration time. Moreover, real-world experiments validate the strong S2R transfer capabilities of CTSAC.
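A hedged sketch of what a periodic-review curriculum schedule might look like: train mostly on the current stage, but periodically revisit earlier stages to counter catastrophic forgetting. The review period and selection rule below are assumptions, not the paper's exact schedule.

```python
# Sketch: periodic-review curriculum stage selection.
import random

def sample_stage(current: int, step: int, review_every: int = 5,
                 rng: random.Random = random.Random(0)) -> int:
    """Return the curriculum stage to train on at this step."""
    if current > 0 and step % review_every == 0:
        return rng.randrange(current)   # review a past, easier stage
    return current                      # otherwise train on the frontier

history = [sample_stage(current=3, step=s) for s in range(12)]
print(history)  # mostly 3, with periodic reviews of stages 0-2
```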
https://arxiv.org/abs/2503.14254
This paper presents GeoFlow-SLAM, a robust and effective tightly-coupled RGBD-inertial SLAM for legged robots operating in highly dynamic environments. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching and pose initialization failures during fast locomotion, and visual feature scarcity in texture-less environments. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is introduced for the first time to improve robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) performance on collected legged-robot datasets and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at this https URL
https://arxiv.org/abs/2503.14247
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents; 4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.
https://arxiv.org/abs/2503.14229
Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For small-sized persons, we focus on the geometry of the panoramas. Conventional detection methods tend to focus on larger people because larger people yield larger salient areas in the feature maps. In equirectangular panoramic images, we find that a person's height decreases linearly toward the top of the image. Using this finding, we leverage the significance values and aggregate tokens sorted by these values to balance the salient areas. In this process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similarity figures, which enable optimal divisions without gaps, and we keep the maximum significance value in each tile of the token groups to preserve the salient areas of smaller people. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping with this tokenization procedure. Extensive experiments demonstrate that our method outperforms conventional methods when applied to large-scale datasets.
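Two of the steps above can be sketched directly: unwrapping the overhead fisheye image into a panorama (here with OpenCV's warpPolar; the exact orientation depends on camera mounting) and keeping the per-tile maximum significance so small people near the top are not drowned out. The plain grid tiling is a simplification of the paper's self-similarity division, and the significance map is a random stand-in.

```python
# Sketch: fisheye-to-panorama remapping plus per-tile max significance.
import cv2
import numpy as np

fisheye = np.random.randint(0, 255, (800, 800, 3), dtype=np.uint8)  # stand-in
h, w = fisheye.shape[:2]
polar = cv2.warpPolar(fisheye, (256, 1024), (w / 2, h / 2),
                      maxRadius=w / 2, flags=cv2.WARP_POLAR_LINEAR)
pano = cv2.rotate(polar, cv2.ROTATE_90_CLOCKWISE)   # angle -> horizontal axis
# Which edge ends up "up" depends on the overhead camera mounting.

S = np.random.rand(256, 1024)                             # stand-in significance map
tiles = S.reshape(32, 8, 128, 8).transpose(0, 2, 1, 3)    # (32, 128) grid of 8x8 tiles
tile_max = tiles.max(axis=(2, 3))                         # keep each tile's peak response
print(pano.shape, tile_max.shape)                         # (256, 1024, 3) (32, 128)
```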
https://arxiv.org/abs/2503.14228
Accurate transformation estimation between camera space and robot space is essential. Traditional hand-eye calibration methods that use markers require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation-feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross-end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and the target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities arising from partial observations and symmetry, a pose optimization algorithm enhanced with multiple historical key frames is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
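The final step of the pipeline is standard enough to sketch: given 2D-3D correspondences (random stand-ins below; in FEEPE they come from matching foundation features between the CAD model and the image), the 6D pose follows from RANSAC PnP.

```python
# Sketch: 6D pose from 2D-3D correspondences via RANSAC PnP.
import cv2
import numpy as np

object_pts = np.random.rand(50, 3).astype(np.float32)           # CAD-frame 3D points
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], float)  # camera intrinsics

# Project through a known ground-truth pose to fabricate consistent 2D points.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.0, 0.0, 1.5])
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_pts, image_pts, K, None, reprojectionError=2.0)
print(ok, rvec.ravel(), tvec.ravel())   # should recover rvec_gt, tvec_gt
```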
https://arxiv.org/abs/2503.14051
Visual localization is considered one of the crucial components in many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are burdensome. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture called A-ScoRe, an attention-based model that applies attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-ScoRe to operate in different kinds of environments and conditions and achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are publicly available in our repository: this https URL.
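A minimal sketch of the stated core idea, attention at the descriptor-map level: treat each cell of a CNN descriptor map as a token and run self-attention so descriptors carry spatial context. Layer sizes are assumptions; the actual A-ScoRe architecture has more structure.

```python
# Sketch: self-attention over a CNN descriptor map.
import torch
from torch import nn

class DescriptorAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:   # (B, C, H, W)
        b, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)             # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)           # spatial context
        tokens = self.norm(tokens + out)                     # residual + norm
        return tokens.transpose(1, 2).view(b, c, h, w)

fmap = torch.randn(2, 128, 30, 40)    # descriptor map from a CNN backbone
refined = DescriptorAttention()(fmap)
print(refined.shape)                  # torch.Size([2, 128, 30, 40])
```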
https://arxiv.org/abs/2503.13982
The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods require dataset-specific training and thus cannot generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of an LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and to enhance the execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all previous methods by a large margin.
https://arxiv.org/abs/2503.13966
Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these, methods that assume a continuous action space typically rely on a Gaussian distribution assumption, which limits the flexibility of generated actions. Meanwhile, the application of diffusion models to reinforcement learning has advanced, allowing for more flexible action distributions than Gaussian-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by leveraging the characteristics of diffusion models, we propose an extension that enables post-training action smoothing and adaptation to static obstacle scenarios not considered during training.
https://arxiv.org/abs/2503.13934
The capability of effectively moving on complex terrains such as sand and gravel can empower our robots to robustly operate in outdoor environments, and assist with critical tasks such as environment monitoring, search-and-rescue, and supply delivery. Inspired by the Mount Lyell salamander's ability to curl its body into a loop and effectively roll down hill slopes, in this study we develop a sand-rolling robot and investigate how its locomotion performance is governed by the shape of its body. We experimentally tested three different body shapes: Hexagon, Quadrilateral, and Triangle. We found that Hexagon and Triangle can achieve a faster rolling speed on sand, but exhibited more frequent failures of getting stuck. Analysis of the interaction between robot and sand revealed the failure mechanism: the deformation of the sand produced a local "sand incline" underneath robot contact segments, increasing the effective region of supporting polygon (ERSP) and preventing the robot from shifting its center of mass (CoM) outside the ERSP to produce sustainable rolling. Based on this mechanism, a highly-simplified model successfully captured the critical body pitch for each rolling shape to produce sustained rolling on sand, and informed design adaptations that mitigated the locomotion failures and improved robot speed by more than 200%. Our results provide insights into how locomotors can utilize different morphological features to achieve robust rolling motion across deformable substrates.
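For intuition, here is a hedged reconstruction of the flavor of such a tipping condition (the abstract does not give the paper's exact model): treating the body as a regular $n$-gon that must pivot over its leading support edge, sustained rolling requires the body pitch $\theta$ to carry the CoM past the vertical through that edge, and a local sand incline of angle $\alpha$ raises the threshold:

\[
  \theta_{\mathrm{crit}} \;\approx\; \frac{\pi}{n} + \alpha,
  \qquad \text{sustained rolling when } \theta > \theta_{\mathrm{crit}}.
\]

Under this reading, the Hexagon ($n=6$) needs less pitch per roll than the Triangle ($n=3$), consistent with its faster rolling, while larger sand deformation (larger $\alpha$) raises the bar for every shape.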
https://arxiv.org/abs/2503.13919
Robots that operate autonomously in human living environments need the ability to handle various tasks flexibly. One crucial element is coordinated bimanual movement, which enables functions that are difficult to perform with one hand alone. In recent years, learning-based models that focus on the possibilities of bimanual movements have been proposed. However, the robot's high degree of freedom makes it challenging to reason about control, and the left and right arms need to adjust their actions depending on the situation, making more dexterous tasks difficult to realize. To address this issue, we focus on coordination and efficiency between both arms, particularly for synchronized actions. We therefore propose a novel imitation learning architecture that predicts cooperative actions. We differentiate the architecture for the two arms and add an intermediate encoder layer, the Inter-Arm Coordinated transformer Encoder (IACE), which facilitates synchronization and temporal alignment to ensure smooth and coordinated actions. To verify the effectiveness of our architecture, we perform distinctive bimanual tasks. The experimental results showed that our model achieves a high success rate compared with baselines, suggesting a suitable architecture for policy learning of bimanual manipulation.
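A compact sketch of the described layout: separate per-arm encoders feeding an intermediate transformer encoder over the concatenated left/right token streams, so each arm can attend to the other before per-arm action heads. Dimensions and depths are assumptions, not the paper's exact configuration.

```python
# Sketch: per-arm encoders + inter-arm transformer encoder + per-arm heads.
import torch
from torch import nn

class BimanualPolicy(nn.Module):
    def __init__(self, obs_dim: int = 64, d: int = 128, act_dim: int = 7):
        super().__init__()
        self.left_enc = nn.Linear(obs_dim, d)     # arm-specific encoders
        self.right_enc = nn.Linear(obs_dim, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.iace = nn.TransformerEncoder(layer, num_layers=2)  # inter-arm stage
        self.left_head = nn.Linear(d, act_dim)
        self.right_head = nn.Linear(d, act_dim)

    def forward(self, left_obs, right_obs):       # (B, T, obs_dim) each
        tokens = torch.cat([self.left_enc(left_obs),
                            self.right_enc(right_obs)], dim=1)  # joint sequence
        fused = self.iace(tokens)                 # arms attend to each other
        T = left_obs.shape[1]
        return self.left_head(fused[:, :T]), self.right_head(fused[:, T:])

l, r = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
al, ar = BimanualPolicy()(l, r)
print(al.shape, ar.shape)   # torch.Size([2, 10, 7]) each
```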
https://arxiv.org/abs/2503.13916
This paper presents the design of a biomimetic robotic squid (dubbed URSULA) developed for dexterous underwater manipulation. The robot serves as a test bed for several novel underwater technologies, such as soft manipulators, propeller-less propulsion, model-mediated teleoperation with video and haptic feedback, sonar-based underwater mapping, localization, and navigation, and high-bandwidth visible light communications. Following the finalization of the detailed design, a prototype was manufactured and is currently undergoing pool tests.
https://arxiv.org/abs/2503.13913
The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.
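A hypothetical record layout for one multi-target, multi-granularity instruction, illustrating the hierarchical object/part structure the dataset provides (field names and values are illustrative; the released MMR data defines the real schema):

```python
# Illustrative (not official) MMR-style record: one implicit instruction with
# multiple candidate target objects, each carrying part-level masks.
sample = {
    "image": "living_room_0412.jpg",
    "instruction": "I want to turn on the TV from the couch.",
    "targets": [
        {"object": "remote control", "mask": "rle:...",
         "parts": [{"name": "power button", "mask": "rle:..."}]},
        {"object": "television", "mask": "rle:...",
         "parts": [{"name": "power button", "mask": "rle:..."}]},
    ],
}
print(len(sample["targets"]), "candidate targets")
```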
https://arxiv.org/abs/2503.13881
3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model's continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
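A minimal sketch of farthest-point-sampling exemplar selection over per-sample feature embeddings: greedily keep the sample farthest from everything chosen so far, which preserves intra-class diversity in the replay buffer. The features below are random stand-ins, and the paper's downsampling-based replay is omitted.

```python
# Sketch: FPS-style exemplar selection for a class-incremental replay buffer.
import numpy as np

def fps_select(feats: np.ndarray, k: int) -> list:
    chosen = [0]                                    # arbitrary seed exemplar
    d = np.linalg.norm(feats - feats[0], axis=1)    # distance to chosen set
    for _ in range(k - 1):
        idx = int(d.argmax())                       # farthest remaining sample
        chosen.append(idx)
        d = np.minimum(d, np.linalg.norm(feats - feats[idx], axis=1))
    return chosen

feats = np.random.rand(500, 256)   # embeddings of one class's point clouds
exemplars = fps_select(feats, k=20)
print(exemplars[:5])
```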
https://arxiv.org/abs/2503.13869
Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences, improving preference accuracy by approximately 15-20% on Meta-World tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized based on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on Meta-World demonstrate that our method achieves, for instance, around a 70-80% success rate across all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
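A sketch of the two ingredients in loss form: a Bradley-Terry preference loss on segment returns under the learned reward, plus a term tying the reward model to the current policy's performance. The exact regularizer is an assumption; the abstract only states that agent performance is incorporated.

```python
# Sketch: VLM-labeled preference loss + agent-performance regularizer.
import torch
import torch.nn.functional as F

def reward_loss(r_a, r_b, prefer_a, ret_pred, ret_actual, lam=0.1):
    # r_a, r_b: (B, T) per-step learned rewards of the two compared segments
    logits = r_a.sum(1) - r_b.sum(1)                # segment return difference
    pref = F.binary_cross_entropy_with_logits(logits, prefer_a.float())
    reg = F.mse_loss(ret_pred, ret_actual)          # match current-policy returns
    return pref + lam * reg

r_a, r_b = torch.randn(8, 50), torch.randn(8, 50)
prefer_a = torch.randint(0, 2, (8,))                # VLM preference labels
loss = reward_loss(r_a, r_b, prefer_a, torch.randn(8), torch.randn(8))
print(float(loss))
```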
https://arxiv.org/abs/2503.13817