Abstract
The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: this https URL
URL
https://arxiv.org/abs/2602.09017