Abstract
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
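For readers unfamiliar with flow matching, the following is a minimal sketch of the generic training target and sampling loop, not the architecture proposed in the paper. It assumes a straight-line path from Gaussian noise (t = 0) to a data sample (t = 1), which is one common convention and may differ from the one the authors use. The function names, the array shapes (a batch of 4 action chunks, 8 steps, 7 dimensions), and the oracle velocity field are all hypothetical and chosen for illustration. Conditioning on images, language, and robot state, which a real policy would need, is omitted.

```python
import numpy as np


def flow_matching_targets(actions, noise, t):
    """Return the interpolated point x_t and the target velocity u.

    Assumed convention: x_t = (1 - t) * noise + t * actions, so the
    velocity along this straight path is constant: u = actions - noise.
    """
    # Reshape t from (batch,) to (batch, 1, ..., 1) so it broadcasts.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (actions.ndim - 1)))
    x_t = (1.0 - t) * noise + t * actions
    u = actions - noise
    return x_t, u


def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Hypothetical data: random arrays standing in for action chunks.
rng = np.random.default_rng(0)
actions = rng.normal(size=(4, 8, 7))
noise = rng.normal(size=actions.shape)
t = rng.uniform(size=4)

x_t, u = flow_matching_targets(actions, noise, t)

# A trained network would predict u from (x_t, t, observation).
# Here an oracle that already knows the answer stands in for it.
oracle = lambda x, t: actions - noise
sample = euler_sample(oracle, noise, num_steps=10)
```

In training, the loss would compare a network's predicted velocity at `x_t` against `u`. At inference time only the sampling loop is needed, starting from fresh noise. With the oracle above, the Euler loop recovers `actions` up to floating-point error, which is a convenient sanity check for the sign and time conventions.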
URL
https://arxiv.org/abs/2410.24164