Abstract
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable because they lack language instructions, while emerging VLA benchmarks that incorporate language offer only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining actually contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution. Moreover, fine-tuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
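To make the structure of the probing suite concrete (50 simulation tasks grouped into 10 subcategories across the language, vision, and object axes), here is a minimal sketch of how such a suite could be enumerated and scored per axis. All names below, including the subcategory labels and the ProbeTask, build_suite, and evaluate_policy helpers, are hypothetical illustrations and do not correspond to the released code.

# Hypothetical sketch: enumerate a 10-subcategory x 5-task probing suite
# and report per-axis success rates. Not the paper's released API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProbeTask:
    name: str            # e.g. "paraphrase_00"
    subcategory: str     # one of the 10 probing subcategories
    axis: str            # "language" | "vision" | "object"

# Assumed subcategory layout; the abstract only states 10 subcategories
# spanning the three axes, not this exact split.
SUBCATEGORIES: Dict[str, List[str]] = {
    "language": ["paraphrase", "distractor_instruction", "novel_verb"],
    "vision":   ["lighting", "camera_pose", "background", "texture"],
    "object":   ["novel_object", "novel_color", "novel_position"],
}

def build_suite(tasks_per_subcategory: int = 5) -> List[ProbeTask]:
    """Enumerate 10 subcategories x 5 tasks = 50 probing tasks."""
    suite: List[ProbeTask] = []
    for axis, subs in SUBCATEGORIES.items():
        for sub in subs:
            for i in range(tasks_per_subcategory):
                suite.append(ProbeTask(f"{sub}_{i:02d}", sub, axis))
    return suite

def evaluate_policy(policy: Callable[[ProbeTask], bool],
                    suite: List[ProbeTask]) -> Dict[str, float]:
    """Run a success/failure policy callback on every task and aggregate
    success rates per generalization axis. Separating 'good intentions'
    from execution failures would additionally require per-rollout logs."""
    per_axis: Dict[str, List[bool]] = {}
    for task in suite:
        per_axis.setdefault(task.axis, []).append(policy(task))
    return {axis: sum(results) / len(results) for axis, results in per_axis.items()}

if __name__ == "__main__":
    suite = build_suite()
    assert len(suite) == 50
    # Stand-in policy that only "succeeds" on two familiar-looking subcategories.
    dummy_policy = lambda task: task.subcategory in ("paraphrase", "lighting")
    print(evaluate_policy(dummy_policy, suite))

Reporting results per axis rather than as a single aggregate score mirrors the abstract's goal of separating perceptual and planning generalization from action execution, though the actual metrics used by the benchmark may differ.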
URL
https://arxiv.org/abs/2506.09930