Abstract
Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (MLLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
URL
https://arxiv.org/abs/2505.15685