Abstract
Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This field is exploding with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our \href{this https URL}{project page}.
URL
https://arxiv.org/abs/2512.11362