Abstract
Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos, conditioned on multi-modal user inputs, that capture fine-grained interactions between agents and their environments. These capabilities address many long-standing challenges faced by physics-based simulators, driving broad adoption across problem domains such as robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way, overcoming the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we review video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and reward modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, including poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as the significant costs of data curation, training, and inference. We present potential future directions for addressing these open challenges, to motivate further research and ultimately facilitate broader applications, especially in safety-critical settings.
URL
https://arxiv.org/abs/2601.07823