Abstract
The task of estimating a world model that describes the dynamics of a real-world process is of immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures given a few frames of a video that set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video with a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLMs to synthesize programs that: (i) estimate the states of the video given the visual context (i.e., the frames); (ii) predict the states corresponding to future time steps from the inferred transition dynamics; (iii) render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
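To make the three-program pipeline concrete, below is a minimal, self-contained Python sketch for a Cart Pole-style environment. All function names, constants, and the toy perception/rendering scheme are hypothetical illustrations of the kind of programs ProgGen synthesizes, not the paper's actual generated code; the transition step uses the standard Cart Pole equations of motion.

```python
# Hypothetical sketch of ProgGen's three synthesized programs for a
# Cart Pole-style environment. Names and the pixel-level perception /
# rendering scheme are illustrative assumptions, not the paper's code.
import numpy as np

TAU = 0.02          # integration step (s), a standard Cart Pole setting
GRAVITY, M_CART, M_POLE, POLE_LEN = 9.8, 1.0, 0.1, 0.5

def estimate_state(frame: np.ndarray) -> tuple[float, float]:
    """Program (i): recover (cart position, pole angle) from an RGB frame.

    Toy perception: locate red cart pixels and blue pole pixels in frames
    produced by render_state below. Velocities would be recovered by
    finite differences across consecutive frames.
    """
    h, w, _ = frame.shape
    cart_cols = np.where((frame[:, :, 0] > 200) & (frame[:, :, 2] < 100))[1]
    pole_rows, pole_cols = np.where(frame[:, :, 2] > 200)
    x = (cart_cols.mean() / w - 0.5) * 4.8               # pixels -> metres
    theta = np.arctan2(pole_cols.mean() - cart_cols.mean(),
                       h * 0.7 - pole_rows.mean())
    return x, theta

def transition(state, force: float = 0.0):
    """Program (ii): one Euler step of the classic Cart Pole dynamics."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    total_m = M_CART + M_POLE
    temp = (force + M_POLE * POLE_LEN * theta_dot**2 * sin_t) / total_m
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_LEN * (4.0 / 3.0 - M_POLE * cos_t**2 / total_m))
    x_acc = temp - M_POLE * POLE_LEN * theta_acc * cos_t / total_m
    return (x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc)

def render_state(state, h: int = 64, w: int = 64) -> np.ndarray:
    """Program (iii): rasterize a state into an RGB frame."""
    x, _, theta, _ = state
    frame = np.zeros((h, w, 3), dtype=np.uint8)
    cx = int((x / 4.8 + 0.5) * w)                        # cart centre column
    cy = int(h * 0.7)
    frame[cy:cy + 4, max(cx - 4, 0):cx + 4, 0] = 255     # red cart
    tip = (int(cy - 20 * np.cos(theta)), int(cx + 20 * np.sin(theta)))
    for t in np.linspace(0, 1, 20):                      # blue pole
        r, c = int(cy + t * (tip[0] - cy)), int(cx + t * (tip[1] - cx))
        if 0 <= r < h and 0 <= c < w:
            frame[r, c, 2] = 255
    return frame

# Roll the estimated dynamics forward to synthesize plausible future frames.
state = (0.0, 0.0, 0.05, 0.0)  # (x, x_dot, theta, theta_dot)
future_frames = [render_state(state := transition(state)) for _ in range(10)]
```

Because the predicted states are explicit variables rather than latent codes, counterfactuals amount to editing a state (e.g., flipping the pole angle) before rolling the dynamics forward, which is what makes the generated videos interpretable.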
URL
https://arxiv.org/abs/2505.14948