Abstract
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model built on a unified Vision-Language Model (VLM) and diffusion architecture. This combination couples strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signal from this noisy video-caption data, we rely on careful data preprocessing together with our unified architecture and strong image pretraining. The trained model is then extended to two distinct downstream tasks in control and generation. Evaluations on language-driven robotic manipulation and video generation establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-diffusion architecture and of scalable learning from diverse web data for future optical flow prediction.
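The abstract describes the high-level design only: a VLM fuses the current frame with a language instruction, and a diffusion head generates the future optical flow field from that conditioning. Below is a minimal, illustrative sketch of such an interface. All module names, tensor shapes, and the DDPM-style sampler are assumptions for exposition; the paper does not specify its actual backbone, flow resolution, or training recipe.

```python
# Minimal sketch of a language-conditioned future optical flow forecaster in the
# spirit of a unified VLM + diffusion design. Everything here (shapes, modules,
# sampler) is an illustrative assumption, not the FOFPred implementation.
import torch
import torch.nn as nn

class VLMEncoder(nn.Module):
    """Stand-in for a pretrained vision-language backbone: fuses the current
    frame and the tokenized instruction into a sequence of conditioning tokens."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.img_proj = nn.Linear(3 * 16 * 16, dim)      # toy patch embedding
        self.txt_emb = nn.Embedding(vocab, dim)
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, image, text_ids):
        b = image.shape[0]
        # (B, 3, 128, 128) -> (B, 64 patches, 3*16*16)
        patches = image.unfold(2, 16, 16).unfold(3, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, 64, -1)
        tokens = torch.cat([self.img_proj(patches), self.txt_emb(text_ids)], dim=1)
        return self.fuser(tokens)                         # (B, N, dim)

class FlowDenoiser(nn.Module):
    """Predicts the noise added to a 2-channel future flow map, conditioned on
    VLM tokens via cross-attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.in_proj = nn.Conv2d(2, dim, 1)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Conv2d(dim, 2, 1)
        self.t_emb = nn.Embedding(1000, dim)

    def forward(self, noisy_flow, t, cond):
        b, _, h, w = noisy_flow.shape
        x = self.in_proj(noisy_flow).flatten(2).transpose(1, 2)   # (B, HW, dim)
        x = x + self.t_emb(t)[:, None, :]                          # timestep embedding
        x, _ = self.cross(x, cond, cond)                           # attend to VLM tokens
        return self.out_proj(x.transpose(1, 2).reshape(b, -1, h, w))

@torch.no_grad()
def sample_flow(encoder, denoiser, image, text_ids, steps=50):
    """Toy DDPM-style ancestral sampler producing one future flow field."""
    cond = encoder(image, text_ids)
    flow = torch.randn(image.shape[0], 2, 32, 32)
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    for t in torch.linspace(999, 0, steps).long():
        eps = denoiser(flow, t.expand(image.shape[0]), cond)
        mean = (flow - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) \
               / torch.sqrt(1.0 - betas[t])
        noise = torch.randn_like(flow) if t > 0 else torch.zeros_like(flow)
        flow = mean + torch.sqrt(betas[t]) * noise
    return flow                                           # (B, 2, 32, 32) flow field

# Usage with random stand-in inputs (a tokenized instruction such as "open the drawer"):
encoder, denoiser = VLMEncoder(), FlowDenoiser()
img = torch.randn(1, 3, 128, 128)
text = torch.randint(0, 1000, (1, 8))
future_flow = sample_flow(encoder, denoiser, img, text)
```

The sketch only illustrates the interface implied by the abstract: image plus instruction in, dense future motion out, with the diffusion process supplying the pixel-level generative fidelity.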
URL
https://arxiv.org/abs/2601.10781