Abstract
GPT has achieved remarkable success in natural language processing. However, language sequences alone are not sufficient to describe the spatial-temporal details of the visual world, whereas video sequences capture such details well. Motivated by this fact, we propose a concise Video-GPT that treats video as a new language for visual world modeling. By analogy with next-token prediction in GPT, we introduce a novel next-clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key factor for world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning video generation and understanding, demonstrating strong generalization to downstream applications. The project page is at this https URL.
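The abstract describes next-clip diffusion only at a high level. As an illustration (not the authors' released code), the minimal PyTorch-style sketch below shows one way autoregressive clip denoising conditioned on clean history clips could look; the `ToyClipDenoiser` architecture, the number of denoising steps, and the linear timestep schedule are all assumptions made for this sketch.

```python
# Illustrative sketch of "next clip diffusion" (assumed details, not the paper's code):
# given clean history clips, a denoiser predicts the next clip by iterative denoising,
# and the generated clip is appended to the history before predicting the following one.
import torch
import torch.nn as nn


class ToyClipDenoiser(nn.Module):
    """Hypothetical denoiser: (noisy next clip, clean history, timestep) -> cleaner clip."""

    def __init__(self, clip_dim: int):
        super().__init__()
        self.net = nn.Linear(2 * clip_dim + 1, clip_dim)

    def forward(self, noisy_clip, history, t):
        # Condition on a simple summary of the clean history (mean over past clips).
        context = history.mean(dim=1)                      # (B, clip_dim)
        t_feat = t.expand(noisy_clip.shape[0], 1)          # (B, 1)
        return self.net(torch.cat([noisy_clip, context, t_feat], dim=-1))


@torch.no_grad()
def rollout(denoiser, history, num_future_clips=4, num_steps=10):
    """Autoregressively generate future clips, one clip at a time, by denoising."""
    B, _, D = history.shape
    for _ in range(num_future_clips):
        clip = torch.randn(B, D)                           # start the next clip from noise
        for step in reversed(range(num_steps)):
            t = torch.tensor([[step / num_steps]])         # assumed linear schedule
            clip = denoiser(clip, history, t)              # one denoising update
        history = torch.cat([history, clip.unsqueeze(1)], dim=1)  # append as clean history
    return history


if __name__ == "__main__":
    D = 16                                                 # toy clip embedding size
    model = ToyClipDenoiser(D)
    past = torch.randn(2, 3, D)                            # batch of 2, 3 history clips each
    print(rollout(model, past).shape)                      # torch.Size([2, 7, 16])
```

This toy rollout only conveys the control flow of short-term generation versus long-term prediction (denoise one clip, then condition on it for the next); the paper's actual denoiser, noise schedule, and clip representation are not specified in the abstract.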
URL
https://arxiv.org/abs/2505.12489