Abstract
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) that leverages the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time-consistent; and (ii) reprogramming frame-level self-attention with a new cross-frame attention in which each frame attends to the first frame, preserving the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but also applies to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably to, and sometimes better than, recent approaches, despite not being trained on additional video data. Our code will be open sourced at: this https URL.
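To make the two modifications concrete, below is a minimal PyTorch sketch, based only on the abstract's description, of (i) motion-dynamics enrichment of the per-frame latent codes via a global translation of the first frame's latent, and (ii) cross-frame attention where every frame attends to the keys and values of the first frame. The helper names (warp_latent, motion_dynamics_latents, cross_frame_attention) and the tensor layout are our own illustrative assumptions, not the paper's API; the paper's full method may differ in detail.

```python
# Hedged sketch of the abstract's two zero-shot modifications.
# Assumes a Stable-Diffusion-style UNet whose self-attention layers expose
# query/key/value tensors of shape (batch * num_frames, tokens, dim).
import torch
import torch.nn.functional as F

def warp_latent(latent, dx, dy):
    """Translate a latent code by (dx, dy) latent pixels to inject
    global motion dynamics. latent: (C, H, W)."""
    c, h, w = latent.shape
    # Affine grid for a pure translation; grid_sample works in
    # normalized coordinates in [-1, 1].
    theta = torch.tensor([[1.0, 0.0, -2.0 * dx / w],
                          [0.0, 1.0, -2.0 * dy / h]],
                         dtype=latent.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, (1, c, h, w), align_corners=False)
    return F.grid_sample(latent.unsqueeze(0), grid,
                         padding_mode="reflection",
                         align_corners=False).squeeze(0)

def motion_dynamics_latents(first_latent, num_frames, delta=(4.0, 4.0)):
    """Derive all frame latents from the first frame's initial noise by a
    linearly growing translation, keeping scene and background consistent."""
    return torch.stack([warp_latent(first_latent, k * delta[0], k * delta[1])
                        for k in range(num_frames)])

def cross_frame_attention(q, k, v, num_frames):
    """Self-attention variant in which every frame attends to the keys and
    values of the FIRST frame, preserving foreground appearance/identity."""
    bf, tokens, dim = k.shape
    b = bf // num_frames
    k = k.reshape(b, num_frames, tokens, dim)
    v = v.reshape(b, num_frames, tokens, dim)
    # Broadcast frame 0's keys/values to all frames of each video.
    k = k[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)
    v = v[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)
    return F.scaled_dot_product_attention(q, k, v)
```

In this sketch the warped latents replace independently sampled noise for each frame, and cross_frame_attention would be patched in place of each self-attention call in the denoising UNet; both changes are training-free, which is what makes the approach zero-shot.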
URL
https://arxiv.org/abs/2303.13439