Abstract
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
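The "repeat-and-slide" idea can be illustrated with a toy sketch of the windowing logic alone. The abstract does not give the window size, the frame representation, or the interface to the diffusion model, so everything named here (`repeat_and_slide`, `window`, `generate_next`, `toy_generator`) is a hypothetical stand-in rather than the authors' implementation. The DDPM inversion and resampling steps are deliberately left out; `generate_next` only marks where the frozen T2V model's guided reverse denoising would run.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # toy stand-in for an image or latent frame


def repeat_and_slide(
    first_frame: Frame,
    num_new_frames: int,
    window: int,
    generate_next: Callable[[Sequence[Frame]], Frame],
) -> List[Frame]:
    """Extend a video frame by frame from one conditioning image (assumed scheme)."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # "Repeat": fill the conditioning window with copies of the given image.
    queue: Deque[Frame] = deque(
        (list(first_frame) for _ in range(window)), maxlen=window
    )
    video: List[Frame] = [list(first_frame)]

    for _ in range(num_new_frames):
        # Hypothetical callback: stands in for the frozen model synthesizing
        # one new frame conditioned on the frames currently in the window.
        new_frame = generate_next(tuple(queue))
        video.append(new_frame)
        # "Slide": appending to a full deque with maxlen drops the oldest frame.
        queue.append(new_frame)

    return video


def toy_generator(context: Sequence[Frame]) -> Frame:
    """Placeholder generator: adds 1.0 to every value of the latest frame."""
    return [value + 1.0 for value in context[-1]]


video = repeat_and_slide(
    [0.0, 0.0], num_new_frames=3, window=2, generate_next=toy_generator
)
print(video)
```

Because each new frame depends only on a fixed-size window of earlier frames, the loop can run for as many frames as needed, which is consistent with the abstract's remark that the autoregressive design supports long video generation.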
URL
https://arxiv.org/abs/2404.16306