TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
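The autoregressive "repeat-and-slide" loop described above can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the window length, the linear noise schedule, the re-noising rule used for resampling, and all names (`repeat_and_slide`, `forward_noise`, `toy_denoise_step`) are hypothetical choices made for illustration, and the denoiser is a stand-in that ignores the text prompt. In practice `denoise_step` would be one reverse step of a frozen pretrained T2V diffusion model.

```python
import numpy as np


def forward_noise(x0, alpha_bar_t, rng):
    """Sample from the DDPM forward process q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def toy_denoise_step(x_t, t, prompt):
    """Hypothetical stand-in for one reverse step of a frozen T2V model.

    It only pulls each frame toward the clip mean; `t` and `prompt` are unused.
    """
    return 0.9 * x_t + 0.1 * x_t.mean(axis=0, keepdims=True)


def repeat_and_slide(image, prompt, num_new_frames, window=4, num_steps=10,
                     resample=2, denoise_step=toy_denoise_step, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    alpha_bar = np.cumprod(1.0 - betas)

    # "Repeat": fill the conditioning window with copies of the given image.
    queue = np.repeat(image[None], window, axis=0)
    video = [image]

    for _ in range(num_new_frames):
        # Inversion-style initialization (assumed form): start the new frame
        # from a noised copy of the latest known frame, not from pure noise.
        new = forward_noise(queue[-1], alpha_bar[-1], rng)

        for t in range(num_steps - 1, -1, -1):
            for u in range(resample):
                # Overwrite the known frames with their noised versions at
                # level t, then let the model denoise the whole clip.
                known = forward_noise(queue, alpha_bar[t], rng)
                x_t = np.concatenate([known, new[None]], axis=0)
                new = denoise_step(x_t, t, prompt)[-1]
                if u < resample - 1:
                    # Resampling (assumed form): re-noise by one step and
                    # repeat the denoising at the same level t.
                    eps = rng.standard_normal(new.shape)
                    new = np.sqrt(1.0 - betas[t]) * new + np.sqrt(betas[t]) * eps

        video.append(new)
        # "Slide": drop the oldest frame and append the newly generated one.
        queue = np.concatenate([queue[1:], new[None]], axis=0)

    return np.stack(video)


if __name__ == "__main__":
    first_frame = np.full((2, 2, 3), 0.5)
    clip = repeat_and_slide(first_frame, "a woman is drinking water.", 3)
    print(clip.shape)
```

Because each new frame is conditioned only on the sliding window, the loop can in principle run for any number of frames, which is what the abstract means by the autoregressive design supporting long video generation.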

URL

https://arxiv.org/abs/2404.16306

PDF

https://arxiv.org/pdf/2404.16306.pdf
