TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

2024-04-25 03:21:11
Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
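To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of the "repeat-and-slide" generation loop it describes. It assumes a frozen T2V U-Net callable as `unet(x, t, text_emb)` over latent clips of shape (B, C, F, H, W) and a DDPM-style `scheduler` exposing `add_noise` and `step`; these names, the window length, and the per-step re-noising are illustrative assumptions, not the authors' actual API, and the paper's resampling refinement is omitted for brevity.

```python
import torch

# Hypothetical sketch of a "repeat-and-slide" autoregressive loop for a
# frozen text-to-video diffusion model, loosely following the abstract.
# `unet`, `scheduler`, and all shapes are illustrative assumptions.

@torch.no_grad()
def repeat_and_slide(unet, scheduler, first_frame_latent, text_emb,
                     num_new_frames, clip_len=16):
    # "Repeat": fill the conditioning window with copies of the given
    # image's latent so the frozen model sees a full clip from the start.
    window = [first_frame_latent] * (clip_len - 1)
    generated = [first_frame_latent]

    for _ in range(num_new_frames):
        known = torch.stack(window, dim=2)               # (B, C, F-1, H, W)
        new = torch.randn_like(first_frame_latent)       # frame to synthesize
        x = torch.cat([known, new.unsqueeze(2)], dim=2)  # (B, C, F, H, W)

        for t in scheduler.timesteps:
            # DDPM inversion: re-noise the known frames to the current
            # noise level so they sit on the same schedule as the new frame.
            noise = torch.randn_like(known)
            noisy_known = scheduler.add_noise(known, noise, t)

            # Overwrite the first F-1 frames with the re-noised known
            # frames; only the last frame carries denoising state forward.
            x = torch.cat([noisy_known, x[:, :, -1:]], dim=2)

            eps = unet(x, t, text_emb)                   # frozen T2V prior
            x = scheduler.step(eps, t, x).prev_sample

        new_frame = x[:, :, -1]
        generated.append(new_frame)
        window = window[1:] + [new_frame]                # "slide" step

    return torch.stack(generated, dim=2)
```

In this sketch, re-noising the known frames at every reverse step (rather than denoising them alongside the new frame) is what lets a frozen, unconditioned-on-image model stay anchored to the provided image, and the sliding window is what makes arbitrarily long autoregressive generation possible.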

URL

https://arxiv.org/abs/2404.16306

PDF

https://arxiv.org/pdf/2404.16306.pdf

