Abstract
Research on diffusion model-based video generation has advanced rapidly, but limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains such as animated wallpapers require seamless looping, where the first and last frames of the video match. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance from textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions in the diffusion model. Existing UNet-based video generation models require the entire video as input during training so that temporal and positional information can be encoded at once; due to GPU memory limitations, however, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the number of frames while reducing the set of fine-tuned modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM), which extends the capacity for encoding temporal and positional information to 36 frames. LoopAnimate is thus the first to extend the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
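The three-stage training strategy can be pictured as a schedule in which each stage lengthens the clips while narrowing the set of trainable modules. The sketch below is a minimal illustration under assumed stage boundaries and module names (the abstract only states that frames increase toward 36 and fine-tuned modules decrease); none of these specific values come from the paper.

```python
# Hypothetical sketch of a three-stage progressive training schedule:
# frame count grows per stage while fewer modules remain trainable.
# Stage frame counts and module names are illustrative assumptions.

STAGES = [
    # (frames per clip, modules left trainable in this stage)
    (16, ["unet", "image_encoder", "motion_module"]),  # stage 1: short clips, broad fine-tune
    (24, ["image_encoder", "motion_module"]),          # stage 2: longer clips, fewer modules
    (36, ["motion_module"]),                           # stage 3: longest clips, motion module only
]


def stage_config(stage_idx):
    """Return (num_frames, trainable_modules) for a 0-based stage index."""
    frames, modules = STAGES[stage_idx]
    return frames, modules


if __name__ == "__main__":
    for i in range(len(STAGES)):
        frames, modules = stage_config(i)
        print(f"stage {i + 1}: {frames} frames, tuning {modules}")
```

In an actual training loop, the modules outside the stage's trainable set would have their parameters frozen (e.g. via `requires_grad_(False)` in PyTorch) before that stage begins.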
URL
https://arxiv.org/abs/2404.09172