
LoopAnimate: Loopable Salient Object Animation

2024-04-14 07:36:18
Fanyi Wang, Peng Liu, Haotian Hu, Dan Meng, Jingwen Su, Jinjin Xu, Yanhao Zhang, Xiaoming Ren, Zhiwang Zhang

Abstract

Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains such as animated wallpapers require seamless looping, where the first and last frames of the video match. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training so that temporal and positional information can be encoded in a single pass. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy that progressively increases the number of frames while reducing the number of fine-tuned modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. LoopAnimate is thereby the first to extend the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
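The abstract describes TEMM and the progressive training schedule only at a high level. As a rough illustration of the underlying idea, the sketch below shows a PyTorch temporal self-attention module whose positional table is sized for 36 frames, so that shorter clips from earlier training stages and the full-length clips of the final stage can share one module. All names, dimensions, the choice of sinusoidal encoding, and the per-stage frame counts are assumptions made for illustration; they are not the paper's actual TEMM design.

```python
# Minimal sketch: temporal attention over the frame axis with a positional
# table large enough for 36 frames. Assumptions (not from the paper): module
# structure, sinusoidal encoding, and the example stage frame counts.
import math
import torch
import torch.nn as nn

def sinusoidal_positions(max_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal position table of shape (max_len, dim); dim must be even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    table = torch.zeros(max_len, dim)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

class TemporalAttention(nn.Module):
    """Self-attention across frames, with positional encoding up to max_frames."""
    def __init__(self, dim: int, heads: int = 8, max_frames: int = 36):
        super().__init__()
        self.register_buffer("pos", sinusoidal_positions(max_frames, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial positions, frames, dim); the frame count may vary
        # across training stages as long as it stays within max_frames.
        f = x.shape[1]
        h = self.norm(x) + self.pos[:f].unsqueeze(0)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual, so the module can be added to a pretrained UNet

# Usage with a hypothetical three-stage schedule of increasing clip lengths
# (the abstract gives only the 16-frame baseline and the 36-frame capacity):
module = TemporalAttention(dim=320, max_frames=36)
for frames in (16, 24, 36):
    clip = torch.randn(2, frames, 320)
    print(module(clip).shape)  # torch.Size([2, frames, 320])
```

Because the positional table is allocated once at the full 36-frame capacity and sliced per clip, the same weights serve every stage of a progressive schedule; only the amount of temporal context grows.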

Abstract (translated)

Research on diffusion model-based video generation has progressed rapidly. However, limitations in object fidelity and generation length restrict its applications. In addition, specific domains that require seamless looping, such as animated wallpapers, demand that the first and last frames of the video match. To address these challenges, this paper proposes a new method named LoopAnimate for generating videos with consistent start and end frames. To improve object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building on an image-to-image diffusion model, our method draws pixel-level and feature-level information from the input image and injects image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models must take the entire video as input during training. However, due to GPU memory limits, the number of frames is typically restricted to 16. To overcome this, this paper proposes a three-stage training strategy in which the number of frames gradually increases while the fine-tuned modules are reduced. In addition, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information to 36 frames. The proposed LoopAnimate is the first to extend the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments show that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluations.

URL

https://arxiv.org/abs/2404.09172

PDF

https://arxiv.org/pdf/2404.09172.pdf

