Paper Reading AI Learner

Animate Your Motion: Turning Still Images into Dynamic Videos

2024-03-15 10:36:24
Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars

Abstract

In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions, promoting synergy between different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
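The abstract describes conditioning a video diffusion model jointly on a scene image (semantic cue) and per-frame object bounding boxes (motion cue), trained in two stages that separate the two condition types. Below is a minimal, self-contained sketch of that idea in PyTorch. All module names, tensor shapes, the way the two condition tokens are combined, and the choice of what each training stage uses are illustrative assumptions, not the authors' SMCD implementation.

# Minimal sketch (not the authors' code): feeding a scene image and box-trajectory
# conditions jointly to a toy video diffusion denoiser, with a two-stage training step.
import torch
import torch.nn as nn

class ToyMotionEncoder(nn.Module):
    """Embeds per-frame object bounding boxes (x1, y1, x2, y2) into motion tokens."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(4, dim)

    def forward(self, boxes):              # boxes: (B, T, N, 4)
        return self.proj(boxes)            # (B, T, N, dim)

class ToySceneEncoder(nn.Module):
    """Embeds the conditioning image into one scene token per batch item."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, image):               # image: (B, 3, H, W)
        return self.backbone(image).flatten(1)   # (B, dim)

class ToyDenoiser(nn.Module):
    """Stand-in for the diffusion UNet: predicts noise from the noisy latent plus
    scene and motion tokens (summed here purely for brevity)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, noisy_latent, t, scene_tok, motion_tok):
        # noisy_latent: (B, T, dim); scene_tok: (B, dim); motion_tok: (B, T, N, dim)
        cond = scene_tok[:, None, :] + motion_tok.mean(dim=2)
        return self.net(noisy_latent + cond)

def training_step(denoiser, scene_enc, motion_enc, latent, image, boxes, stage):
    """Two-stage idea from the abstract: stage 1 trains with the motion condition
    only, stage 2 adds the scene condition (which parts are frozen per stage is
    an assumption and omitted here)."""
    t = torch.randint(0, 1000, (latent.size(0),))
    noise = torch.randn_like(latent)
    noisy = latent + noise                   # noise schedule omitted for brevity
    motion_tok = motion_enc(boxes)
    if stage == 2:
        scene_tok = scene_enc(image)
    else:
        scene_tok = torch.zeros(latent.size(0), latent.size(-1))
    pred = denoiser(noisy, t, scene_tok, motion_tok)
    return nn.functional.mse_loss(pred, noise)

# Example call with hypothetical shapes: 2 clips, 8 frames, 3 tracked boxes per frame.
B, T, N, dim = 2, 8, 3, 128
loss = training_step(ToyDenoiser(dim), ToySceneEncoder(dim), ToyMotionEncoder(dim),
                     torch.randn(B, T, dim), torch.randn(B, 3, 64, 64),
                     torch.rand(B, T, N, 4), stage=2)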

Abstract (translated)

In recent years, diffusion models have made remarkable progress in text-to-video generation, creating demand for control over video outputs that more accurately reflects user intentions. Traditional efforts have focused mainly on using either semantic cues (such as images or depth maps) or motion-based conditions (such as moving sketches or object bounding boxes). Semantic inputs provide rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model, as shown in Figure 1. To this end, we introduce Scene and Motion Conditional Diffusion (SMCD), a novel method for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrating scene conditions, promoting synergy between the different modalities. For model training, we separate the conditions of the two modalities and introduce a two-stage training pipeline. Experimental results show that our design significantly improves video quality, motion precision, and semantic coherence.

URL

https://arxiv.org/abs/2403.10179

PDF

https://arxiv.org/pdf/2403.10179.pdf

