Abstract
In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly employ either semantic cues, such as images or depth maps, or motion-based conditions, such as moving sketches or object bounding boxes. Semantic inputs offer rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig. 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrating scene conditions, promoting synergy between the different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
URL
https://arxiv.org/abs/2403.10179