Abstract
In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly employ either semantic cues, such as images or depth maps, or motion-based conditions, such as moving sketches or object bounding boxes. Semantic inputs offer rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig. 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrating scene conditions, promoting synergy between the different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
URL
https://arxiv.org/abs/2403.10179