Abstract
Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to outputs that are misaligned with the user's creative intent. To fill these gaps, we introduce \modelname{}, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls, which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches that encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
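The abstract describes the components only at a high level; the sketch below is a minimal, hypothetical PyTorch illustration of two of them under assumed shapes and names (`rasterize_point_controls`, `ControlBranch`, `DualBranchGuidedDenoiser`, and all tensor dimensions are our own stand-ins, not the paper's API): (1) rendering heterogeneous motion controls such as trajectories, target regions, and depth ordering into one sparse point-based map aligned with the video/noise input, and (2) a two-branch design in which motion controls and content controls are encoded separately before guiding the denoiser.

```python
# Minimal sketch, assuming a latent-video denoiser with shape (B, C, T, H, W).
# The backbone below is a single Conv3d stand-in for the DiT; the real model,
# control channels, and fusion scheme in the paper may differ.
import torch
import torch.nn as nn


def rasterize_point_controls(points, values, T, H, W, C):
    """Render sparse point controls into a dense (C, T, H, W) map.

    points: list of (t, y, x) integer coordinates of control points.
    values: list of length-C feature vectors (e.g. motion direction, depth order).
    Locations without a control point stay zero, keeping the map sparse.
    """
    ctrl = torch.zeros(C, T, H, W)
    for (t, y, x), v in zip(points, values):
        ctrl[:, t, y, x] = torch.as_tensor(v, dtype=ctrl.dtype)
    return ctrl


class ControlBranch(nn.Module):
    """Encodes one family of controls into features matching the latent video."""

    def __init__(self, in_ch, latent_ch):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(in_ch, latent_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(latent_ch, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.encode(x)


class DualBranchGuidedDenoiser(nn.Module):
    """Wraps a video denoiser with separate motion and content control branches."""

    def __init__(self, backbone, latent_ch, motion_ch, content_ch):
        super().__init__()
        self.backbone = backbone
        self.motion_branch = ControlBranch(motion_ch, latent_ch)
        self.content_branch = ControlBranch(content_ch, latent_ch)

    def forward(self, noisy_latent, motion_map, content_map):
        # Each branch encodes its own controls; their features guide denoising
        # additively here, which is one simple choice among several possible.
        guided = (noisy_latent
                  + self.motion_branch(motion_map)
                  + self.content_branch(content_map))
        return self.backbone(guided)


if __name__ == "__main__":
    B, latent_ch, T, H, W = 1, 8, 9, 32, 32
    # Toy controls: two trajectory points carrying (dx, dy, depth-order) features.
    motion_map = rasterize_point_controls(
        points=[(0, 4, 4), (8, 20, 24)],
        values=[(1.0, 0.5, 0.0), (-0.5, 1.0, 1.0)],
        T=T, H=H, W=W, C=3,
    ).unsqueeze(0)                            # (1, 3, T, H, W)
    content_map = torch.zeros(B, 4, T, H, W)  # e.g. encoded text/region hints
    backbone = nn.Conv3d(latent_ch, latent_ch, 3, padding=1)  # DiT stand-in
    model = DualBranchGuidedDenoiser(backbone, latent_ch, motion_ch=3, content_ch=4)
    out = model(torch.randn(B, latent_ch, T, H, W), motion_map, content_map)
    print(out.shape)  # torch.Size([1, 8, 9, 32, 32])
```

The point-based map makes every motion control, regardless of modality, look the same to the network, which is what the abstract means by a common sparse representation; separating the branches lets content cues and motion cues be weighted and trained at their own granularity.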
URL
https://arxiv.org/abs/2510.08561