Abstract
We present VIDIM, a generative model for video interpolation that creates short videos given a start and end frame. To achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models: it first generates the target video at low resolution, then generates the high-resolution video conditioned on the low-resolution output. We compare VIDIM to previous state-of-the-art methods for video interpolation, and demonstrate that those methods fail in most settings where the underlying motion is complex, nonlinear, or ambiguous, while VIDIM handles such cases with ease. We additionally show how classifier-free guidance on the start and end frames, together with conditioning the super-resolution model on the original high-resolution frames without additional parameters, unlocks high-fidelity results. VIDIM is fast to sample from, as it jointly denoises all of the frames to be generated; it requires fewer than a billion parameters per diffusion model to produce compelling results, and it still enjoys scalability and improved quality at larger parameter counts.
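The classifier-free guidance mentioned above can be illustrated with a minimal sketch: the denoiser is queried once with the start/end conditioning frames and once with that conditioning dropped, and the two noise predictions are extrapolated by a guidance weight. This is a generic illustration of the technique, not VIDIM's actual implementation; `toy_denoiser`, `cfg_predict`, and the zeroing-out of frames as the "dropped" conditioning are all assumptions for the sake of the example.

```python
import numpy as np

def toy_denoiser(x_t, t, cond_frames):
    # Stand-in for a video diffusion U-Net: predicts noise from the noisy
    # frames plus the (possibly dropped) start/end conditioning frames.
    # This linear toy model is purely illustrative.
    return 0.5 * x_t + 0.1 * cond_frames

def cfg_predict(denoiser, x_t, t, cond_frames, w):
    """Classifier-free guidance on the start/end frames (generic sketch).

    w = 1 recovers the conditional prediction; w > 1 extrapolates away
    from the unconditional (conditioning-dropped) prediction.
    """
    eps_cond = denoiser(x_t, t, cond_frames)
    # Here "dropping" the conditioning is modeled by zeroed frames;
    # real models typically train with a learned null embedding instead.
    eps_uncond = denoiser(x_t, t, np.zeros_like(cond_frames))
    return eps_uncond + w * (eps_cond - eps_uncond)

x_t = np.ones((2, 4, 4, 3))   # toy noisy video: 2 frames, 4x4 pixels, RGB
cond = np.ones_like(x_t)      # toy start/end frame conditioning
eps = cfg_predict(toy_denoiser, x_t, t=0, cond_frames=cond, w=2.0)
```

With this toy denoiser, the conditional prediction is 0.6 everywhere and the unconditional one is 0.5, so `w=2` pushes the guided prediction to 0.7, further from the unconditional estimate than the conditional one alone.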
URL
https://arxiv.org/abs/2404.01203