Abstract
Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
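To make the conditional-generation framing concrete, below is a minimal sketch of one common way to condition a diffusion sampler on observed video pixels: at each reverse step the masked region is denoised by the model while the unmasked region is replaced with a suitably noised copy of the observed frames (RePaint-style replacement conditioning). This is an illustrative assumption, not necessarily the paper's exact conditioning mechanism; `denoiser`, the tensor shapes, and the linear-beta schedule are all hypothetical.

```python
# Minimal sketch (not the paper's implementation): conditioning a DDPM-style
# reverse process on observed pixels by re-imposing the unmasked region,
# noised to the current level, at every denoising step.
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Toy linear-beta schedule, chosen only for illustration.
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

@torch.no_grad()
def inpaint(denoiser, video, mask, T=1000):
    """video: (B, C, F, H, W) observed frames; mask: 1 where content must be synthesized."""
    betas, alphas, alpha_bars = make_schedule(T)
    x = torch.randn_like(video)  # start the masked video from pure noise
    for t in reversed(range(T)):
        a_t, ab_t = alphas[t], alpha_bars[t]
        # Predict the noise and take one ancestral sampling step.
        t_batch = torch.full((video.shape[0],), t, dtype=torch.long)
        eps = denoiser(x, t_batch)  # hypothetical noise-prediction network
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
        # Condition on the context: keep observed pixels, noised to level t-1.
        if t > 0:
            ab_prev = alpha_bars[t - 1]
            known = torch.sqrt(ab_prev) * video + torch.sqrt(1 - ab_prev) * torch.randn_like(video)
        else:
            known = video
        x = mask * x + (1 - mask) * known
    return x
```

Because sampling starts from random noise, repeated calls yield diverse inpaintings of the same masked region, which is the property the abstract highlights over flow- or attention-based propagation.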
URL
https://arxiv.org/abs/2405.00251