Abstract
Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model that automatically generates transitions based on textual descriptions. Given images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks, such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three criteria for assessing smooth and creative transitions: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: this https URL.
URL
https://arxiv.org/abs/2310.20700