Abstract
Recent advances in diffusion models have significantly improved text-to-image generation. However, generating video from text is a more challenging task than generating images from text, because it demands much larger datasets and far higher computational cost. Most existing video generation methods either use a 3D U-Net architecture that models the temporal dimension or rely on autoregressive generation. These methods require large paired datasets and incur substantially higher computational costs than text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. By representing a video as a single grid image, we can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces the dimensionality of video to that of an image, various image-based methods can be applied to videos, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating its suitability for real-world video generation.
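The grid representation at the heart of the method can be illustrated with a short sketch. The following Python snippet is a minimal illustration, not the authors' code: the function names (video_to_grid, grid_to_video), the (T, H, W, C) tensor layout, and the 4x4 grid size are assumptions chosen for the example. It shows how T frames can be tiled into one image, so a standard image diffusion model can operate on all frames at once, and how the individual frames are recovered afterwards.

    import numpy as np

    def video_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
        # Tile T = rows * cols frames of shape (H, W, C) into one image
        # of shape (rows * H, cols * W, C). (Hypothetical helper for
        # illustration; layout and names are assumptions.)
        t, h, w, c = frames.shape
        assert t == rows * cols, "frame count must fill the grid exactly"
        grid = frames.reshape(rows, cols, h, w, c)   # (rows, cols, H, W, C)
        grid = grid.transpose(0, 2, 1, 3, 4)         # (rows, H, cols, W, C)
        return grid.reshape(rows * h, cols * w, c)

    def grid_to_video(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
        # Invert video_to_grid: split the grid image back into T frames.
        gh, gw, c = grid.shape
        h, w = gh // rows, gw // cols
        frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
        return frames.reshape(rows * cols, h, w, c)

    # Round-trip check on a dummy 16-frame clip laid out as a 4x4 grid.
    video = np.random.rand(16, 64, 64, 3).astype(np.float32)
    grid = video_to_grid(video, rows=4, cols=4)      # (256, 256, 3)
    assert np.array_equal(grid_to_video(grid, 4, 4), video)

Under this assumed layout, the memory footprint of generation is set by the resolution of the grid image rather than by the number of frames, which is consistent with the fixed-GPU-memory claim in the abstract.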
URL
https://arxiv.org/abs/2404.00234