Abstract
Recent advances in diffusion models have significantly improved text-to-image generation. However, generating video from text is a more challenging task than generating images from text, because it demands much larger datasets and far higher computational cost. Most existing video generation methods either use a 3D U-Net architecture that models the temporal dimension or rely on autoregressive generation. These methods require large paired datasets and incur substantially higher computational costs than text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. By representing a video as a single grid image, we can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces the dimensionality of video to that of an image, various image-based methods can be applied to videos, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating its suitability for real-world video generation.
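The grid representation at the heart of the method can be illustrated with a short sketch. The following Python snippet is a minimal illustration, not the authors' code: the function names (video_to_grid, grid_to_video), the (T, H, W, C) tensor layout, and the 4x4 grid size are assumptions chosen for the example. It shows how T frames can be tiled into one image, so a standard image diffusion model can operate on all frames at once, and how the individual frames are recovered afterwards.

    import numpy as np

    def video_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
        # Tile T = rows * cols frames of shape (H, W, C) into one image
        # of shape (rows * H, cols * W, C). (Hypothetical helper for
        # illustration; layout and names are assumptions.)
        t, h, w, c = frames.shape
        assert t == rows * cols, "frame count must fill the grid exactly"
        grid = frames.reshape(rows, cols, h, w, c)   # (rows, cols, H, W, C)
        grid = grid.transpose(0, 2, 1, 3, 4)         # (rows, H, cols, W, C)
        return grid.reshape(rows * h, cols * w, c)

    def grid_to_video(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
        # Invert video_to_grid: split the grid image back into T frames.
        gh, gw, c = grid.shape
        h, w = gh // rows, gw // cols
        frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
        return frames.reshape(rows * cols, h, w, c)

    # Round-trip check on a dummy 16-frame clip laid out as a 4x4 grid.
    video = np.random.rand(16, 64, 64, 3).astype(np.float32)
    grid = video_to_grid(video, rows=4, cols=4)      # (256, 256, 3)
    assert np.array_equal(grid_to_video(grid, 4, 4), video)

Under this assumed layout, the memory footprint of generation is set by the resolution of the grid image rather than by the number of frames, which is consistent with the fixed-GPU-memory claim in the abstract.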
URL
https://arxiv.org/abs/2404.00234