Seer: Language Instructed Video Prediction with Latent Diffusion Models

2023-03-27 03:12:24
Xianfan Gu, Chuan Wen, Jiaming Song, Yang Gao

Abstract

Imagining future trajectories is key for robots to plan soundly and reach their goals. Text-conditioned video prediction (TVP), i.e., predicting future video frames from a given language instruction and reference frames, is therefore an essential task for general robot policy learning. It is highly challenging: grounding task-level goals specified by instructions while generating high-fidelity frames requires large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, built by inflating a pretrained text-to-image (T2I) Stable Diffusion model along the temporal axis. We inflate the denoising U-Net and the language-conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge of the pretrained T2I model across frames. With this architecture, Seer generates high-fidelity, coherent, and instruction-aligned video frames after fine-tuning only a few layers on a small amount of data. Experimental results on the Something Something V2 (SSv2) and Bridgedata datasets demonstrate superior video prediction performance with around 210 hours of training on four RTX 3090 GPUs: Seer lowers the FVD of the current SOTA model from 290 to 200 on SSv2 and wins at least 70% preference in human evaluation.
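
The abstract names its two inflation techniques without implementation detail. As a rough, hypothetical illustration of the first, the PyTorch sketch below shows one plausible way to add an autoregressive temporal attention layer alongside the frozen spatial attention of a pretrained T2I U-Net: spatial tokens are folded into the batch, and a causal mask restricts each frame to attending only to itself and earlier frames. The module name, tensor layout, and masking scheme are all assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


class AutoregressiveTemporalAttention(nn.Module):
    """Hypothetical sketch of autoregressive temporal attention: a
    temporal self-attention layer inserted next to the pretrained
    spatial attention of a T2I U-Net, with a causal mask so each frame
    attends only to itself and earlier frames."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, spatial_tokens, dim) -- features stacked
        # along a new temporal axis after the frozen spatial layers.
        b, t, s, d = x.shape
        # Fold spatial tokens into the batch so attention runs over time.
        x = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        # Causal mask: True entries are disallowed, so frame i cannot
        # attend to any frame j > i (autoregressive conditioning).
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=mask)
        # Restore the (batch, frames, spatial_tokens, dim) layout.
        return out.reshape(b, s, t, d).permute(0, 2, 1, 3)
```

Training only such newly added temporal layers (and the text-conditioning path) while keeping the pretrained spatial weights frozen would be what makes the approach sample- and computation-efficient, consistent with the abstract's claim of fine-tuning a few layers on a small amount of data.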


URL

https://arxiv.org/abs/2303.14897

PDF

https://arxiv.org/pdf/2303.14897.pdf

