Seer: Language Instructed Video Prediction with Latent Diffusion Models

2023-03-27 03:12:24
Xianfan Gu, Chuan Wen, Jiaming Song, Yang Gao

Abstract

Imagining future trajectories is key for robots to plan soundly and reach their goals. Text-conditioned video prediction (TVP), i.e., predicting future video frames from a given language instruction and reference frames, is therefore an essential task for general robot policy learning. It is highly challenging: grounding task-level goals specified by instructions while generating high-fidelity frames requires large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, built by inflating a pretrained text-to-image (T2I) Stable Diffusion model along the temporal axis. We inflate the denoising U-Net and the language-conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge of the pretrained T2I model across frames. With this architecture, Seer generates high-fidelity, coherent, and instruction-aligned video frames after fine-tuning only a few layers on a small amount of data. Experimental results on the Something Something V2 (SSv2) and Bridgedata datasets demonstrate superior video prediction performance with around 210 hours of training on four RTX 3090 GPUs: Seer lowers the FVD of the current SOTA model from 290 to 200 on SSv2 and wins at least 70% preference in human evaluation.
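
The abstract names its two inflation techniques without implementation detail. As a rough, hypothetical illustration of the first, the PyTorch sketch below shows one plausible way to add an autoregressive temporal attention layer alongside the frozen spatial attention of a pretrained T2I U-Net: spatial tokens are folded into the batch, and a causal mask restricts each frame to attending only to itself and earlier frames. The module name, tensor layout, and masking scheme are all assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


class AutoregressiveTemporalAttention(nn.Module):
    """Hypothetical sketch of autoregressive temporal attention: a
    temporal self-attention layer inserted next to the pretrained
    spatial attention of a T2I U-Net, with a causal mask so each frame
    attends only to itself and earlier frames."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, spatial_tokens, dim) -- features stacked
        # along a new temporal axis after the frozen spatial layers.
        b, t, s, d = x.shape
        # Fold spatial tokens into the batch so attention runs over time.
        x = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        # Causal mask: True entries are disallowed, so frame i cannot
        # attend to any frame j > i (autoregressive conditioning).
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=mask)
        # Restore the (batch, frames, spatial_tokens, dim) layout.
        return out.reshape(b, s, t, d).permute(0, 2, 1, 3)
```

Training only such newly added temporal layers (and the text-conditioning path) while keeping the pretrained spatial weights frozen would be what makes the approach sample- and computation-efficient, consistent with the abstract's claim of fine-tuning a few layers on a small amount of data.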


URL

https://arxiv.org/abs/2303.14897

PDF

https://arxiv.org/pdf/2303.14897.pdf

