Abstract
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice, TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression is to leverage more recent pre-trained text-to-video (T2V) models. That approach is made more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we call Frame-wise Conditioning Adaptation (FCA). Within FCA, we devise a sub-module that produces frame-wise text embeddings from the input text, which act as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategies for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices, with quantitative and qualitative performance analysis. Our approach establishes a new state of the art for the task of TVP. The project page is at this https URL.
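To make the frame-wise conditioning idea concrete, below is a minimal sketch of what such a sub-module could look like, as we read it from the abstract: a small adapter expands a single pooled text embedding into one embedding per video frame, which can then be injected into the T2V backbone (e.g., appended to the cross-attention context of each frame). All names and design details here (FramewiseTextAdapter, learned frame queries, the MLP fusion, num_frames) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a frame-wise text-conditioning adapter.
# Assumption: the paper's FCA module produces per-frame text embeddings
# from a single text embedding; the exact architecture is not given in
# the abstract, so this uses learned per-frame queries fused by an MLP.
import torch
import torch.nn as nn


class FramewiseTextAdapter(nn.Module):
    """Expands a pooled text embedding into one embedding per frame."""

    def __init__(self, text_dim: int, num_frames: int):
        super().__init__()
        # Learned per-frame offsets let the text condition evolve over
        # time (e.g., the progress of an object manipulation).
        self.frame_queries = nn.Parameter(
            torch.randn(num_frames, text_dim) * 0.02
        )
        self.mlp = nn.Sequential(
            nn.Linear(text_dim * 2, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim) -> (batch, num_frames, text_dim)
        b = text_emb.size(0)
        t = self.frame_queries.size(0)
        txt = text_emb.unsqueeze(1).expand(b, t, -1)
        queries = self.frame_queries.unsqueeze(0).expand(b, -1, -1)
        return self.mlp(torch.cat([txt, queries], dim=-1))


# Usage: the resulting (batch, num_frames, text_dim) tensor could be
# concatenated with the original text tokens in each cross-attention
# layer; the abstract notes the authors compare injection strategies.
adapter = FramewiseTextAdapter(text_dim=768, num_frames=16)
per_frame = adapter(torch.randn(2, 768))  # shape: (2, 16, 768)
```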
URL
https://arxiv.org/abs/2503.12953