Abstract
Video prediction generates future frames from historical frames and has shown great potential in many applications, e.g., meteorological forecasting and autonomous driving. Previous works often decode only the final high-level semantic features into future frames, discarding texture details and thus degrading prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module that enhances the layer-wise semantic dependency among the feature maps of the U-shaped Translator by coupling low-level visual cues with high-level features, thereby enriching the texture details of the predicted frames. Moreover, most existing methods capture spatiotemporal dynamics with the Translator but fail to fully exploit the spatial features of the Encoder. This inspires us to design a Spatial Masking (SM) module that masks part of the encoded features during pretraining, increasing the visibility of the remaining feature pixels to the Decoder. Together, these form the Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction, which captures the spatiotemporal dynamics that reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available on GitHub.
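The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two ideas as described: a pair-wise attention that lets a high-level feature map attend to a paired low-level map, and a random spatial mask applied to encoder features during pretraining. All class and function names here (`PairwiseLayerAttention`, `spatial_mask`), the use of multi-head cross-attention, and the assumption that paired feature maps share channel and spatial dimensions are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn

class PairwiseLayerAttention(nn.Module):
    """Hypothetical sketch: a high-level feature map queries a paired
    low-level feature map, re-injecting texture cues via cross-attention."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # high, low: (B, C, H, W) feature maps from a pair of Translator layers;
        # assumed to have matching shapes for illustration.
        B, C, H, W = high.shape
        q = high.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from high-level
        kv = low.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values from low-level
        out, _ = self.attn(q, kv, kv)
        out = self.norm(out + q)              # residual connection
        return out.transpose(1, 2).reshape(B, C, H, W)

def spatial_mask(feat: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of Spatial Masking: zero out a random subset of
    feature pixels during pretraining so the Decoder must reconstruct frames
    from the remaining visible positions."""
    B, _, H, W = feat.shape
    keep = (torch.rand(B, 1, H, W, device=feat.device) > mask_ratio).float()
    return feat * keep
```

Under this reading, PLA is applied per layer pair inside the Translator, while `spatial_mask` would sit between Encoder and Decoder only in the pretraining stage; the actual masking granularity and ratio in the paper may differ.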
URL
https://arxiv.org/abs/2311.11289