Abstract
Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its strength in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face notable issues: (a) the multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads; each chunk captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space and results in a semantic dilution problem; (b) these models predict the embeddings of the next frames rather than the frames themselves, yet the loss function is based on the errors of the reconstructed frames, not the predicted embeddings, creating a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture that effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.
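The following is a minimal illustrative sketch (not the paper's code) of the two issues the abstract describes: standard MHSA splits the $d_{model}$-dimensional embedding into per-head chunks of size $d_{model}/N$, and a latent-space objective compares predicted embeddings rather than reconstructed frames. All dimensions, names, and the simple MSE loss are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation); dimensions are illustrative.
import torch
import torch.nn as nn

class StandardMHSA(nn.Module):
    """Standard multi-head self-attention: the d_model-dim embedding is split
    into num_heads chunks of size d_model // num_heads, so each head attends
    over only a fraction of the embedding -- the source of the semantic
    dilution issue described in the abstract."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads            # e.g. 512 / 8 = 64
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                             # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape so each head sees only a d_head-dimensional slice.
        def split(t):
            return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)

def latent_loss(z_pred, z_next):
    """Hypothetical latent-space objective: the error is computed on the
    predicted next-frame embedding z_pred against the target embedding
    z_next, aligning the training signal with the model's actual output
    instead of with reconstructed frames."""
    return torch.mean((z_pred - z_next) ** 2)
```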
URL
https://arxiv.org/abs/2501.16753