Paper Reading AI Learner

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

2023-10-29 16:25:32
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain a massive number of visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and its scalability to longer input frames yields significant performance gains, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
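The abstract describes reducing visual tokens by merging similar neighbouring frames (and, analogously, similar patches within a frame). As a rough illustration only, here is a minimal sketch of that aggregation idea: a greedy loop that repeatedly averages the most similar adjacent token pair until a target count is reached. The function names and the greedy adjacent-pair strategy are assumptions for illustration, not TESTA's actual algorithm.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors (small epsilon avoids divide-by-zero)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def aggregate(tokens, keep):
    """Greedily merge the most similar *adjacent* token pair by averaging
    until only `keep` tokens remain. `tokens` has shape (n, d).
    This is an illustrative stand-in for frame/patch aggregation."""
    toks = [t for t in tokens]
    sizes = [1] * len(toks)  # how many original tokens each merged token covers
    while len(toks) > keep:
        sims = [cosine(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]
        i = int(np.argmax(sims))
        # size-weighted average keeps the merged token an unbiased mean
        s = sizes[i] + sizes[i + 1]
        toks[i] = (sizes[i] * toks[i] + sizes[i + 1] * toks[i + 1]) / s
        sizes[i] = s
        del toks[i + 1], sizes[i + 1]
    return np.stack(toks)

# toy example: 8 "frame" tokens reduced to 2, a 75% reduction as in the abstract
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
reduced = aggregate(frames, keep=2)
print(reduced.shape)  # (2, 16)
```

In the paper's setting the same kind of reduction is applied along both the temporal axis (frames) and the spatial axis (patches) inside each encoder block; the sketch above shows only a single one-dimensional pass.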


URL

https://arxiv.org/abs/2310.19060

PDF

https://arxiv.org/pdf/2310.19060.pdf

