Paper Reading AI Learner

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

2023-10-29 16:25:32
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain a massive number of visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and its scalability to longer input frames yields significant performance gains, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
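The abstract describes reducing visual tokens by merging similar neighbouring frames (and, analogously, similar patches within a frame). As a rough illustration only, here is a minimal sketch of that aggregation idea: a greedy loop that repeatedly averages the most similar adjacent token pair until a target count is reached. The function names and the greedy adjacent-pair strategy are assumptions for illustration, not TESTA's actual algorithm.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors (small epsilon avoids divide-by-zero)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def aggregate(tokens, keep):
    """Greedily merge the most similar *adjacent* token pair by averaging
    until only `keep` tokens remain. `tokens` has shape (n, d).
    This is an illustrative stand-in for frame/patch aggregation."""
    toks = [t for t in tokens]
    sizes = [1] * len(toks)  # how many original tokens each merged token covers
    while len(toks) > keep:
        sims = [cosine(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]
        i = int(np.argmax(sims))
        # size-weighted average keeps the merged token an unbiased mean
        s = sizes[i] + sizes[i + 1]
        toks[i] = (sizes[i] * toks[i] + sizes[i + 1] * toks[i + 1]) / s
        sizes[i] = s
        del toks[i + 1], sizes[i + 1]
    return np.stack(toks)

# toy example: 8 "frame" tokens reduced to 2, a 75% reduction as in the abstract
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
reduced = aggregate(frames, keep=2)
print(reduced.shape)  # (2, 16)
```

In the paper's setting the same kind of reduction is applied along both the temporal axis (frames) and the spatial axis (patches) inside each encoder block; the sketch above shows only a single one-dimensional pass.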


URL

https://arxiv.org/abs/2310.19060

PDF

https://arxiv.org/pdf/2310.19060.pdf

