
LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

2024-04-01 17:54:34
Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen

Abstract

Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Fusion that strategically combines the outputs of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on the standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
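The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea: lightweight adapters over a frozen backbone's intermediate snippet features, one branch with a short temporal range and one with a long temporal range, whose outputs are fused across layers before a TAL head. The class names, kernel sizes, bottleneck MLP, and softmax layer fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class RangeAdapter(nn.Module):
    """Lightweight adapter over one temporal range (illustrative, not the paper's exact module).

    A depthwise temporal convolution sets how far in time each snippet feature can look
    (small kernel = short range, large kernel = long range), followed by a standard
    bottleneck MLP as in common adapter designs.
    """

    def __init__(self, dim: int, temporal_kernel: int, bottleneck: int = 64):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=temporal_kernel,
                                  padding=temporal_kernel // 2, groups=dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C) snippet features
        h = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        return self.up(self.act(self.down(h)))


class LongShortRangeSideNetwork(nn.Module):
    """Sketch of a memory-efficient side network in the spirit of LoSA.

    The frozen backbone is run once under torch.no_grad(); only the adapters and fusion
    weights receive gradients, so backbone activations never need to be stored for
    backpropagation.
    """

    def __init__(self, dim: int, num_layers: int, short_kernel: int = 3, long_kernel: int = 31):
        super().__init__()
        self.short = nn.ModuleList([RangeAdapter(dim, short_kernel) for _ in range(num_layers)])
        self.long = nn.ModuleList([RangeAdapter(dim, long_kernel) for _ in range(num_layers)])
        # Learnable per-layer weights to fuse adapted features across backbone depth.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, intermediate_feats):  # list of (B, T, C), one per adapted backbone layer
        adapted = [s(f) + l(f) for s, l, f in zip(self.short, self.long, intermediate_feats)]
        w = torch.softmax(self.layer_weights, dim=0)
        return sum(wi * a for wi, a in zip(w, adapted))  # fused (B, T, C) features for the TAL head


if __name__ == "__main__":
    # Stand-in for intermediate features of a frozen video backbone (shapes are illustrative):
    # 4 adapted layers, batch of 2 videos, 128 snippets, 1024-dim features.
    feats = [torch.randn(2, 128, 1024) for _ in range(4)]
    side = LongShortRangeSideNetwork(dim=1024, num_layers=4)
    fused = side(feats)
    print(fused.shape)  # torch.Size([2, 128, 1024]) -> input to a TAL head
```

Because gradients flow only through the side network, GPU memory scales with the small adapters rather than the billion-parameter backbone, which matches the memory-efficiency motivation stated in the abstract.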


URL

https://arxiv.org/abs/2404.01282

PDF

https://arxiv.org/pdf/2404.01282.pdf

