Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

2023-11-27 12:39:42
Huanjin Yao, Wenhao Wu, Zhiheng Li

Abstract

Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be computationally prohibitive. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and to transferring larger models to the video domain. In this paper, we present Side4Video, a novel spatial-temporal side network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we attach a lightweight spatial-temporal side network to the frozen vision model; this avoids backpropagation through the heavy pre-trained model while exploiting multi-level spatial features from the original image model. The extremely memory-efficient architecture reduces memory usage by 75% compared to previous adapter-based methods, allowing us to transfer a huge ViT-E (4.4B parameters), 14x larger than ViT-L (304M), to video understanding tasks. Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval): 67.3% and 74.6% on Something-Something V1 & V2, 88.6% on Kinetics-400, 52.3% on MSR-VTT, 56.1% on MSVD, and 68.8% on VATEX. We release our code at this https URL.
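
The mechanism described in the abstract can be made concrete with a short sketch. The code below is a minimal PyTorch illustration, not the authors' released implementation: the frozen ViT runs forward-only while a narrow, trainable side path fuses its intermediate features and adds temporal modeling. The block design (SideBlock), the side width (side_dim=320), the depth-wise temporal convolution, and the per-level linear fusion are all illustrative assumptions; consult the linked code release for the actual design.

import torch
import torch.nn as nn


class SideBlock(nn.Module):
    # One lightweight spatial-temporal side block (hypothetical design):
    # a depth-wise temporal convolution for motion, followed by a small MLP.
    def __init__(self, dim, num_frames):
        super().__init__()
        self.num_frames = num_frames
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, x):                      # x: (B*T, N, C) tokens per frame
        bt, n, c = x.shape
        b = bt // self.num_frames
        t = x.view(b, self.num_frames, n, c)   # (B, T, N, C)
        t = t.permute(0, 2, 3, 1).reshape(b * n, c, self.num_frames)
        t = self.temporal(t)                   # mix information across frames
        t = t.reshape(b, n, c, self.num_frames).permute(0, 3, 1, 2).reshape(bt, n, c)
        x = x + t
        return x + self.mlp(self.norm(x))


class Side4VideoSketch(nn.Module):
    # Frozen image backbone + trainable side path fed by multi-level features.
    def __init__(self, backbone_blocks, embed_dim=768, side_dim=320, num_frames=8):
        super().__init__()
        self.backbone_blocks = backbone_blocks
        for p in self.backbone_blocks.parameters():
            p.requires_grad_(False)            # the heavy model is never updated
        self.input_proj = nn.Linear(embed_dim, side_dim)
        self.level_proj = nn.ModuleList(
            nn.Linear(embed_dim, side_dim) for _ in backbone_blocks
        )
        self.side_blocks = nn.ModuleList(
            SideBlock(side_dim, num_frames) for _ in backbone_blocks
        )

    def forward(self, tokens):                 # tokens: (B*T, N, embed_dim)
        with torch.no_grad():                  # forward-only: no backbone
            feats, x = [], tokens              # activations kept for backprop
            for blk in self.backbone_blocks:
                x = blk(x)
                feats.append(x)
        side = self.input_proj(tokens)
        for proj, blk, f in zip(self.level_proj, self.side_blocks, feats):
            side = blk(side + proj(f))         # fuse each level's spatial features
        return side.mean(dim=1)                # (B*T, side_dim), per-frame feature

A toy forward pass, standing in a generic transformer layer for the frozen image model:

frames = 8
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(4)
)
model = Side4VideoSketch(blocks, num_frames=frames)
tokens = torch.randn(2 * frames, 196, 768)     # 2 clips, 8 frames, 14x14 patches
features = model(tokens)                       # shape (16, 320)

Because the backbone runs under torch.no_grad(), none of its activations are stored for the backward pass; gradients flow only through the narrow side path. Under these assumptions, that is the property that keeps training memory low enough to scale the frozen backbone up to models like ViT-E.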

URL

https://arxiv.org/abs/2311.15769

PDF

https://arxiv.org/pdf/2311.15769.pdf

