DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

2024-01-19 09:58:06
Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang

Abstract

Text-video retrieval is a critical multi-modal task: finding the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models as model sizes grow continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level, general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we model video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to full finetuning on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at this https URL.
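
A rough sketch may help make the two ideas named in the abstract concrete: (1) local text and frame prompts generated from one shared latent space so the two modalities interact, and (2) a global-local attention in which a learnable global token attends over all frame features to form a video-level representation. The PyTorch snippet below is a minimal illustration under these assumptions, not the authors' implementation; every module name, dimension, prompt count, and the single-query attention layout are illustrative choices.

```python
# Minimal sketch of the two mechanisms the abstract describes, assuming
# CLIP-like encoders with 512-d features. Not the authors' code.
import torch
import torch.nn as nn


class SharedPromptGenerator(nn.Module):
    """Generate text and frame prompts from ONE shared latent space,
    so the two modalities' prompts are coupled (illustrative)."""

    def __init__(self, dim: int = 512, n_prompts: int = 4):
        super().__init__()
        self.shared_latent = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.to_text = nn.Linear(dim, dim)   # shared latent -> text prompts
        self.to_frame = nn.Linear(dim, dim)  # shared latent -> frame prompts

    def forward(self):
        text_prompts = self.to_text(self.shared_latent)    # (P, D)
        frame_prompts = self.to_frame(self.shared_latent)  # (P, D)
        return text_prompts, frame_prompts


class GlobalLocalVideoAttention(nn.Module):
    """A learnable global token queries all frame-level features to
    aggregate video-level information; a simplified stand-in for the
    paper's global-local attention."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.global_prompt = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame features from the visual encoder
        b = frame_feats.size(0)
        g = self.global_prompt.expand(b, -1, -1)        # (B, 1, D)
        video_feat, _ = self.attn(g, frame_feats, frame_feats)
        return video_feat.squeeze(1)                    # (B, D) video feature


if __name__ == "__main__":
    gen = SharedPromptGenerator()
    txt_p, frm_p = gen()
    print(txt_p.shape, frm_p.shape)     # torch.Size([4, 512]) twice

    video_attn = GlobalLocalVideoAttention()
    frames = torch.randn(2, 12, 512)    # 2 clips, 12 frames each
    print(video_attn(frames).shape)     # torch.Size([2, 512])
```

Only the prompt parameters and the small projection/attention modules would be trained in such a scheme, which is consistent with the abstract's claim of tuning well under 1% of the parameters while the pretrained encoders stay frozen.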


URL

https://arxiv.org/abs/2401.10588

PDF

https://arxiv.org/pdf/2401.10588.pdf

