Videoprompter: an ensemble of foundational models for zero-shot video understanding

2023-10-23 19:45:46
Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Abstract

Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are present and their spatio-temporal interactions. These descriptive cues provide additional semantic knowledge to VLMs to enhance their zero-shot performance. Second, we propose video-specific prompts to LLMs to generate more meaningful descriptions to enrich class label representations. Specifically, we introduce prompt techniques to create a Tree Hierarchy of Categories for class names, offering a higher-level action context for additional visual cues. We demonstrate the effectiveness of our approach in video understanding across three different zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. Consistent improvements across multiple benchmarks and with various VLMs demonstrate the effectiveness of our proposed framework. Our code will be made publicly available.
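
The similarity-based classification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the video description is combined with the visual features or how LLM descriptions are merged into the class labels, so the weighted-average fusion (`fuse`, `alpha`), the mean over class texts, and all embedding values below are assumptions chosen only for illustration. In practice the vectors would come from a VLM's video and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def fuse(video_emb, description_emb, alpha=0.5):
    """Hypothetical fusion rule (an assumption, not taken from the paper):
    weighted average of the unit-length video embedding and the unit-length
    embedding of the generated video description."""
    v, d = normalize(video_emb), normalize(description_emb)
    return normalize([alpha * x + (1 - alpha) * y for x, y in zip(v, d)])

def classify(query_emb, class_embs):
    """Return the best-scoring class name and all similarity scores."""
    scores = {name: cosine(query_emb, emb) for name, emb in class_embs.items()}
    return max(scores, key=scores.get), scores

# Toy 3-d embeddings standing in for encoder outputs (made-up values).
video_emb = [0.6, 0.5, 0.1]        # visual features of the query video
description_emb = [0.9, 0.1, 0.0]  # text embedding of the video's description

# Each class is represented by its name plus one LLM-style description;
# averaging them is an assumed way to enrich the class label representation.
class_text_embs = {
    "playing guitar": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "playing violin": [[0.0, 1.0, 0.0], [0.2, 0.8, 0.1]],
}
class_embs = {
    name: mean_vector([normalize(e) for e in embs])
    for name, embs in class_text_embs.items()
}

# Standard zero-shot setting: visual features only.
baseline_label, baseline_scores = classify(video_emb, class_embs)

# Language-guided setting: visual features fused with the description.
enhanced_label, enhanced_scores = classify(
    fuse(video_emb, description_emb), class_embs
)

print(baseline_label, baseline_scores)
print(enhanced_label, enhanced_scores)
```

In this toy setup the description embedding pulls the query toward the class it describes, which widens the gap between the top two classes. Whether that helps on real data depends on the quality of the generated descriptions.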

URL

https://arxiv.org/abs/2310.15324

PDF

https://arxiv.org/pdf/2310.15324.pdf

