Paper Reading AI Learner

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

2023-10-07 20:57:54
Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

Abstract

Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: Can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by proposing a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC), and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
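To make the zero-shot adaptation concrete, below is a minimal sketch of one common way an image-text model can be applied to video action recognition without any video pretraining: encode a handful of sampled frames, mean-pool the frame embeddings into a single video embedding, and score it against text embeddings of the class names. This is an illustrative baseline, not necessarily the paper's exact protocol; the CLIP checkpoint, the prompt template, and the mean-pooling choice are all assumptions.

```python
# Sketch: zero-shot video action recognition with an image-text model (CLIP).
# Frame mean-pooling and the prompt template are illustrative assumptions,
# not the paper's confirmed method.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def classify_video(frames, class_names):
    """frames: list of PIL.Image objects sampled from the video."""
    # Encode class names as text prompts.
    text_inputs = processor(
        text=[f"a video of a person {c}" for c in class_names],
        return_tensors="pt", padding=True,
    )
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Encode each frame, then mean-pool into one video-level embedding.
    image_inputs = processor(images=frames, return_tensors="pt")
    frame_emb = model.get_image_features(**image_inputs)
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    # Cosine similarity between the video and each class prompt.
    scores = (video_emb @ text_emb.T).squeeze(0)
    return class_names[scores.argmax().item()]
```

The same frame-pooled video embedding can be reused for zero-shot retrieval and multiple choice by ranking candidate captions with the identical cosine-similarity score, which is why these tasks transfer so readily from image-text pretraining.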

URL

https://arxiv.org/abs/2310.04914

PDF

https://arxiv.org/pdf/2310.04914.pdf

