Paper Reading AI Learner

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

2023-10-07 20:57:54
Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

Abstract

Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: Can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by proposing a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC), and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
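To make the zero-shot adaptation concrete, below is a minimal sketch of one common way an image-text model can be applied to video action recognition without any video pretraining: encode a handful of sampled frames, mean-pool the frame embeddings into a single video embedding, and score it against text embeddings of the class names. This is an illustrative baseline, not necessarily the paper's exact protocol; the CLIP checkpoint, the prompt template, and the mean-pooling choice are all assumptions.

```python
# Sketch: zero-shot video action recognition with an image-text model (CLIP).
# Frame mean-pooling and the prompt template are illustrative assumptions,
# not the paper's confirmed method.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def classify_video(frames, class_names):
    """frames: list of PIL.Image objects sampled from the video."""
    # Encode class names as text prompts.
    text_inputs = processor(
        text=[f"a video of a person {c}" for c in class_names],
        return_tensors="pt", padding=True,
    )
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Encode each frame, then mean-pool into one video-level embedding.
    image_inputs = processor(images=frames, return_tensors="pt")
    frame_emb = model.get_image_features(**image_inputs)
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    # Cosine similarity between the video and each class prompt.
    scores = (video_emb @ text_emb.T).squeeze(0)
    return class_names[scores.argmax().item()]
```

The same frame-pooled video embedding can be reused for zero-shot retrieval and multiple choice by ranking candidate captions with the identical cosine-similarity score, which is why these tasks transfer so readily from image-text pretraining.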

URL

https://arxiv.org/abs/2310.04914

PDF

https://arxiv.org/pdf/2310.04914.pdf

