Abstract
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. It consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained verb phrase alignment loss. Our method achieves state-of-the-art zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering, and video classification. To the best of our knowledge, this is the first work that proposes a method to alleviate the verb understanding problem rather than merely highlighting it.
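As a rough illustration of the first component, the sketch below shows how LLM-generated hard negatives might enter a cross-modal contrastive loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, tensor shapes, and temperature value are all illustrative.

```python
# Minimal sketch: cross-modal contrastive loss with per-sample hard negatives,
# assuming video and text embeddings from a CLIP-style dual encoder.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(video_emb, pos_text_emb,
                                         neg_text_emb, temperature=0.07):
    """video_emb:    (B, D)    video embeddings
    pos_text_emb:    (B, D)    embeddings of the matching captions
    neg_text_emb:    (B, K, D) embeddings of K LLM-generated hard negatives
                     per caption (e.g. the verb swapped, nouns kept fixed)
    """
    v = F.normalize(video_emb, dim=-1)
    t_pos = F.normalize(pos_text_emb, dim=-1)
    t_neg = F.normalize(neg_text_emb, dim=-1)

    # Cosine similarity to the true caption: (B, 1)
    pos_sim = (v * t_pos).sum(dim=-1, keepdim=True)
    # Cosine similarity to each hard negative: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', v, t_neg)

    # Positive sits at index 0; cross-entropy pushes it above the negatives.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because each negative differs from the positive mainly in the verb (e.g. "a person opens a door" vs. "a person closes a door"), the model cannot minimise this loss by matching nouns alone, which is the failure mode the abstract describes.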
URL
https://arxiv.org/abs/2304.06708