Abstract
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities or actions and flipping event order, which alignment models should be robust against. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for the differences between original and contrast video captions. A generative video-language model is then finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models, exhibiting a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Furthermore, our model sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at this https URL.
URL
https://arxiv.org/abs/2311.10111