Abstract
Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and the corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representations to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge.
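The abstract describes label assignment between an event set and a text set via a cross-modal distance in a joint semantic space. As an illustration only (the paper's actual semantic-aware cost and embedding models are not specified here), the matching step can be sketched as bipartite assignment over a cosine-distance cost matrix; the function name and use of Hungarian matching are assumptions, not the authors' exact method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_events_to_sentences(event_emb: np.ndarray, text_emb: np.ndarray):
    """Illustrative sketch: assign sentences to event proposals by
    minimizing cross-modal cosine distance in a shared embedding space."""
    # L2-normalize so inner products become cosine similarities
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - e @ t.T  # (num_events, num_sentences) distance matrix
    # Hungarian algorithm finds the minimum-cost one-to-one assignment
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

In the paper's setting, this cost would be augmented with the proposed semantic-aware term so that ambiguous boundary annotations do not force sub-optimal pairs.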
URL
https://arxiv.org/abs/2303.06378