Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We address this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation and a pre-trained text encoder to encode the text corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, in which a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Since frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, validating the effectiveness of the proposed approach. Code is available at this https URL.
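As a rough illustration only (the abstract gives no implementation details, and this is not the authors' released code), the multiple-granularity objective could be sketched in PyTorch as below. All names (info_nce, sequential_pseudo_alignment, temperature, alpha) and the equal-chunk pseudo-alignment rule are assumptions made for exposition.

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i].
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def sequential_pseudo_alignment(num_frames, num_sentences):
    # Pseudo frame -> sentence labels, assuming actions occur in temporal
    # order: split the timeline into num_sentences contiguous equal chunks.
    # (One plausible rule; the paper's actual scheme may differ.)
    return torch.arange(num_frames) * num_sentences // num_frames

def multi_granularity_loss(frame_emb, sent_emb, video_emb, para_emb,
                           temperature=0.07, alpha=0.5):
    # frame_emb: (T, D) frame features; sent_emb: (S, D) sentence features;
    # video_emb / para_emb: (B, D) pooled video / paragraph features.
    # Coarse granularity: whole video vs. complete script.
    coarse = info_nce(video_emb, para_emb, temperature)
    # Fine granularity: each frame vs. its pseudo-matched sentence.
    frame_emb = F.normalize(frame_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = frame_emb @ sent_emb.t() / temperature   # (T, S)
    pseudo = sequential_pseudo_alignment(
        frame_emb.size(0), sent_emb.size(0)).to(frame_emb.device)
    fine = F.cross_entropy(logits, pseudo)
    return coarse + alpha * fine

The weighting alpha between the coarse and fine terms is a hypothetical hyperparameter; the abstract only states that both granularities are combined.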
URL
https://arxiv.org/abs/2303.12370