Multi-granularity Correspondence Learning from Long-term Noisy Videos

Abstract

Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies rarely explored because of the prohibitive computational cost of modeling long videos. One feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably runs into the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potentially faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at this https URL.
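The abstract names three ingredients: a soft-maximum over frame-word similarities, an extra "bucket" that absorbs unalignable clips and captions, and an OT assignment that realigns out-of-order pairs. The toy sketch below illustrates how such pieces could fit together. It is not the authors' implementation: the function names, the uniform marginals, the log-sum-exp form of the soft-maximum, the constant bucket value `p`, and the parameters `alpha`, `eps`, and `n_iters` are all assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_max_similarity(frame_word_sim, alpha=10.0):
    """Assumed soft-maximum: a smooth stand-in for 'best-matching word per
    frame', averaged over frames. Larger alpha approaches a hard max."""
    per_frame = [logsumexp([alpha * s for s in row]) / alpha
                 for row in frame_word_sim]
    return sum(per_frame) / len(per_frame)

def add_bucket(sim, p):
    """Assumed bucket: append one extra row and column filled with p, so
    items whose similarities all fall below p can send mass there."""
    m = len(sim[0])
    return [list(row) + [p] for row in sim] + [[p] * (m + 1)]

def sinkhorn(sim, eps=0.1, n_iters=200):
    """Entropy-regularised OT with uniform marginals (log-domain Sinkhorn).
    Takes a similarity matrix and returns a transport plan."""
    n, m = len(sim), len(sim[0])
    log_a, log_b = -math.log(n), -math.log(m)
    log_k = [[s / eps for s in row] for row in sim]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iters):
        f = [log_a - logsumexp([log_k[i][j] + g[j] for j in range(m)])
             for i in range(n)]
        g = [log_b - logsumexp([log_k[i][j] + f[i] for i in range(n)])
             for j in range(m)]
    return [[math.exp(f[i] + log_k[i][j] + g[j]) for j in range(m)]
            for i in range(n)]

# Hypothetical clip-caption similarities: clips 0 and 1 match captions 1 and 0
# (out of temporal order); clip 2 and caption 2 match nothing well.
sim = [[0.1, 0.9, 0.0],
       [0.8, 0.1, 0.0],
       [0.0, 0.1, 0.05]]
plan = sinkhorn(add_bucket(sim, p=0.3))

# Column index receiving the most mass for each real clip (index 3 = bucket).
best = [max(range(len(row)), key=row.__getitem__) for row in plan[:3]]
print(best)
```

In this toy setup the first two clips are assigned to their swapped captions, while the third clip sends most of its mass to the bucket column and can be treated as unalignable. The plan restricted to the real rows and columns could then serve as a soft alignment target in place of a strict one-to-one diagonal.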

Abstract (translated)

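Existing video-language research concentrates on short video clips and rarely explores long-term dependencies in long videos, because modeling long videos is computationally too expensive. One feasible solution is to learn the correspondence between video clips and captions, but this inevitably runs into the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC covers clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), which hinder temporal learning and video understanding. This paper proposes NOise Robust Temporal Optimal traNsport (Norton), which tackles MNC within a unified optimal transport (OT) framework. In short, Norton uses video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. For coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions with an alignable prompt bucket and realigns asynchronous clip-caption pairs according to transport distance. For fine-grained misalignment, Norton uses a soft-maximum operator to identify crucial words and key frames. In addition, Norton makes use of potentially faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, ensuring precise temporal modeling. Experiments on video retrieval, videoQA, and action segmentation confirm the effectiveness of the method. Code is available at this https URL.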

URL

https://arxiv.org/abs/2401.16702

PDF

https://arxiv.org/pdf/2401.16702.pdf
