Abstract
Video Question Answering (VideoQA) requires capturing complex visual relation changes over time and remains a challenge even for advanced Video-Language Models (VLMs), in part because the visual content must be compressed into a reasonably sized input for these models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (\textit{subject-predicate-object}) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) objective together with a Q-Former architecture to align an unordered set of video-derived queries with the corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the resulting query-based video representation outperforms global alignment-based CLS or patch-token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
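The core training idea, aligning an *unordered* set of video-derived queries with a set of text relation embeddings, can be sketched as a set-matching contrastive loss. The snippet below is a minimal, hypothetical interpretation of such a many-to-many NCE (optimal assignment between the two sets, then symmetric InfoNCE over the matched pairs); the paper's exact MM-NCE formulation, function names, and hyperparameters are not given here, so everything below is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mm_nce_loss(queries, relations, temperature=0.07):
    """Hypothetical sketch of a Many-to-Many NCE loss.

    queries:   (N_q, D) video-derived query embeddings (unordered set)
    relations: (N_r, D) text embeddings of (subject-predicate-object) triplets
    Both sets are matched by optimal assignment on cosine similarity, and a
    symmetric InfoNCE is computed with the matched partner as the positive.
    """
    # L2-normalize so dot products are cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = relations / np.linalg.norm(relations, axis=1, keepdims=True)
    sim = q @ r.T / temperature  # (N_q, N_r) similarity logits

    # Unordered sets: find the query<->relation pairing maximizing similarity
    row, col = linear_sum_assignment(-sim)

    # Query -> relation direction: log-softmax over all relations
    log_q2r = sim[row] - np.log(np.exp(sim[row]).sum(axis=1, keepdims=True))
    # Relation -> query direction: log-softmax over all queries
    log_r2q = sim[:, col].T - np.log(np.exp(sim[:, col].T).sum(axis=1, keepdims=True))

    k = np.arange(len(row))
    # Negative log-likelihood of the matched positive, averaged both ways
    return -(log_q2r[k, col].mean() + log_r2q[k, row].mean()) / 2.0
```

As a sanity check, feeding identical query and relation sets should yield a near-zero loss, while mismatched random sets should score higher, which is the behavior a set-level contrastive objective needs in order to pull corresponding pairs together.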
URL
https://arxiv.org/abs/2504.05463