Paper Reading AI Learner

REVEAL: Relation-based Video Representation Learning for Video-Question-Answering

2025-04-07 19:54:04
Sofian Chaybouti, Walid Bousselham, Moritz Wolter, Hilde Kuehne

Abstract

Video-Question-Answering (VideoQA) requires capturing complex visual relation changes over time, which remains a challenge even for advanced Video-Language Models (VLMs), in part because the visual content must be condensed into a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatio-temporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) objective, together with a Q-Former architecture, to align an unordered set of video-derived queries with the corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the resulting query-based video representation outperforms global alignment-based CLS or patch-token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
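The core idea of aligning an unordered set of video-derived queries with a set of relation-text embeddings can be sketched as a set-matching contrastive loss. The following is a minimal illustration only, not the paper's implementation: it assumes pre-computed, L2-normalized embeddings, pairs the two sets by Hungarian matching on cosine similarity, and applies an InfoNCE-style loss over the matched pairs. All function and variable names here are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mm_nce_loss(queries, relations, temperature=0.07):
    """Sketch of a many-to-many contrastive (NCE-style) set-alignment loss.

    queries:   (N, d) L2-normalized video-derived query embeddings
    relations: (M, d) L2-normalized text embeddings of relation triplets

    Matches the two unordered sets one-to-one via the Hungarian algorithm,
    then treats each query's matched relation as its positive and all other
    relations as negatives.
    """
    sim = queries @ relations.T / temperature  # (N, M) scaled similarities
    # Optimal one-to-one assignment between the sets (maximize similarity,
    # hence the minus sign for the cost-minimizing solver).
    row, col = linear_sum_assignment(-sim)
    # InfoNCE over matched pairs: log-softmax across all relations.
    logits = sim[row]                                               # (K, M)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(row)), col].mean()

# Toy usage with random normalized embeddings.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
r = rng.standard_normal((6, 8))
r /= np.linalg.norm(r, axis=1, keepdims=True)
loss = mm_nce_loss(q, r)
```

The Hungarian step is what makes the loss order-invariant: queries carry no fixed correspondence to relation descriptions, so the assignment is recomputed per sample before the contrastive term is applied.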


URL

https://arxiv.org/abs/2504.05463

PDF

https://arxiv.org/pdf/2504.05463.pdf

