OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment

2025-03-12 14:13:17
Qi Liu, Weiying Xue, Yuxiao Wang, Zhenao Wei

Abstract

The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated captions of the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
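The abstract does not say how relation features are scored against the vocabulary, so the following is only a minimal sketch of the general open-vocabulary idea: compare a relation feature with text embeddings of candidate predicates in a shared semantic space. Cosine similarity, the temperature value, the function names, and the toy three-dimensional vectors are all assumptions made for illustration; in a real system the embeddings would come from a VLM text encoder, and none of this is taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax (the temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def classify_relation(relation_feature, predicate_embeddings, temperature=0.07):
    """Score one relation feature against every candidate predicate.

    predicate_embeddings maps a predicate name to its text embedding.
    Because the vocabulary is just a dict, unseen predicates can be
    added at inference time without retraining.
    """
    names = list(predicate_embeddings)
    sims = [cosine(relation_feature, predicate_embeddings[n]) for n in names]
    probs = softmax(sims, temperature)
    return dict(zip(names, probs))

# Hypothetical toy embeddings standing in for VLM text features.
predicate_embeddings = {
    "chase": [0.9, 0.1, 0.0],
    "sit on": [0.0, 1.0, 0.1],
    "ride": [0.1, 0.2, 0.9],
}
# Hypothetical relation feature for one subject-object pair.
relation_feature = [0.8, 0.2, 0.1]

scores = classify_relation(relation_feature, predicate_embeddings)
best = max(scores, key=scores.get)
print(best, scores)
```

The candidate set is passed in as data rather than fixed in a classifier head, which is what lets such a scheme handle predicates that were not seen during training.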

URL

https://arxiv.org/abs/2503.09416

PDF

https://arxiv.org/pdf/2503.09416.pdf

