Abstract
The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from region captions that are automatically generated for the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
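To make the open-vocabulary idea concrete, the following is a minimal sketch of the generic VLM-style classification step: a relation feature is scored against text embeddings of candidate predicate prompts by cosine similarity. It is not the OpenVidVRD implementation. The function names, the temperature of 0.07, and the three-dimensional toy vectors are all assumptions made for illustration; in practice the embeddings would come from a VLM's text and visual encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(feature, text_embeddings):
    """Cosine similarity between one feature and each labelled text embedding."""
    f = l2_normalize(feature)
    return {
        label: sum(a * b for a, b in zip(f, l2_normalize(emb)))
        for label, emb in text_embeddings.items()
    }

def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax over a dict of scores (temperature is an assumed value)."""
    top = max(scores.values())
    exps = {k: math.exp((s - top) / temperature) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def classify_relation(feature, text_embeddings, temperature=0.07):
    """Return the best-matching predicate label and the full distribution."""
    probs = softmax(cosine_scores(feature, text_embeddings), temperature)
    best = max(probs, key=probs.get)
    return best, probs

# Toy stand-ins for prompt embeddings such as "a video of <subject> chase <object>".
toy_text_embeddings = {
    "ride": [1.0, 0.0, 0.0],
    "chase": [0.0, 1.0, 0.0],
    "sit on": [0.0, 0.0, 1.0],
}
# Toy stand-in for a refined subject-object relation feature.
toy_feature = [0.1, 0.9, 0.2]

label, probs = classify_relation(toy_feature, toy_text_embeddings)
print(label, round(probs[label], 3))
```

Because the label set is just a dictionary of text embeddings, a predicate unseen during training can be added by embedding one more prompt, which is what makes this style of classifier open-vocabulary.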
URL
https://arxiv.org/abs/2503.09416