Paper Reading AI Learner

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

2024-09-19 06:25:01
Yongqi Wang, Shuo Yang, Xinxiao Wu, Jiebo Luo

Abstract

Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually rely on trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models for open-vocabulary classification. This heavy dependence on pre-trained trajectory detectors limits generalization to novel object categories and degrades performance. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Within this framework, we propose a relationship-aware open-vocabulary trajectory detector that consists mainly of a query-based Transformer decoder, in which the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and an auxiliary relationship loss is designed accordingly so that the decoder explicitly perceives the relationships between objects. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for the visual representation and vision-guided language prompting for the language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. We also apply the framework to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
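The open-vocabulary classification step the abstract describes can be illustrated with a minimal sketch: a trajectory-pair visual embedding is scored against text embeddings of relationship-category prompts by cosine similarity, CLIP-style, so that any relationship expressible as text can be recognized. The encoders, relationship names, and embedding values below are toy placeholders, not the paper's actual model.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def open_vocab_classify(pair_embedding, text_embeddings, temperature=0.07):
    """Score one subject-object pair embedding against relationship
    text embeddings via cosine similarity, then softmax the logits."""
    v = l2_normalize(pair_embedding)
    t = l2_normalize(text_embeddings)
    logits = t @ v / temperature            # cosine similarity per relation
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

# Toy stand-ins for the CLIP visual and text encoders (hypothetical values):
# each row of `text_embeddings` plays the role of an encoded prompt such as
# "a photo of a person riding a horse".
relations = ["ride", "chase", "walk_past"]
text_embeddings = np.eye(3)
pair_embedding = np.array([0.9, 0.1, 0.0])  # visually closest to "ride"

probs = open_vocab_classify(pair_embedding, text_embeddings)
print(relations[int(np.argmax(probs))])  # -> ride
```

Because the category set only enters through the text embeddings, unseen relationships can be added at inference time by encoding new prompts; the paper's multi-modal prompting refines both sides of this similarity.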

URL

https://arxiv.org/abs/2409.12499

PDF

https://arxiv.org/pdf/2409.12499.pdf

