Paper Reading AI Learner

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

2024-09-19 06:25:01
Yongqi Wang, Shuo Yang, Xinxiao Wu, Jiebo Luo

Abstract

Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually rely on trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models for open-vocabulary classification. This heavy dependence on pre-trained trajectory detectors limits generalization to novel object categories and degrades performance. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Within this framework, we propose a relationship-aware open-vocabulary trajectory detector that consists mainly of a query-based Transformer decoder, in which the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and an auxiliary relationship loss is designed accordingly so that the decoder explicitly perceives the relationships between objects. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for the visual representation and vision-guided language prompting for the language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. We also apply the framework to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
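The open-vocabulary classification step the abstract describes can be illustrated with a minimal sketch: a trajectory-pair visual embedding is scored against text embeddings of relationship-category prompts by cosine similarity, CLIP-style, so that any relationship expressible as text can be recognized. The encoders, relationship names, and embedding values below are toy placeholders, not the paper's actual model.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def open_vocab_classify(pair_embedding, text_embeddings, temperature=0.07):
    """Score one subject-object pair embedding against relationship
    text embeddings via cosine similarity, then softmax the logits."""
    v = l2_normalize(pair_embedding)
    t = l2_normalize(text_embeddings)
    logits = t @ v / temperature            # cosine similarity per relation
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

# Toy stand-ins for the CLIP visual and text encoders (hypothetical values):
# each row of `text_embeddings` plays the role of an encoded prompt such as
# "a photo of a person riding a horse".
relations = ["ride", "chase", "walk_past"]
text_embeddings = np.eye(3)
pair_embedding = np.array([0.9, 0.1, 0.0])  # visually closest to "ride"

probs = open_vocab_classify(pair_embedding, text_embeddings)
print(relations[int(np.argmax(probs))])  # -> ride
```

Because the category set only enters through the text embeddings, unseen relationships can be added at inference time by encoding new prompts; the paper's multi-modal prompting refines both sides of this similarity.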

URL

https://arxiv.org/abs/2409.12499

PDF

https://arxiv.org/pdf/2409.12499.pdf

