VrdONE: One-stage Video Visual Relation Detection

2024-08-18 08:38:20
Xinjie Jiang, Chenxi Zheng, Xuemiao Xu, Bangzhen Liu, Weiying Zheng, Huaidong Zhang, Shengfeng He

Abstract

Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performance on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at this https URL.
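
To make the "1D instance segmentation" framing concrete, the sketch below shows how a per-frame binary mask for one predicted relation already encodes its temporal extent. This is an illustration only, not the authors' code: the function name `mask_to_segments`, the 0.5 threshold, and the example scores are all assumptions introduced here.

```python
from typing import List, Tuple


def mask_to_segments(frame_scores: List[float],
                     threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame spans where the score >= threshold.

    frame_scores is a hypothetical 1D mask: one score per frame for a single
    subject-object pair and a single relation category.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the currently open span
    for t, score in enumerate(frame_scores):
        active = score >= threshold
        if active and start is None:
            start = t                        # a span opens at this frame
        elif not active and start is not None:
            segments.append((start, t - 1))  # the span closed at the previous frame
            start = None
    if start is not None:                    # a span runs to the last frame
        segments.append((start, len(frame_scores) - 1))
    return segments


# Hypothetical mask scores over 8 frames for one relation instance.
scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9]
print(mask_to_segments(scores))  # [(1, 3), (6, 7)]
```

Under these assumptions, the same mask covers both a short-lived span (frames 6 to 7) and a longer one (frames 1 to 3), which is why a mask-based formulation needs no separate stage for temporal boundaries.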

URL

https://arxiv.org/abs/2408.09408

PDF

https://arxiv.org/pdf/2408.09408.pdf