Abstract
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step toward deeper insight into video scenes beyond basic visual tasks. Because of the task's complexity, traditional VidVRD methods typically split it into two parts: one that identifies which relation categories are present and another that determines their temporal boundaries. This split overlooks the inherent connection between the two. To recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup performs relation category identification and binary mask generation in a single pass, eliminating extra steps such as proposal generation or post-processing. VrdONE lets features interact across frames, capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, which enhances how subjects and objects perceive each other before their features are combined. VrdONE achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, demonstrating its ability to discern relations across different temporal scales. The code is available at this https URL.
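To make the "1D instance segmentation" framing concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a query-based design in which each relation query yields one class prediction and one per-frame binary mask over the concatenated subject-object features. The function name `predict_relations`, the `queries` and `class_weights` arrays, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def predict_relations(subj_feats, obj_feats, queries, class_weights,
                      mask_threshold=0.5):
    """Illustrative sketch only; shapes and names are assumptions.

    subj_feats, obj_feats: (T, D) per-frame features of one entity pair.
    queries:               (Q, 2*D) hypothetical relation-instance queries.
    class_weights:         (2*D, C) hypothetical linear classifier.
    Returns (labels, masks): labels is (Q,), masks is (Q, T) boolean.
    """
    # Combine subject and object features frame by frame: (T, 2*D).
    pair_feats = np.concatenate([subj_feats, obj_feats], axis=1)

    # Each query scores every frame, giving a 1D mask logit per frame: (Q, T).
    mask_logits = queries @ pair_feats.T
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))
    masks = mask_probs > mask_threshold

    # The same query also predicts the relation category: (Q, C) -> (Q,).
    class_logits = queries @ class_weights
    labels = class_logits.argmax(axis=1)
    return labels, masks
```

In this sketch the frames where a mask is true mark when the predicted relation holds, so category and temporal extent come from the same query rather than from two separate stages.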
URL
https://arxiv.org/abs/2408.09408