Abstract
Video Visual Relation Detection (VidVRD) aims to detect visual relation triplets in videos, localized by spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms according to how they classify relations. Bottom-up methods follow a clip-based approach: they classify relations between tubelet pairs in short clips and then merge them into long video-level relations. Top-down methods, in contrast, directly classify pairs of long video tubelets. While recent video-based methods built on video tubelets have shown promising results, we argue that effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key factors behind success in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches object-based spatial context and relation-based temporal context on top of clips. We demonstrate that clip tubelets can achieve performance superior to most video-based methods. Moreover, clip tubelets offer more flexibility in model design and alleviate limitations inherent to video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information when long-term tubelet features are compressed. Extensive experiments on two challenging VidVRD benchmarks validate that our HCM achieves new state-of-the-art performance, highlighting the effectiveness of advanced spatial and temporal context modeling within the clip-based paradigm.
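The bottom-up pipeline described above — classify relation triplets on short-clip tubelet pairs, then merge them into long video-level relations — can be sketched as follows. This is a minimal illustration, not the paper's method: the `ClipRelation` structure and the greedy merge-by-triplet-overlap rule are simplifying assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class ClipRelation:
    """A (subject, predicate, object) triplet detected in one short clip.
    Hypothetical structure for illustration; real detections also carry
    per-frame bounding boxes and confidence scores."""
    triplet: tuple
    start: int  # start frame of the clip-level detection
    end: int    # end frame (exclusive)

def merge_clip_relations(clip_relations):
    """Greedily merge temporally overlapping or adjacent clip-level
    detections of the same triplet into long video-level relations."""
    merged = []
    # Sort so that detections of the same triplet are consecutive in time.
    for rel in sorted(clip_relations, key=lambda r: (r.triplet, r.start)):
        last = merged[-1] if merged else None
        if last and last.triplet == rel.triplet and rel.start <= last.end:
            # Same triplet continues across adjacent clips: extend it.
            last.end = max(last.end, rel.end)
        else:
            # New relation instance begins (copy, to leave the input intact).
            merged.append(ClipRelation(rel.triplet, rel.start, rel.end))
    return merged
```

For example, two adjacent clip detections of ("dog", "chase", "cat") over frames 0–30 and 30–60 merge into one video-level relation spanning frames 0–60, while a detection with a different triplet remains separate.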
URL
https://arxiv.org/abs/2307.08984