Abstract
Video Visual Relation Detection (VidVRD) aims to detect visual relation triplets in videos, localized by spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms according to how they classify relations. Bottom-up methods follow a clip-based approach: they classify relations between tubelet pairs in short clips and then merge them into long video-level relations. Top-down methods, in contrast, directly classify pairs of long video tubelets. While recent video-based methods built on video tubelets have shown promising results, we argue that effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key factors behind success in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches object-based spatial context and relation-based temporal context on top of clips. We demonstrate that clip tubelets can achieve performance superior to most video-based methods. Moreover, clip tubelets offer more flexibility in model design and alleviate limitations inherent to video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information when long-term tubelet features are compressed. Extensive experiments on two challenging VidVRD benchmarks validate that our HCM achieves new state-of-the-art performance, highlighting the effectiveness of advanced spatial and temporal context modeling within the clip-based paradigm.
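The bottom-up pipeline described above — classify relation triplets on short-clip tubelet pairs, then merge them into long video-level relations — can be sketched as follows. This is a minimal illustration, not the paper's method: the `ClipRelation` structure and the greedy merge-by-triplet-overlap rule are simplifying assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class ClipRelation:
    """A (subject, predicate, object) triplet detected in one short clip.
    Hypothetical structure for illustration; real detections also carry
    per-frame bounding boxes and confidence scores."""
    triplet: tuple
    start: int  # start frame of the clip-level detection
    end: int    # end frame (exclusive)

def merge_clip_relations(clip_relations):
    """Greedily merge temporally overlapping or adjacent clip-level
    detections of the same triplet into long video-level relations."""
    merged = []
    # Sort so that detections of the same triplet are consecutive in time.
    for rel in sorted(clip_relations, key=lambda r: (r.triplet, r.start)):
        last = merged[-1] if merged else None
        if last and last.triplet == rel.triplet and rel.start <= last.end:
            # Same triplet continues across adjacent clips: extend it.
            last.end = max(last.end, rel.end)
        else:
            # New relation instance begins (copy, to leave the input intact).
            merged.append(ClipRelation(rel.triplet, rel.start, rel.end))
    return merged
```

For example, two adjacent clip detections of ("dog", "chase", "cat") over frames 0–30 and 30–60 merge into one video-level relation spanning frames 0–60, while a detection with a different triplet remains separate.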
URL
https://arxiv.org/abs/2307.08984