Abstract
Current video-based scene graph generation (VidSGG) methods perform poorly at predicting predicates that are under-represented due to the inherently biased distribution of the training data. In this paper, we take a closer look at the predicates and observe that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address intractable visual relation prediction from a pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method that transfers non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple yet highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
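To make the decoupling idea concrete, here is a minimal sketch of what separate pattern classifiers recombined into predicate scores could look like. Everything in it is an illustrative assumption rather than the authors' released code: the toy predicate vocabulary, the `PAT_MAP` lookup from predicates to pattern indices, the `DecoupledHead` module, and the feature dimension are all hypothetical.

```python
# A minimal sketch of the decoupled label learning idea, assuming each
# predicate factorizes into one actional and one spatial pattern.
# All names and vocabularies here are illustrative, not the paper's code.
import torch
import torch.nn as nn

# Toy vocabularies: each predicate = (actional pattern, spatial pattern).
PREDICATES = ["sit_above", "sit_beneath", "walk_above", "walk_beneath"]
ACTIONAL = ["sit", "walk"]
SPATIAL = ["above", "beneath"]
# Hypothetical map: predicate index -> (actional index, spatial index).
PAT_MAP = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

class DecoupledHead(nn.Module):
    """Two separate classifiers over the (less imbalanced) pattern labels."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.actional_cls = nn.Linear(feat_dim, len(ACTIONAL))
        self.spatial_cls = nn.Linear(feat_dim, len(SPATIAL))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        act_logits = self.actional_cls(feats)   # (B, |actional|)
        spa_logits = self.spatial_cls(feats)    # (B, |spatial|)
        # Recombine: score each predicate by summing the logits of its two
        # patterns, i.e. map the pattern predictions back to predicates.
        # (During training, each classifier can instead be supervised with
        # cross-entropy against its own decoupled pattern label.)
        pred_logits = torch.stack(
            [act_logits[:, a] + spa_logits[:, s] for a, s in PAT_MAP.values()],
            dim=1,
        )                                       # (B, |predicates|)
        return pred_logits

head = DecoupledHead(feat_dim=512)
feats = torch.randn(8, 512)                     # e.g. pooled subject-object features
print(head(feats).shape)                        # torch.Size([8, 4])
```

The appeal of this factorization is that each pattern classifier sees far more balanced label frequencies than a monolithic predicate classifier, since many head and tail predicates share the same actional or spatial pattern; the knowledge-level label decoupling described above builds on this sharing to transfer non-target knowledge within a pattern, but its exact formulation is given in the paper itself.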
Abstract (translated)
Current video-based scene graph generation (VidSGG) methods perform poorly at predicting predicates that are under-represented due to the inherently biased distribution of the training data. In this paper, we take a closer look at the predicates and find that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm that addresses intractable visual relation prediction from a pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively; the patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method that transfers non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple yet highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
URL
https://arxiv.org/abs/2303.13209