Abstract
Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across scenes that are prone to temporal fluctuations, but also modeling of the temporal motions and interactions among different objects. Moreover, the long-tailed distribution of visual relationships is a crucial bottleneck for most dynamic SGG methods, since most of them focus on capturing spatio-temporal context with complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs flow-based feature warping to detect temporally consistent objects across frames. In addition, it uses correlation debiasing to learn unbiased relation representations for long-tailed classes. Moreover, to attenuate predictive uncertainty, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss that incorporates label correlations to identify commonly co-occurring relations and to help debias the long-tailed ones. Extensive experimental evaluation shows performance gains as high as 4.1%, demonstrating the superiority of FloCoDe in generating more unbiased scene graphs.
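As a rough illustration of the flow-based feature warping the abstract mentions for keeping object features temporally consistent, the sketch below warps a per-frame feature map with a backward optical-flow field. This is a minimal sketch under assumed conventions: the function name `warp_features`, the nearest-neighbour sampling, and the NumPy setting are illustrative assumptions, not the paper's actual implementation (which presumably operates on deep feature maps with bilinear sampling).

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map with a backward flow field (illustrative sketch).

    feat: (H, W, C) feature map from the source frame.
    flow: (H, W, 2) backward flow; flow[..., 0] is the x-offset,
          flow[..., 1] the y-offset, mapping target pixels to source pixels.

    Each target pixel (y, x) samples the source feature at
    (y + flow_y, x + flow_x), rounded to the nearest neighbour and
    clamped to the image borders.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    # Advanced indexing gathers one source feature vector per target pixel.
    return feat[src_y, src_x]
```

With a zero flow field the warp is the identity; a constant flow shifts the feature map, which is the basic mechanism for aligning features of the same object across adjacent frames before checking their consistency.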
URL
https://arxiv.org/abs/2310.16073