Abstract
Dynamic scene graph generation (SGG) from videos is challenging due to the inherent dynamics of a scene, temporal fluctuations in model predictions, and the long-tailed distribution of visual relationships, on top of the challenges already present in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context with complex architectures, without addressing the challenges above, especially the long-tailed distribution of relationships. This often leads to biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA enforces object-level temporal consistency via transformer-based sequence modeling, learns to synthesize unbiased relationship representations with memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant performance gains (up to 10% in some cases) over existing methods, highlighting its superiority in generating less biased scene graphs.
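The GMM-based uncertainty attenuation mentioned above can be illustrated with a minimal sketch. This follows the general mixture-density recipe (per-class predictive uncertainty decomposed over mixture components, then used to down-weight noisy relation labels in the loss); the function names and the exact attenuation formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gmm_uncertainty(pi, mu, sigma2):
    """Decompose the predictive uncertainty of a K-component GMM head.

    pi:     (K,)   mixture weights (sum to 1)
    mu:     (K, C) per-component means over C relation classes
    sigma2: (K, C) per-component variances

    Returns the predictive mean, aleatoric uncertainty (expected
    within-component variance, i.e. label noise), and epistemic
    uncertainty (spread of the component means, i.e. model disagreement).
    """
    mean = (pi[:, None] * mu).sum(axis=0)                      # (C,)
    aleatoric = (pi[:, None] * sigma2).sum(axis=0)             # (C,)
    epistemic = (pi[:, None] * (mu - mean) ** 2).sum(axis=0)   # (C,)
    return mean, aleatoric, epistemic

def attenuated_loss(per_class_loss, aleatoric, eps=1e-8):
    """Illustrative attenuation: down-weight classes whose labels are
    noisy (high aleatoric uncertainty), with a log penalty so the model
    cannot inflate the uncertainty to zero out the loss."""
    a = aleatoric + eps
    return float((per_class_loss / a + np.log(a)).mean())
```

With a single component (K=1) the epistemic term collapses to zero, matching the intuition that mixture spread measures model disagreement rather than data noise.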
URL
https://arxiv.org/abs/2304.00733