Abstract
Scene graph generation aims to detect visual relationship triplets (subject, predicate, object). Due to biases in the data, current models tend to predict common predicates, e.g., "on" and "at", instead of informative ones, e.g., "standing on" and "looking at". This tendency loses precise information and degrades overall performance: if a model describes an image with "stone on road" rather than "stone blocking road", the scene may be gravely misunderstood. We argue that this phenomenon is caused by two imbalances: imbalance at the semantic-space level and imbalance at the training-sample level. To address them, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components targeting these imbalances: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL). SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity-removal strategy to focus on informative predicates. Because the debiasing process is model-agnostic, our method can be easily applied to existing SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 across three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and on two downstream tasks (sentence-to-graph retrieval and image captioning).
URL
https://arxiv.org/abs/2308.05286