Abstract
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
URL
https://arxiv.org/abs/2505.20106