Paper Reading AI Learner

From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

2025-05-26 15:11:23
Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

Abstract

We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
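The core open-vocabulary mechanism the abstract describes — scoring visual features against a frozen text encoder's embeddings rather than a fixed classifier head — can be illustrated with a minimal sketch. This is a generic CLIP-style similarity head, not the paper's exact implementation; the function name and dimensions are hypothetical, and the text embeddings stand in for the frozen text encoder's outputs.

```python
import numpy as np

def open_vocab_classify(features, text_embeddings, temperature=0.07):
    """Score visual features against per-category text embeddings.

    features:        (N, D) decoder outputs for nodes or edges
    text_embeddings: (C, D) frozen text-encoder embeddings, one per category
    Returns (N, C) cosine-similarity logits. Novel categories are handled
    by appending rows to `text_embeddings` -- no classifier retraining.
    """
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    return (f @ t.T) / temperature

# toy example: 2 relation features scored against 3 predicate embeddings
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 64))
predicates = rng.normal(size=(3, 64))
logits = open_vocab_classify(feats, predicates)
print(logits.shape)  # (2, 3)
```

Because the category set lives entirely in the text-embedding matrix, the same head classifies both nodes (objects) and edges (predicates), and extending the vocabulary at inference time only adds rows to that matrix.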

URL

https://arxiv.org/abs/2505.20106

PDF

https://arxiv.org/pdf/2505.20106.pdf
