From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

2024-04-01 04:21:01
Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He

Abstract

Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advances, existing methods struggle to generate scene graphs containing novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLMs) through an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with a VLM and then construct scene graphs from these sequences. This lets us harness the strong capabilities of VLMs for open-vocabulary SGG and seamlessly integrate explicit relational modeling to enhance vision-language (VL) tasks. Experimental results demonstrate that our design not only achieves superior performance in the open-vocabulary setting but also improves downstream VL task performance through explicit relation modeling knowledge.
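
The second step the abstract describes, constructing a scene graph from a generated text sequence, can be illustrated with a minimal sketch. The abstract does not specify the sequence format, so the `subject | predicate | object ; ...` layout, the `SceneGraph` class, and the `parse_sequence` function below are all hypothetical and are not the authors' implementation. The sketch also merges entities that share a label into one node, whereas a real SGG system would typically keep separate, grounded instances.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: entity labels plus directed, labeled edges."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (subject_idx, predicate, object_idx)


def parse_sequence(sequence: str) -> SceneGraph:
    """Parse an assumed 'subj | pred | obj ; ...' string into a SceneGraph."""
    graph = SceneGraph()
    index = {}  # entity label -> node index

    def node_id(label: str) -> int:
        # Simplification: identical labels map to the same node.
        if label not in index:
            index[label] = len(graph.nodes)
            graph.nodes.append(label)
        return index[label]

    for chunk in sequence.split(";"):
        parts = [p.strip() for p in chunk.split("|")]
        # Generated text can be malformed, so skip incomplete triplets.
        if len(parts) != 3 or not all(parts):
            continue
        subj, pred, obj = parts
        graph.edges.append((node_id(subj), pred, node_id(obj)))
    return graph


# Stand-in for text that a VLM might generate for an image.
generated = "man | riding | horse ; horse | standing on | grass ; man | wearing"
graph = parse_sequence(generated)
print(graph.nodes)
print(graph.edges)
```

Because predicates are kept as free-form strings rather than indices into a fixed label set, a parser of this kind places no restriction on the relation vocabulary, which is the property an open-vocabulary setting relies on.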

URL

https://arxiv.org/abs/2404.00906

PDF

https://arxiv.org/pdf/2404.00906.pdf

