DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

2025-06-18 16:00:19
Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li

Abstract

Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: this https URL
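
The draft-then-refine idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the edit vocabulary ("add" / "delete"), the stopping rule, and the names `apply_edits`, `refine`, `toy_draft`, and `toy_edit` are all assumptions made for this example. The two toy functions stand in for the two small PLMs and simply hard-code one cross-sentence coreference fix.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)
Edit = Tuple[str, Triple]       # ("add" | "delete", triple); assumed edit format


def apply_edits(graph: Set[Triple], edits: List[Edit]) -> Set[Triple]:
    """Return a copy of the graph with the proposed edits applied."""
    refined = set(graph)
    for op, triple in edits:
        if op == "add":
            refined.add(triple)
        elif op == "delete":
            refined.discard(triple)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return refined


def refine(
    caption: str,
    draft_fn: Callable[[str], Set[Triple]],
    edit_fn: Callable[[str, Set[Triple]], List[Edit]],
    max_rounds: int = 3,
) -> Set[Triple]:
    """Draft a base graph once, then apply proposed edits for a few rounds."""
    graph = set(draft_fn(caption))
    for _ in range(max_rounds):
        new_graph = apply_edits(graph, edit_fn(caption, graph))
        if new_graph == graph:  # nothing changed, so stop early
            break
        graph = new_graph
    return graph


# Hypothetical stand-ins for the two models (hard-coded, for illustration only).
def toy_draft(caption: str) -> Set[Triple]:
    # A sentence-by-sentence parse leaves the pronoun "it" unresolved.
    return {("cat", "is", "black"), ("it", "sit on", "mat")}


def toy_edit(caption: str, graph: Set[Triple]) -> List[Edit]:
    # Propose replacing every "it" subject with its antecedent "cat".
    edits: List[Edit] = []
    for s, r, o in sorted(graph):
        if s == "it":
            edits.append(("delete", (s, r, o)))
            edits.append(("add", ("cat", r, o)))
    return edits


CAPTION = "A black cat is in the room. It sits on a mat."
result = refine(CAPTION, toy_draft, toy_edit)
```

In this toy run the first round merges the fragment rooted at "it" into the "cat" node, and the second round proposes no edits, so the loop stops. Proposing small edits instead of regenerating the whole graph is the property the abstract credits for the reduced generation overhead.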

URL

https://arxiv.org/abs/2506.15583

PDF

https://arxiv.org/pdf/2506.15583.pdf

