Paper Reading AI Learner

Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

2025-04-20 14:50:49
Lin Li, Wei Chen, Jiahui Li, Long Chen

Abstract

Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but these models remain limited in visual relation understanding (e.g., scene graph generation), particularly in modeling N-ary relationships that identify multiple semantic roles within an action event. This lack of semantic-dependency modeling among multiple entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with explicit thinking processes. GRPO then refines these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases and thereby improving generalization. Extensive experiments on the widely used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
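The GRPO stage described above scores a group of sampled responses with multiple rewards and normalizes each reward against the group, so no learned value critic is needed. A minimal sketch of that group-relative normalization, assuming an illustrative two-part reward (a structured-output format check plus a visual-grounding score; the reward names and weights are hypothetical, not the paper's implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward
    by the mean/std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def combined_reward(format_ok, grounding_score, weights=(0.5, 0.5)):
    """Hypothetical multi-reward mix: a binary format reward for the
    structured <think>/answer output plus a grounding score in [0, 1]."""
    return weights[0] * float(format_ok) + weights[1] * grounding_score

# Example: four responses sampled for one image-question pair.
rewards = [combined_reward(True, 0.8),
           combined_reward(True, 0.4),
           combined_reward(False, 0.6),
           combined_reward(True, 0.9)]
advs = group_relative_advantages(rewards)
```

Responses scoring above the group mean get positive advantages and are reinforced; by construction, the advantages of a group sum to zero.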

URL

https://arxiv.org/abs/2504.14642

PDF

https://arxiv.org/pdf/2504.14642.pdf
