MMhops-R1: Multimodal Multi-hop Reasoning

2025-12-15 17:29:02
Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu

Abstract

The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework uses reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 generalizes well to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.

URL

https://arxiv.org/abs/2512.13573

PDF

https://arxiv.org/pdf/2512.13573.pdf

