Abstract
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as current benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework uses reinforcement learning to train the model to autonomously plan reasoning paths, formulate targeted queries, and synthesize multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 generalizes well to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
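To make the idea of a dynamically planned reasoning chain concrete, the following is a minimal toy sketch of a bridging-style multi-hop loop. It is not the MMhops-R1 implementation: the knowledge base, the entity names ("Tower A", "Engineer B", "City C"), and the functions `retrieve`, `plan_next_query`, and `answer_multi_hop` are all hypothetical, and a hand-written rule stands in for the learned planning policy that the abstract describes as trained with reinforcement learning.

```python
from typing import Dict, List, Optional, Tuple

# Toy knowledge base standing in for an external retriever.
# All entries are made-up placeholders, not data from MMhops.
KNOWLEDGE: Dict[str, str] = {
    "landmark in image": "Tower A",
    "designer of Tower A": "Engineer B",
    "birthplace of Engineer B": "City C",
}


def retrieve(query: str) -> Optional[str]:
    """Hypothetical retriever: exact-match lookup in the toy knowledge base."""
    return KNOWLEDGE.get(query)


def plan_next_query(relations: List[str], evidence: List[str]) -> Optional[str]:
    """Stand-in for a learned planner: return the next query, or None to stop.

    Each hop builds its query from the entity found in the previous hop,
    which is what makes this a bridging chain.
    """
    hop = len(evidence)
    if hop >= len(relations):
        return None  # every relation has been resolved, so stop
    if hop == 0:
        return relations[0]  # first hop grounds the question in the image
    return f"{relations[hop]} of {evidence[-1]}"


def answer_multi_hop(
    relations: List[str], max_hops: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Run the plan/retrieve loop and return (answer, issued queries)."""
    evidence: List[str] = []
    trace: List[str] = []
    for _ in range(max_hops):
        query = plan_next_query(relations, evidence)
        if query is None:
            break
        trace.append(query)
        result = retrieve(query)
        if result is None:
            return None, trace  # the chain is broken, so give no answer
        evidence.append(result)
    return (evidence[-1] if evidence else None), trace


if __name__ == "__main__":
    final, trace = answer_multi_hop(["landmark in image", "designer", "birthplace"])
    print(final)
    print(trace)
```

The number of hops is not fixed in advance: the loop continues until the planner returns `None` or retrieval fails, which mirrors, in a very simplified way, the contrast the abstract draws between dynamic planning and fixed-hop reasoning.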
URL
https://arxiv.org/abs/2512.13573