Abstract
Deep reasoning is fundamental to solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and they lack a framework for assessing reasoning quality across the multiple steps required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and point to future research directions for vision-centric agentic reasoning models. Our data and code are publicly available at this https URL
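The step-level evaluation idea described above can be illustrated with a minimal sketch. Note that the field names, scoring scale, and success criterion below are illustrative assumptions for exposition, not the benchmark's actual schema or metric definitions.

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    """Hypothetical per-step scores; the real framework's fields may differ."""
    correctness: float         # factual correctness of the reasoning step (0-1)
    coherence: float           # logical consistency with prior steps (0-1)
    tool_effectiveness: float  # whether the tool call advanced the task (0-1)

def full_chain_success(steps: list[StepScore], threshold: float = 1.0) -> bool:
    """A task counts as a full-chain success only if every step meets the threshold."""
    return all(
        s.correctness >= threshold
        and s.coherence >= threshold
        and s.tool_effectiveness >= threshold
        for s in steps
    )

# One weak step fails the entire chain, which is why full-chain success
# is a much stricter metric than per-step accuracy.
steps = [StepScore(1.0, 1.0, 1.0), StepScore(1.0, 0.5, 1.0)]
print(full_chain_success(steps))  # False
```

This strict all-steps criterion is what makes the reported sub-50% full-chain success rates notable: a model can score well on individual steps yet still fail most chains.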
URL
https://arxiv.org/abs/2505.24876