Paper Reading AI Learner

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

2025-05-30 17:59:53
Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

Abstract

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework for assessing reasoning quality over multiple steps, as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep-reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step, as well as the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions for vision-centric agentic reasoning models. Our data and code are publicly available at this https URL
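The abstract does not spell out how the step-level metrics are aggregated, but the strict "full-chain success" notion (a trajectory only counts if every reasoning step is correct) can be illustrated with a minimal sketch. The `Step`/`Trajectory` schema and function names below are hypothetical, not the paper's actual evaluation code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One reasoning step in an agent trajectory (hypothetical schema)."""
    correct: bool   # step-level correctness judgment
    tool_ok: bool   # whether tool use at this step was effective

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

def step_accuracy(trajs: List[Trajectory]) -> float:
    """Fraction of individual steps judged correct across all trajectories."""
    steps = [s for t in trajs for s in t.steps]
    return sum(s.correct for s in steps) / len(steps) if steps else 0.0

def full_chain_success(trajs: List[Trajectory]) -> float:
    """Fraction of trajectories in which *every* step is correct —
    the strict regime under which top models score below 50%."""
    if not trajs:
        return 0.0
    return sum(all(s.correct for s in t.steps) for t in trajs) / len(trajs)

# Toy example: one fully correct trajectory, one with a single failed step.
trajs = [
    Trajectory([Step(True, True), Step(True, True)]),
    Trajectory([Step(True, True), Step(False, True)]),
]
print(full_chain_success(trajs))  # 0.5
print(step_accuracy(trajs))       # 0.75
```

This makes the gap between per-step and full-chain scores concrete: a model can be right at most steps yet still fail the chain-level criterion whenever any single step breaks.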


URL

https://arxiv.org/abs/2505.24876

PDF

https://arxiv.org/pdf/2505.24876.pdf

