ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

2025-10-06 06:05:36
Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

URL

https://arxiv.org/abs/2510.04514

PDF

https://arxiv.org/pdf/2510.04514.pdf

