Abstract
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, which require precise visual interpretation and offer no textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
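To make the decompose-then-act loop described above concrete, here is a minimal toy sketch in Python. It is not ChartAgent's implementation: the tool names (`isolate_bar`, `measure_bar_top`, `localize_axis`, `read_value`), the state dictionary, and the fixed `plan` are all hypothetical, the "chart" is a tiny grid of 0/1 pixels rather than a real image, and the hard-coded plan stands in for the step-by-step decisions a multimodal LLM would make.

```python
from typing import Callable, Dict, List, Tuple

# Toy chart: 1 = bar ink, 0 = background (hypothetical stand-in for image pixels).
Image = List[List[int]]


def isolate_bar(state: dict, col_start: int, col_stop: int) -> None:
    """Crop the pixel columns that contain a single bar."""
    state["bar"] = [row[col_start:col_stop] for row in state["image"]]


def measure_bar_top(state: dict) -> None:
    """Record the first row, scanning from the top, that contains bar ink."""
    for r, row in enumerate(state["bar"]):
        if any(row):
            state["top_row"] = r
            return
    raise ValueError("no bar ink found in the cropped region")


def localize_axis(state: dict, row_zero: int, row_max: int, value_max: float) -> None:
    """Store the axis calibration; assumed given here, not detected from pixels."""
    state["axis"] = (row_zero, row_max, value_max)


def read_value(state: dict) -> None:
    """Map the bar's top row to a data value by linear interpolation along the axis."""
    row_zero, row_max, value_max = state["axis"]
    state["answer"] = (row_zero - state["top_row"]) * value_max / (row_zero - row_max)


# Hypothetical tool library: each visual subtask is fulfilled by one tool.
TOOLS: Dict[str, Callable[..., None]] = {
    "isolate_bar": isolate_bar,
    "measure_bar_top": measure_bar_top,
    "localize_axis": localize_axis,
    "read_value": read_value,
}


def run_agent(image: Image, plan: List[Tuple[str, dict]]) -> float:
    """Execute the subtasks in order, threading intermediate results through state."""
    state: dict = {"image": image}
    for tool_name, kwargs in plan:
        TOOLS[tool_name](state, **kwargs)
    return state["answer"]


chart: Image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],  # baseline row (value 0)
]

# Fixed plan standing in for the LLM's decomposition of "what is the bar's value?".
plan = [
    ("isolate_bar", {"col_start": 1, "col_stop": 3}),
    ("measure_bar_top", {}),
    ("localize_axis", {"row_zero": 5, "row_max": 0, "value_max": 50.0}),
    ("read_value", {}),
]

answer = run_agent(chart, plan)
print(answer)
```

In this toy chart the bar's top sits three rows above the baseline on a five-row axis spanning 0 to 50, so the sketch reads off a value of 30 without relying on any printed data label, which is the situation an unannotated chart presents.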
URL
https://arxiv.org/abs/2510.04514