Abstract
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across the various structures and texts within an image, forming a reasoning sequence that arrives at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual edits on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools to modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and the reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and show that such a visual chain of thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
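To make the described editing actions concrete, below is a minimal sketch of the kind of image-editing tools such generated Python code might call. The function names (`draw_box`, `highlight`, `mask_out`) and their signatures are illustrative assumptions, not the paper's actual API; Pillow is used here as a stand-in image library.

```python
# Illustrative sketch: three visual-editing tools of the kind ReFocus-style
# generated code could invoke. Names and signatures are assumptions.
from PIL import Image, ImageDraw


def draw_box(img, bbox, color="red", width=3):
    """Draw a rectangle outline around a region to direct attention to it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(bbox, outline=color, width=width)
    return out


def highlight(img, bbox, color=(255, 255, 0), alpha=80):
    """Overlay a semi-transparent colored rectangle on a region."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(bbox, fill=color + (alpha,))
    return Image.alpha_composite(out, overlay).convert("RGB")


def mask_out(img, bbox, color="white"):
    """Cover an irrelevant region entirely so the model ignores it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(bbox, fill=color)
    return out


# Example: refocus on the left half of a synthetic image by masking the
# right half and boxing the remaining region of interest.
img = Image.new("RGB", (200, 100), "lightgray")
img = mask_out(img, (100, 0, 199, 99))  # hide the right half
img = draw_box(img, (0, 0, 99, 49))     # box the remaining region
```

In a ReFocus-style loop, the model would emit a short sequence of such calls, re-inspect the edited image, and repeat until it can answer.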
URL
https://arxiv.org/abs/2501.05452