Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text with intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 improves reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
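As a rough illustration of the training setup the abstract describes, the following is a minimal Python sketch of an interleaved rollout (text steps mixed with visual tool calls) scored by an outcome-based reward. Every name here (the `model` interface, `edit_image`, the step budget) is a hypothetical stand-in for exposition, not VTool-R1's actual implementation.

```python
# Minimal sketch of a VTool-R1-style interleaved rollout with an
# outcome-based reward. All names (model interface, edit_image) are
# hypothetical illustrations, not the paper's actual API.

from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One multimodal chain of thought: interleaved text and images."""
    steps: list = field(default_factory=list)  # ("text", str) or ("image", obj)
    answer: str = ""


def edit_image(image, code: str):
    """Placeholder for a Python-based visual editing tool (e.g. cropping
    or highlighting a chart region). Here it merely tags the image."""
    return f"{image}|edited:{code}"


def rollout(model, question: str, image) -> Rollout:
    """Generate a response, letting the model emit optional tool calls.

    The model decides *when* to invoke the visual tool; each edited image
    is fed back into the context as an intermediate visual reasoning step.
    """
    r = Rollout()
    context = [("image", image), ("text", question)]
    for _ in range(4):                          # small fixed step budget
        kind, content = model(context)          # hypothetical model interface
        if kind == "tool":                      # model chose a visual step
            new_image = edit_image(image, content)
            r.steps.append(("image", new_image))
            context.append(("image", new_image))
        elif kind == "text":
            r.steps.append(("text", content))
            context.append(("text", content))
        else:                                   # kind == "answer"
            r.answer = content
            break
    return r


def outcome_reward(r: Rollout, gold: str) -> float:
    """Outcome-based reward: 1 if the final answer matches the label,
    0 otherwise. No per-step (process-based) supervision is used."""
    return 1.0 if r.answer.strip() == gold.strip() else 0.0


def toy_model(context):
    """Trivial policy for demonstration: take one visual step, then answer."""
    if not any(k == "image" and "edited" in str(v) for k, v in context):
        return ("tool", "crop(region='bar_3')")
    return ("answer", "42")


r = rollout(toy_model, "What is the value of the third bar?", "chart.png")
print(outcome_reward(r, "42"))  # -> 1.0
```

Because the reward depends only on final-answer accuracy, the RFT policy gradient credits a tool call only insofar as it improves the outcome, which is how, per the abstract, the model learns when a visual reasoning step is worth taking.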
URL
https://arxiv.org/abs/2505.19255