Paper Reading AI Learner

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

2025-05-25 18:23:39
Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

Abstract

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
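
The abstract describes two mechanisms: a rollout that interleaves text steps with Python-based image edits, and a reward tied only to final task accuracy. The sketch below is a minimal, hypothetical rendering of that loop, not the released VTool-R1 code; `vlm_generate` and the fields on its `step` result are assumptions, and Pillow cropping stands in for whatever editing tools the framework actually exposes.

```python
# Minimal sketch of the interleaved rollout and outcome-based reward the
# abstract describes. NOT the released VTool-R1 implementation:
# `vlm_generate` and the fields on its `step` result are hypothetical
# stand-ins, and Pillow cropping stands in for the paper's editing tools.

from PIL import Image


def crop_region(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """One example Python-based visual editing tool: crop a chart/table region.

    `box` is Pillow's (left, upper, right, lower) pixel rectangle.
    """
    return image.crop(box)


def rollout(vlm_generate, question: str, image: Image.Image, max_turns: int = 4):
    """Interleave text steps and intermediate visual steps until a final answer."""
    context = [question, image]
    for _ in range(max_turns):
        step = vlm_generate(context)  # hypothetical API: text step, optional tool call
        context.append(step.text)
        if step.tool_call is not None:  # e.g. {"tool": "crop", "box": (x0, y0, x1, y1)}
            edited = crop_region(image, step.tool_call["box"])
            context.append(edited)  # edited image becomes an intermediate reasoning step
        if step.final_answer is not None:
            return step.final_answer, context
    return None, context


def outcome_reward(predicted: str | None, gold: str) -> float:
    """Outcome-based reward: depends only on final accuracy, no process supervision."""
    return 1.0 if predicted is not None and predicted.strip() == gold.strip() else 0.0
```

Because the reward depends only on the final answer, RFT over rollouts like this can only reinforce tool calls that actually improve accuracy, which is consistent with the abstract's claim that strategic visual tool use emerges without process-based supervision.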

URL

https://arxiv.org/abs/2505.19255

PDF

https://arxiv.org/pdf/2505.19255.pdf

