Paper Reading AI Learner

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

2025-01-09 18:59:58
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang

Abstract

Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and of the reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and show that such a visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
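
To make the editing step concrete, below is a minimal sketch of the kind of Python image-editing tools such a framework could expose to a multimodal LLM: drawing a box, highlighting a section, and masking out an area. The function names, signatures, and the Pillow-based implementation are illustrative assumptions, not the paper's actual tool API.

```python
from PIL import Image, ImageDraw

# Illustrative sketch (not the paper's API) of tools a multimodal LLM could
# call via generated Python code to edit the input image and refocus attention.

def draw_box(image, box, color="red", width=3):
    """Draw a rectangle around a region (x0, y0, x1, y1) to direct attention to it."""
    img = image.copy()
    ImageDraw.Draw(img).rectangle(box, outline=color, width=width)
    return img

def highlight_section(image, box, color=(255, 255, 0), alpha=80):
    """Overlay a translucent color on a region, e.g. a table column or chart bar."""
    img = image.convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color + (alpha,))
    return Image.alpha_composite(img, overlay).convert("RGB")

def mask_out(image, box, color="white"):
    """Cover an irrelevant region so the model refocuses on the remaining content."""
    img = image.copy()
    ImageDraw.Draw(img).rectangle(box, fill=color)
    return img

# Hypothetical edit sequence for a table question: hide distractor rows,
# emphasize the queried column, then box the candidate answer cell.
if __name__ == "__main__":
    img = Image.open("table.png")                      # hypothetical input image
    img = mask_out(img, (0, 300, img.width, 600))      # hide irrelevant rows
    img = highlight_section(img, (120, 0, 240, 300))   # emphasize target column
    img = draw_box(img, (120, 40, 240, 80))            # box the candidate answer cell
    img.save("table_refocused.png")
```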

URL

https://arxiv.org/abs/2501.05452

PDF

https://arxiv.org/pdf/2501.05452.pdf
