Abstract
Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. Existing approaches process diagrams as static images and therefore lack the capacity for dynamic manipulation, a core aspect of human geometric reasoning that involves auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolically curated trajectories, followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
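The perception-reasoning-action loop described in the abstract can be illustrated with a toy sketch. This is not GeoSketch's implementation: the `Diagram` class, the tuple-based logic forms, the `perceive`/`reason`/`act`/`solve` functions, and the single "draw a segment" action are all hypothetical names chosen for illustration, and the "reasoning" step is a hand-written rule rather than a theorem-applying model.

```python
from dataclasses import dataclass, field


@dataclass
class Diagram:
    """Toy stand-in for a geometry diagram (hypothetical, not from the paper)."""
    points: set = field(default_factory=set)
    segments: set = field(default_factory=set)  # pairs of point names, sorted


def perceive(diagram):
    """Perception: abstract the diagram into logic-form facts (plain tuples)."""
    facts = {("Point", p) for p in diagram.points}
    facts |= {("Segment", a, b) for a, b in diagram.segments}
    return facts


def reason(facts, goal):
    """Symbolic reasoning: decide the next step from the current facts.

    Returns ("done", None) when the goal fact holds, a sketch action when
    one applies, and ("stuck", None) otherwise.
    """
    kind, *names = goal
    if kind == "Segment":
        a, b = sorted(names)
        if ("Segment", a, b) in facts:
            return ("done", None)
        if ("Point", a) in facts and ("Point", b) in facts:
            return ("draw_segment", (a, b))
    return ("stuck", None)


def act(diagram, action, args):
    """Sketch action: update the diagram so the next perception sees the change."""
    if action == "draw_segment":
        diagram.segments.add(tuple(sorted(args)))
    return diagram


def solve(diagram, goal, max_steps=5):
    """Run the closed loop and return the sequence of decisions taken."""
    trace = []
    for _ in range(max_steps):
        facts = perceive(diagram)
        action, args = reason(facts, goal)
        trace.append(action)
        if action in ("done", "stuck"):
            break
        diagram = act(diagram, action, args)
    return trace


if __name__ == "__main__":
    triangle = Diagram(points={"A", "B", "C"},
                       segments={("A", "B"), ("B", "C")})
    print(solve(triangle, ("Segment", "C", "A")))
```

The point of the sketch is the control flow: each action changes the diagram, and the next iteration re-perceives the updated state before reasoning again, which is what distinguishes a closed loop from a single pass over a static image.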
Abstract (translated)
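Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs). It requires a model not only to interpret text and diagram information jointly, but also to carry out iterative visuospatial reasoning. Existing methods typically treat diagrams as static images, and so lack the capacity for dynamic manipulation, which is a core aspect of human geometric reasoning and includes auxiliary line construction and affine transformations.

This paper presents GeoSketch, a new framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates three key components:

1. **Perception module**: abstracts diagrams into structured logic forms.
2. **Symbolic Reasoning module**: applies geometric theorems to decide the next deductive step.
3. **Sketch Action module**: executes concrete operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop.

To train this agent, we develop a two-stage pipeline:

- Supervised fine-tuning: supervised learning on 2,000 symbolically curated trajectories.
- Reinforcement learning: dense, symbolic rewards that strengthen the model's robustness and strategic exploration.

To evaluate the approach, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems that require auxiliary construction or affine transformations. Experiments on strong MLLM baselines show that GeoSketch significantly improves both stepwise reasoning accuracy and problem-solving success.

By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, providing a solid foundation for solving complex visuospatial problems.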
URL
https://arxiv.org/abs/2509.22460