Abstract
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance in such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final-answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning accuracy from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld publicly to support future work on building more general, open-ended, and creative reasoning systems.
Abstract (translated)
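Puzzlehunts are a genre of complex, multi-step puzzles that lack well-defined problem statements. Unlike the tasks in conventional reasoning benchmarks, which come with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and to solve the problem through iterative reasoning, a challenge that resembles real-world settings such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance in such open-ended settings has not been adequately tested. This paper introduces PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle comes with its final solution, detailed reasoning traces, and cognitive skill labels, which makes both holistic benchmarking and fine-grained diagnostic analysis possible. Most current state-of-the-art models reach only 1-2% final-answer accuracy; the best model solves only 14% of the puzzles and attains 40% stepwise accuracy. We demonstrate the value of the reasoning annotations: fine-tuning a small model on reasoning traces raises stepwise reasoning accuracy from 4% to 11%, whereas training on final answers alone drives performance down to nearly zero. Our error analysis shows that current models reason myopically and are constrained by the limits of language-based inference. They also lack the sketching capabilities that are crucial for visual and spatial reasoning. We have released PuzzleWorld publicly at [this link](https://puzzleworld.org) to support future work on building more general, open-ended, and creative reasoning systems.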
URL
https://arxiv.org/abs/2506.06211