PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

2025-06-06 16:17:09
Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

Abstract

Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at this https URL to support future work on building more general, open-ended, and creative reasoning systems.
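The abstract reports two metrics, final answer accuracy and stepwise accuracy, without defining them here. The sketch below shows one plausible way to compute both, to illustrate why a model can score 40% stepwise while solving only 14% of puzzles. It is not the authors' evaluation code: the `PuzzleResult` record, the answer normalization, and the definition of stepwise accuracy as the mean per-puzzle fraction of annotated steps completed are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PuzzleResult:
    """Hypothetical record of one model attempt at one puzzle."""
    predicted_answer: str
    gold_answer: str
    steps_correct: int  # annotated reasoning steps the model completed correctly
    steps_total: int    # annotated reasoning steps in the reference trace


def normalize(answer: str) -> str:
    # Assumed normalization: ignore case, spaces, and punctuation.
    return "".join(ch for ch in answer.upper() if ch.isalnum())


def final_answer_accuracy(results):
    # Fraction of puzzles whose normalized final answer matches the reference.
    solved = sum(
        normalize(r.predicted_answer) == normalize(r.gold_answer) for r in results
    )
    return solved / len(results)


def stepwise_accuracy(results):
    # Assumed definition: mean per-puzzle fraction of reference steps completed.
    return sum(r.steps_correct / r.steps_total for r in results) / len(results)


# Made-up example data, not taken from the benchmark.
results = [
    PuzzleResult("blue moon", "BLUEMOON", steps_correct=4, steps_total=4),
    PuzzleResult("cat", "DOG", steps_correct=1, steps_total=4),
]
print(final_answer_accuracy(results))
print(stepwise_accuracy(results))
```

Under this assumed definition, partial progress on an unsolved puzzle still counts toward stepwise accuracy, so the stepwise figure can sit well above the final answer figure.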

Abstract (translated)

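Puzzlehunts are a genre of complex, multi-step puzzles that lack well-defined problem statements. Unlike conventional reasoning benchmarks, whose tasks come with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and to solve it through iterative reasoning, mirroring real-world settings such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance in such open-ended settings has not been adequately tested. This paper introduces PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with its final solution, detailed reasoning traces, and cognitive skill labels, enabling both holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models reach only 1-2% final answer accuracy; the best model solves only 14% of puzzles, with 40% stepwise accuracy. We demonstrate the value of the reasoning annotations: fine-tuning a small model on reasoning traces raises stepwise accuracy from 4% to 11%, whereas training on final answers alone degrades performance to near zero. Our error analysis shows that current models reason myopically and are constrained by the limits of language-based inference. They also lack the sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at [this link](https://puzzleworld.org) to support future work on building more general, open-ended, and creative reasoning systems.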

URL

https://arxiv.org/abs/2506.06211

PDF

https://arxiv.org/pdf/2506.06211.pdf

