Abstract
Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would. This paradigm has demonstrated impressive success on various tasks and has spurred advances in related benchmarks. Despite this promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which may forcibly distort the models' original thinking trajectories and thus fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore how IVS affects unconstrained reasoning performance. To address these gaps, we introduce ViC-Bench, a specialized benchmark consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task is equipped with a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. In addition, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors that influence VI-CoT. We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our benchmark is publicly available on Hugging Face.
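To make the free-style IVS idea concrete, below is a minimal Python sketch of a function-call-driven IVS loop for the maze-navigation task. This is an illustrative assumption, not the benchmark's actual API: every name here (MazeState, render_maze_state, apply_move, query_mllm) is a hypothetical placeholder. The point it shows is that after each model decision the pipeline re-renders the state the model actually produced and feeds that updated state back, rather than replaying a fixed, pre-rendered trajectory.

# Hypothetical sketch of a free-style IVS loop (maze navigation).
# All identifiers are illustrative placeholders, not ViC-Bench's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class MazeState:
    position: tuple[int, int]
    goal: tuple[int, int]
    walls: frozenset[tuple[int, int]]

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def apply_move(state: MazeState, move: str) -> MazeState:
    """Return the state after one step; invalid or blocked moves are no-ops."""
    dx, dy = MOVES.get(move, (0, 0))
    nxt = (state.position[0] + dx, state.position[1] + dy)
    if nxt in state.walls:
        return state
    return MazeState(nxt, state.goal, state.walls)

def render_maze_state(state: MazeState) -> str:
    """Stand-in for the IVS renderer: a text sketch here; in the
    benchmark this would be an image of the current maze."""
    return f"agent at {state.position}, goal at {state.goal}"

def solve(state: MazeState, query_mllm, max_steps: int = 20) -> bool:
    """Interleave model decisions with freshly generated IVS: the model
    always sees the state its own previous action produced."""
    for _ in range(max_steps):
        ivs = render_maze_state(state)   # free-style IVS via function call
        move = query_mllm(ivs)           # model picks the next action
        state = apply_move(state, move)
        if state.position == state.goal:
            return True
    return False

A scripted query_mllm stub can drive this loop for testing; the key design choice it illustrates is that the visual state is generated on demand from the model's own trajectory, so the model's thinking path is never forced onto a fixed sequence of pre-rendered states.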
URL
https://arxiv.org/abs/2505.14404