ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

2025-05-20 14:18:54
Xuecheng Wu, Jiaxing Liu, Danlei Huang, Xiaoyu Li, Yifan Wang, Chen Chen, Liya Ma, Xuezhi Cao, Junxiao Xue

Abstract

Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would. It has shown impressive success in various tasks and has driven advances in related benchmarks. Despite this progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the models' original thinking trajectories and thus fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affects untamed reasoning performance. To address these gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. In addition, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly available on Huggingface.

URL

https://arxiv.org/abs/2505.14404

PDF

https://arxiv.org/pdf/2505.14404.pdf

