Abstract
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work, we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at this https URL.
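The abstract describes a constant per-video token budget that trades off clip length against spatial resolution. The sketch below illustrates that trade-off only; it is not the paper's implementation, and the token budget, patch size, and candidate resolutions are illustrative assumptions rather than values from F2C.

```python
# Minimal sketch (not the F2C implementation) of balancing clip length against
# spatial resolution under a fixed visual-token budget per video.
# All constants here are hypothetical placeholders.

def tokens_per_frame(resolution: int, patch_size: int = 14) -> int:
    """Visual tokens for one square frame under a ViT-style patch grid."""
    return (resolution // patch_size) ** 2

def pick_resolution(num_clip_frames: int, token_budget: int,
                    candidates=(448, 336, 224, 112)) -> int:
    """Choose the largest candidate resolution whose total token count fits the budget."""
    for res in candidates:
        if num_clip_frames * tokens_per_frame(res) <= token_budget:
            return res
    return candidates[-1]

if __name__ == "__main__":
    budget = 16_384  # hypothetical per-video visual-token budget
    for frames in (8, 16, 32, 64):
        res = pick_resolution(frames, budget)
        print(f"{frames:>2} frames -> {res}px/frame, "
              f"{frames * tokens_per_frame(res)} tokens total")
```

As the frame count grows, the chosen resolution drops so that the total token count never exceeds the budget, which is the behavior the adaptive resolution strategy is said to guarantee.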
URL
https://arxiv.org/abs/2510.02262