Paper Reading AI Learner

From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

2025-10-02 17:43:01
Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler

Abstract

Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work, we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at this https URL.
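
The abstract describes the adaptive resolution strategy only at a high level. The snippet below is a minimal, hypothetical sketch (not the paper's F2C implementation) of the underlying budget trade-off: with a fixed visual-token budget per video, longer clips force a lower spatial resolution so the total token count stays constant. The function names, the ViT-style patch size of 14, and the budget value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed, not the paper's code) of trading spatial
# resolution against clip length under a fixed visual-token budget.

def tokens_per_frame(height: int, width: int, patch_size: int = 14) -> int:
    """Visual tokens a ViT-style encoder would produce for one frame."""
    return (height // patch_size) * (width // patch_size)


def choose_resolution(token_budget: int,
                      num_clips: int,
                      frames_per_clip: int,
                      base_hw: tuple[int, int] = (448, 448),
                      patch_size: int = 14) -> tuple[int, int]:
    """Shrink the frame resolution (in patch-size steps) until
    num_clips * frames_per_clip * tokens_per_frame fits the budget."""
    h, w = base_hw
    total_frames = num_clips * frames_per_clip
    while total_frames * tokens_per_frame(h, w, patch_size) > token_budget and h > patch_size:
        h -= patch_size
        w -= patch_size
    return h, w


if __name__ == "__main__":
    budget = 16_384                     # assumed fixed token budget per video
    for frames_per_clip in (1, 4, 8):   # 1 = isolated key frames, >1 = key clips
        h, w = choose_resolution(budget, num_clips=8, frames_per_clip=frames_per_clip)
        used = 8 * frames_per_clip * tokens_per_frame(h, w)
        print(f"{frames_per_clip:2d} frames/clip -> {h}x{w} per frame, {used} tokens total")
```

Running the sketch shows the intended behavior: single key frames keep the full base resolution, while 4- or 8-frame clips are encoded at progressively smaller resolutions so the per-video token count never exceeds the budget.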

Abstract (translated)

Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens generated from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of key frames, thereby reducing the token count, but such frame-based selection discards essential temporal dynamics, leading to weaker reasoning about motion and event continuity. In this work, we systematically explore the impact of temporal information and show that extending selection from isolated key frames to key clips (short, temporally coherent segments) improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks show that our training-free method, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical path for scaling Video LLMs to real-world video understanding applications. The project webpage is available at this https URL.

URL

https://arxiv.org/abs/2510.02262

PDF

https://arxiv.org/pdf/2510.02262.pdf

