Abstract
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach combines recurrent memory tokens with a SceneTilling algorithm, which segments videos into independent semantic units so that semantic integrity is preserved. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it remains as robust as PLLaVA even as video length increases up to 8 times. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the direct generation of streaming video captions without additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness. This sets a new foundation for long-form video-language models in both academic and practical applications.
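The mechanism described above can be illustrated with a minimal PyTorch sketch: recurrent memory tokens inside a bridge layer attend to each scene segment in turn, carrying a compact summary of the visual history forward. All names here (`MemoryBridge`, `scene_tiling_split`, the token counts and dimensions) are hypothetical placeholders chosen for illustration, not the released VideoLLaMB implementation.

```python
# Minimal sketch of a recurrent memory-token bridge over scene segments.
# Hypothetical illustration only; names and sizes are not from the paper's code.
import torch
import torch.nn as nn


def scene_tiling_split(frame_feats: torch.Tensor, num_segments: int):
    """Toy stand-in for SceneTilling: split frame features into contiguous
    chunks (the real algorithm cuts at semantic scene boundaries)."""
    return torch.chunk(frame_feats, num_segments, dim=1)


class MemoryBridge(nn.Module):
    """Bridge layer with recurrent memory tokens.

    For each segment, the memory tokens attend to that segment's frame
    features; the updated memory is carried to the next segment, so the
    tokens summarize everything seen so far.
    """

    def __init__(self, dim: int = 768, num_memory_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, num_memory_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor, num_segments: int = 4):
        # frame_feats: (batch, num_frames * tokens_per_frame, dim)
        batch = frame_feats.size(0)
        memory = self.memory.expand(batch, -1, -1)
        bridged = []
        for segment in scene_tiling_split(frame_feats, num_segments):
            # Memory tokens query the current segment's visual tokens.
            updated, _ = self.cross_attn(memory, segment, segment)
            memory = self.norm(memory + updated)  # recurrent memory update
            bridged.append(memory)                # per-segment summary
        # The concatenated memory states would then be projected into the
        # LLM's embedding space and prepended to the text prompt.
        return torch.cat(bridged, dim=1)


if __name__ == "__main__":
    feats = torch.randn(1, 16 * 32, 768)          # e.g. 16 frames x 32 tokens each
    out = MemoryBridge()(feats, num_segments=4)
    print(out.shape)                              # (1, 32, 768): 4 segments x 8 memory tokens
```

Because only the fixed-size memory tokens are carried across segments, the cost of adding more frames grows linearly rather than quadratically, which is the property that allows a model trained on 16 frames to handle hundreds of frames at inference time.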
URL
https://arxiv.org/abs/2409.01071