Abstract
The past year has seen significant advances in video-based large language models (video LLMs). However, developing a unified model for both short and long video understanding remains an unresolved challenge. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: CLIP-based visual-prompt alignment, which extracts visual information relevant to the user's instructions; prompt-guided pooling, which compresses the visual sequence to arbitrary scales using convolution-style pooling; and CLIP context extension, designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at this https URL.
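To make the pooling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-dimensional token sequence, cosine similarity as the visual-prompt alignment score, a softmax temperature of 0.07, and adaptive-pooling-style windows to reach an arbitrary output length. The function name `prompt_guided_pool` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math


def prompt_guided_pool(tokens, prompt, out_len, temperature=0.07):
    """Compress `tokens` (a list of D-dim vectors) to `out_len` vectors.

    Each output is a weighted mean over a contiguous window of tokens.
    The weights are a softmax, taken within the window, of each token's
    cosine similarity to the `prompt` vector. All names and defaults
    here are illustrative assumptions, not taken from the paper.
    """
    n = len(tokens)
    if not 1 <= out_len <= n:
        raise ValueError("out_len must be between 1 and len(tokens)")

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # A zero vector carries no direction, so treat it as unaligned.
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scores = [cosine(t, prompt) / temperature for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]

    dim = len(tokens[0])
    pooled = []
    for i in range(out_len):
        # Window bounds follow the usual adaptive-pooling rule, so any
        # out_len <= n is reachable; windows may overlap by one token.
        start = (i * n) // out_len
        end = ((i + 1) * n + out_len - 1) // out_len
        total = sum(weights[start:end])
        pooled.append([
            sum(weights[j] * tokens[j][d] for j in range(start, end)) / total
            for d in range(dim)
        ])
    return pooled
```

With a prompt that matches none of the tokens (for example, a zero vector), all weights are equal and the sketch reduces to plain average pooling. With an informative prompt, tokens aligned with it dominate each window, which is the instruction-aware aggregation the abstract describes.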
URL
https://arxiv.org/abs/2411.02327