Abstract
Existing methods for long video understanding primarily target videos lasting only tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos poses two main challenges: locating key information and performing long-range reasoning. We therefore propose DrVideo, a document-retrieval-based system for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task, so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document, initially retrieves key frames, and augments the information of these frames, which serves as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and, once sufficient question-related information is gathered, produce final predictions in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods by +3.8% accuracy on the EgoSchema benchmark (3-minute videos), +17.9% in MovieChat-1K break mode and +38.0% in MovieChat-1K global mode (10-minute videos), and +30.2% on the LLama-Vid QA dataset (over 60 minutes).
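The retrieve-augment-answer loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the keyword-overlap retrieval, and the toy sufficiency check are all assumptions standing in for the LLM-based components DrVideo actually uses.

```python
# Hypothetical sketch of DrVideo's agent-style loop. Captions stand in for the
# text-based long document; keyword overlap stands in for LLM retrieval.

def retrieve(document, question, k=2):
    """Rank captioned frames by naive keyword overlap with the question."""
    words = set(question.lower().split())
    return sorted(
        document,
        key=lambda e: -len(words & set(e["caption"].lower().split())),
    )[:k]

def answer_when_sufficient(evidence, question):
    """Stand-in for the LLM judge: answer once enough evidence is gathered."""
    if len(evidence) >= 2:  # toy sufficiency criterion
        return "answer based on: " + "; ".join(e["caption"] for e in evidence)
    return None  # signal that more information is still missing

def drvideo_loop(document, question, max_rounds=3):
    """Iteratively search for missing key frames, then answer."""
    evidence, seen = [], set()
    for _ in range(max_rounds):
        for e in retrieve(document, question):
            if e["frame"] not in seen:  # augment with newly found key frames
                seen.add(e["frame"])
                evidence.append(e)
        result = answer_when_sufficient(evidence, question)
        if result is not None:
            return result
    return "insufficient information"

# Tiny mock "long document" of per-frame captions.
doc = [
    {"frame": 0, "caption": "a person opens the fridge"},
    {"frame": 1, "caption": "the person pours milk into a glass"},
    {"frame": 2, "caption": "a dog sleeps on the couch"},
]
print(drvideo_loop(doc, "what does the person pour into the glass"))
```

In the real system each of these stubs would be an LLM or vision-language-model call; the sketch only shows the control flow of the iterative loop.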
URL
https://arxiv.org/abs/2406.12846