Abstract
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
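The abstract describes a training-free pipeline of three few-shot prompted stages (event parsing, grounding, reasoning) communicating through an external memory. A minimal sketch of that control flow is below; all function names are illustrative assumptions (not from the paper's code), and the large-model calls are stubbed out so only the stage decomposition and shared memory are shown.

```python
# Hypothetical sketch of a MoReVQA-style multi-stage pipeline.
# Names and prompts are illustrative, not the paper's actual API;
# the model call is stubbed so the control flow is the focus.

def stub_llm(prompt: str) -> str:
    # Placeholder for a few-shot prompted large-model call.
    if "Parse events" in prompt:
        return "events: [person opens door]"
    if "Ground" in prompt:
        return "frames: [12, 13, 14]"
    return "answer: yes"

def event_parser(question: str, memory: dict) -> None:
    # Stage 1: parse the question into event phrases (no video access yet).
    memory["events"] = stub_llm(f"Parse events in: {question}")

def grounding(video: str, memory: dict) -> None:
    # Stage 2: ground the parsed events to video frames/segments.
    memory["grounding"] = stub_llm(f"Ground {memory['events']} in {video}")

def reasoning(question: str, memory: dict) -> str:
    # Stage 3: reason over the grounded evidence to produce an answer.
    return stub_llm(f"Answer {question} given {memory['grounding']}")

def answer_question(video: str, question: str) -> tuple[str, dict]:
    memory: dict = {}  # external memory shared across stages
    event_parser(question, memory)
    grounding(video, memory)
    answer = reasoning(question, memory)
    # memory now holds interpretable intermediate outputs from each stage
    return answer, memory
```

Because each stage writes its intermediate result into the shared memory, the system's reasoning trace stays inspectable, which is the interpretability property the abstract emphasizes.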
URL
https://arxiv.org/abs/2404.06511