Abstract
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to answer a question accurately. When this task is extended to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions whose required information is not necessarily present in the visual input. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but must also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and highlight future challenges associated with this task. The dataset is made publicly available here: this https URL.
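To make the described data composition concrete, below is a minimal sketch of how one annotated dialogue might be represented. The field names (video_id, segment, turns, requires_external_knowledge) are illustrative assumptions for this sketch, not the released schema of the dataset.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout of a single annotated dialogue; the actual released
# schema and field names may differ from this illustration.

@dataclass
class DialogueTurn:
    speaker: str   # e.g. "questioner" or "answerer"
    text: str      # the utterance
    requires_external_knowledge: bool = False  # answer not visible in the video

@dataclass
class DialogueRecord:
    video_id: str       # identifier of the source video
    segment: tuple      # (start_sec, end_sec) grounding the dialogue context
    turns: List[DialogueTurn] = field(default_factory=list)

# Example: a short exchange grounded in a specific video segment, where the
# question needs knowledge that is not visually present in the clip.
record = DialogueRecord(
    video_id="video_0001",
    segment=(12.0, 34.5),
    turns=[
        DialogueTurn("questioner", "What breed is the dog chasing the ball?", True),
        DialogueTurn("answerer", "It looks like a border collie, a herding breed."),
    ],
)
print(len(record.turns))  # 2
```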
URL
https://arxiv.org/abs/2506.09953