Paper Reading AI Learner

Outside Knowledge Conversational Video Dataset -- Dialoguing over Videos

2025-06-11 17:23:35
Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck

Abstract

In outside knowledge visual question answering (OK-VQA), a model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions whose required information is not necessarily present in the visual input. Moreover, the context of the overall conversation must be considered for subsequent dialogue turns. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and highlight the challenges this task poses for future work. The dataset is made publicly available here: this https URL.
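
To make the task setup concrete, here is a minimal sketch of how one dialogue record in such a dataset could be represented. The field names (video_id, segment, turns) and the example content are illustrative assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one annotated dialogue; the field names
# below are illustrative assumptions, not the dataset's actual schema.
@dataclass
class DialogueTurn:
    speaker: str  # e.g. "questioner" or "answerer"
    text: str     # the utterance for this turn

@dataclass
class VideoDialogue:
    video_id: str                 # identifier of the source video
    segment: tuple[float, float]  # (start_s, end_s) of the grounded clip
    turns: list[DialogueTurn] = field(default_factory=list)

# Toy example: the questions reference the visible segment, but answering
# requires outside knowledge that is not shown in the frames themselves.
example = VideoDialogue(
    video_id="vid_0001",
    segment=(12.0, 34.5),
    turns=[
        DialogueTurn("questioner", "What instrument is the performer holding?"),
        DialogueTurn("answerer", "A sitar."),
        DialogueTurn("questioner", "Where does that instrument come from?"),
        # The answer below is outside knowledge: it is not visually present.
        DialogueTurn("answerer", "It originated in the Indian subcontinent."),
    ],
)

for turn in example.turns:
    print(f"{turn.speaker}: {turn.text}")
```

Under this representation, a model would consume the grounded video segment together with the preceding turns as conversational context, then retrieve or recall external knowledge to produce the next answer.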

URL

https://arxiv.org/abs/2506.09953

PDF

https://arxiv.org/pdf/2506.09953.pdf

