Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

2023-12-16 03:17:30
Mingfei Han, Xiaojun Chang, Heng Wang, Linjie Yang

Abstract

A short video clip may contain a progression of multiple events and an interesting storyline. A human needs to both capture the event in every shot and associate the events with one another to understand the story behind them. In this work, we present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the imperfect generated summaries can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Abstract (translated)

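A short video may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in each shot and connect the events together to understand the story behind them. In this work, we propose Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To promote better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several different tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary poses some challenges. However, the imperfect generated summaries can already significantly improve the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.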

URL

https://arxiv.org/abs/2312.10300

PDF

https://arxiv.org/pdf/2312.10300.pdf

