VideoXum: Cross-modal Visual and Textural Summarization of Videos

2023-03-21 17:51:23
Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

Abstract

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and textual narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated from ActivityNet; after filtering out videos that do not meet the length requirement, 14,001 long videos remain in the new dataset. Each video in the reannotated dataset has a human-annotated video summary and a corresponding narrative summary. We then design a novel end-to-end model -- VTSUM-BLIP -- to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
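The abstract does not spell out how VT-CLIPScore is computed, but the general idea behind a CLIP-based consistency metric can be sketched: encode the frames of the generated video summary and the generated text summary with a shared CLIP model, then average their cosine similarities. The snippet below is a minimal illustration only, not the paper's metric; it assumes the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers, and the function name vt_clip_consistency and all variable names are made up for this sketch.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def vt_clip_consistency(frames, text_summary):
    # Encode the summary-video frames and the textual summary with the same
    # CLIP model, then report the mean frame-text cosine similarity.
    inputs = processor(text=[text_summary], images=frames,
                       return_tensors="pt", padding=True, truncation=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()

# Example usage with frames sampled from a generated video summary
# (file names are placeholders):
# frames = [Image.open(p) for p in ("f0.jpg", "f1.jpg", "f2.jpg")]
# print(vt_clip_consistency(frames, "A man assembles a bookshelf step by step."))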

URL

https://arxiv.org/abs/2303.12060

PDF

https://arxiv.org/pdf/2303.12060.pdf

