Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

2023-03-11 11:00:16
Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo

Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from this restriction, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and the corresponding event segments. Instead of coarse-grained video-language alignment, we present two dual pretext tasks that encourage fine-grained segment-level alignment: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground possible event proposals given a set of sentences by estimating cross-modal distances in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representations to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge.
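The two pretext tasks hinge on two computations the abstract only names: a cross-modal distance between event proposals and sentences in a joint semantic space (for TEG), and a set-level label assignment between detected events and annotated sentences driven by a semantic-aware cost. The paper's exact formulation is not given here, so the following is a minimal PyTorch-style sketch under stated assumptions: cosine distance as the cross-modal metric, an L1 term over normalized (start, end) spans as the localization cost, and Hungarian matching via SciPy. The names `cross_modal_distance`, `semantic_aware_match`, and the weights `w_sem`, `w_loc` are hypothetical, introduced only for illustration.

```python
# Hedged sketch: set-level matching between event proposals and sentences.
# Assumes cosine distance in a joint embedding space and Hungarian
# assignment; the blended cost below is a hypothetical stand-in for the
# paper's semantic-aware cost, not its actual formulation.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def cross_modal_distance(event_emb, text_emb):
    """Pairwise cosine distance between N event and M sentence embeddings."""
    e = F.normalize(event_emb, dim=-1)   # (N, D)
    t = F.normalize(text_emb, dim=-1)    # (M, D)
    return 1.0 - e @ t.T                 # (N, M), each entry in [0, 2]


def semantic_aware_match(event_emb, event_spans, text_emb, text_spans,
                         w_sem=1.0, w_loc=1.0):
    """Assign sentences to event proposals with a blended cost.

    event_spans: (N, 2) proposal (start, end), normalized to [0, 1].
    text_spans:  (M, 2) annotated (start, end) for each sentence.
    Returns index pairs (event_idx, text_idx) from Hungarian matching.
    """
    sem_cost = cross_modal_distance(event_emb, text_emb)   # (N, M)
    loc_cost = torch.cdist(event_spans, text_spans, p=1)   # (N, M) L1 over spans
    cost = w_sem * sem_cost + w_loc * loc_cost
    event_idx, text_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return event_idx, text_idx


# Usage with random stand-in features: 10 proposals, 4 sentences.
events, texts = torch.randn(10, 256), torch.randn(4, 256)
spans_e = torch.rand(10, 2).sort(dim=-1).values  # sorted so start <= end
spans_t = torch.rand(4, 2).sort(dim=-1).values
print(semantic_aware_match(events, spans_e, texts, spans_t))
```

The point of blending a semantic term into the assignment cost is that a proposal whose boundaries are slightly off but whose content matches a sentence can still win the match, which is the stated motivation for mitigating ambiguous boundary annotations.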

URL

https://arxiv.org/abs/2303.06378

PDF

https://arxiv.org/pdf/2303.06378.pdf

