Weakly Supervised Video Moment Retrieval From Text Queries

2019-04-05 21:11:25
Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury

Abstract

Several recent methods address text-to-video moment retrieval using natural language queries, but they require full supervision during training. Acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. To cope with this issue, we introduce the problem of learning from weak labels for the task of text-to-video moment retrieval. The supervision is weak because, during training, we only have access to video-text pairs, not to the temporal extent of the video to which each text description relates. We propose a joint visual-semantic embedding framework that learns the notion of relevant segments from video using only video-level sentence descriptions. Specifically, our main idea is to exploit the latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA). TGA is then used during the test phase to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves performance comparable to state-of-the-art fully supervised approaches.
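
The abstract does not give the formulation of TGA, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes frame features and a sentence feature that already live in a shared embedding space, uses cosine similarity followed by a softmax as the attention, and picks a moment with a fixed-length sliding window over the attention weights. The function names `text_guided_attention` and `best_moment`, the similarity measure, and the windowing rule are all hypothetical choices made for this example.

```python
import numpy as np


def text_guided_attention(frame_feats, sent_feat):
    """Attend over frames using a sentence embedding (illustrative only).

    frame_feats: array of shape (T, d), one embedding per frame.
    sent_feat:   array of shape (d,), the sentence embedding.
    Returns (weights, pooled): per-frame attention weights of shape (T,)
    and the attention-weighted video feature of shape (d,).
    """
    # Assumption: cosine similarity in the joint space scores each frame.
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    s = sent_feat / (np.linalg.norm(sent_feat) + 1e-8)
    sims = f @ s

    # Softmax over time turns similarities into attention weights.
    e = np.exp(sims - sims.max())
    weights = e / e.sum()

    # Video-level feature: attention-weighted sum of frame features.
    pooled = weights @ frame_feats
    return weights, pooled


def best_moment(weights, window):
    """Return (start, end) of the window with the highest mean attention.

    Assumption: moments are scored by average attention inside a
    fixed-length window; `end` is exclusive.
    """
    kernel = np.ones(window) / window
    scores = np.convolve(weights, kernel, mode="valid")
    start = int(np.argmax(scores))
    return start, start + window


if __name__ == "__main__":
    # Toy video: frames 2 and 3 align with the query direction [1, 0].
    frames = np.array(
        [[0, 1], [0, 1], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float
    )
    query = np.array([1.0, 0.0])
    weights, pooled = text_guided_attention(frames, query)
    print(weights)
    print(best_moment(weights, window=2))
```

During weakly supervised training, only the pooled video feature would be compared against the sentence, so no temporal boundaries are needed; at test time the same attention weights indicate which frames the query relates to.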

URL

https://arxiv.org/abs/1904.03282

PDF

https://arxiv.org/pdf/1904.03282.pdf

