HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

2019-06-07 20:48:19
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: this http URL.
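
At its core, the retrieval setup described in the abstract amounts to learning a joint embedding space where a clip and its narration land close together while mismatched pairs are pushed apart. Below is a minimal PyTorch sketch of that idea, assuming precomputed video and text features; the layer sizes, margin value, and max-margin ranking loss here are illustrative assumptions for exposition, not the authors' exact model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project video and text features into a shared embedding space."""
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # L2-normalize so a dot product equals cosine similarity.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def max_margin_loss(v, t, margin=0.2):
    # sim[i, j]: similarity of video i and caption j; the diagonal
    # holds the matching clip/narration pairs.
    sim = v @ t.t()
    pos = sim.diag().unsqueeze(1)
    # Hinge on both retrieval directions: video-to-text and text-to-video.
    cost = (margin + sim - pos).clamp(min=0) \
         + (margin + sim - pos.t()).clamp(min=0)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)  # drop positives
    return cost[off_diag].mean()

# Toy usage: random features stand in for precomputed clip/narration pairs.
model = JointEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = max_margin_loss(v, t)
loss.backward()

In the paper's setting, the positive pairs would come from clips paired with their automatically transcribed narrations rather than manual captions, which is what lets the approach scale to 136 million clips without additional annotation.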

URL

https://arxiv.org/abs/1906.03327

PDF

https://arxiv.org/pdf/1906.03327.pdf

