HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

2019-06-07 20:48:19
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: this http URL.
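
At its core, the retrieval setup described in the abstract amounts to learning a joint embedding space where a clip and its narration land close together while mismatched pairs are pushed apart. Below is a minimal PyTorch sketch of that idea, assuming precomputed video and text features; the layer sizes, margin value, and max-margin ranking loss here are illustrative assumptions for exposition, not the authors' exact model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project video and text features into a shared embedding space."""
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # L2-normalize so a dot product equals cosine similarity.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def max_margin_loss(v, t, margin=0.2):
    # sim[i, j]: similarity of video i and caption j; the diagonal
    # holds the matching clip/narration pairs.
    sim = v @ t.t()
    pos = sim.diag().unsqueeze(1)
    # Hinge on both retrieval directions: video-to-text and text-to-video.
    cost = (margin + sim - pos).clamp(min=0) \
         + (margin + sim - pos.t()).clamp(min=0)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)  # drop positives
    return cost[off_diag].mean()

# Toy usage: random features stand in for precomputed clip/narration pairs.
model = JointEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = max_margin_loss(v, t)
loss.backward()

In the paper's setting, the positive pairs would come from clips paired with their automatically transcribed narrations rather than manual captions, which is what lets the approach scale to 136 million clips without additional annotation.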

URL

https://arxiv.org/abs/1906.03327

PDF

https://arxiv.org/pdf/1906.03327.pdf

