Abstract
Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim to learn text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show that MEE generalizes to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks. Code is available at: this https URL
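The abstract does not spell out how the experts are combined or how a missing modality is handled. The following is only a minimal sketch of one plausible reading: each modality has its own text-video similarity, and the mixture weights are renormalized over the modalities actually present. The function names (`cosine`, `mixture_similarity`), the modality keys, and the use of a softmax over scalar gate scores are assumptions made for illustration, not details taken from the abstract or the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mixture_similarity(text_embs, video_embs, gate_scores):
    """Weighted sum of per-modality similarities, skipping missing modalities.

    text_embs:   dict modality -> text embedding for that expert (hypothetical layout)
    video_embs:  dict modality -> video embedding, or None when the modality is missing
    gate_scores: dict modality -> unnormalized scalar weight (assumed to come from the text)
    """
    available = [m for m, emb in video_embs.items() if emb is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    # Softmax restricted to the available modalities, so the weights still sum to 1.
    max_score = max(gate_scores[m] for m in available)
    exps = {m: math.exp(gate_scores[m] - max_score) for m in available}
    total = sum(exps.values())

    return sum(
        (exps[m] / total) * cosine(text_embs[m], video_embs[m])
        for m in available
    )


# Usage: an image-only sample has appearance features but no motion or face descriptors.
text = {"appearance": [1.0, 0.0], "motion": [0.0, 1.0], "face": [1.0, 1.0]}
image_sample = {"appearance": [1.0, 0.0], "motion": None, "face": None}
gates = {"appearance": 0.0, "motion": 0.0, "face": 0.0}
score = mixture_similarity(text, image_sample, gates)
```

Under these assumptions, an image contributes only through its appearance expert, while a video with all descriptors present contributes through every expert. That is one way a single model could be trained on image and video datasets together.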
URL
https://arxiv.org/abs/1804.02516