Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

2018-04-07 06:59:45
Antoine Miech, Ivan Laptev, Josef Sivic

Abstract

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show that MEE generalizes to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks. Code is available at: this https URL
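
The abstract does not spell out how missing modalities are handled. One plausible reading of a mixture of embedding experts is sketched below: each modality contributes its own text-video similarity, text-dependent gate weights mix them, and the weights of modalities absent from a given video are zeroed and renormalized. This is a minimal NumPy sketch under those assumptions, not the authors' released code; the function names, the dictionary layout, and the modality names are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mee_similarity(text_embs, video_embs, gate_logits):
    """Mix per-modality similarities, skipping missing modalities.

    text_embs:   dict modality -> text embedding for that expert
    video_embs:  dict modality -> video embedding, or None if missing
    gate_logits: dict modality -> unnormalized mixture weight (assumed
                 to be predicted from the sentence in a real model)
    """
    names = sorted(text_embs)
    weights = softmax(np.array([gate_logits[m] for m in names], dtype=float))
    available = np.array([video_embs.get(m) is not None for m in names])
    if not available.any():
        raise ValueError("at least one modality must be present")
    # Zero out missing experts, then renormalize so the weights sum to 1.
    weights = np.where(available, weights, 0.0)
    weights = weights / weights.sum()
    score = 0.0
    for w, m in zip(weights, names):
        if video_embs.get(m) is not None:
            score += w * float(np.dot(text_embs[m], video_embs[m]))
    return score, dict(zip(names, weights))

# Hypothetical example: an image-like sample with no motion or face input
# would simply pass None for those modalities.
text = {"appearance": np.array([1.0, 0.0]),
        "motion": np.array([0.0, 1.0]),
        "face": np.array([1.0, 0.0])}
video = {"appearance": np.array([1.0, 0.0]),
         "motion": np.array([1.0, 0.0]),
         "face": None}
logits = {"appearance": 0.0, "motion": 0.0, "face": 0.0}
score, used_weights = mee_similarity(text, video, logits)
```

Under this reading, a still image with only an appearance descriptor and a video with appearance, motion, and face descriptors can be scored by the same function, which is what would let image and video datasets be mixed during training.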

URL

https://arxiv.org/abs/1804.02516

PDF

https://arxiv.org/pdf/1804.02516.pdf

