Paper Reading AI Learner

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

2016-09-26 19:14:12
Atousa Torabi, Niket Tandon, Leonid Sigal

Abstract

Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural-language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural-language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on the annotation task and 18.9% on the video retrieval task for a subset of 1000 samples. For the multiple-choice test, our best model achieves an accuracy of 58.11% over the whole LSMDC16 public test set.
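
The abstract itself includes no code; the snippet below is a minimal, hypothetical sketch of the kind of pipeline it describes: two projection heads map precomputed sentence and video features into a shared embedding space, training uses an in-batch pairwise ranking loss, and evaluation reports Recall@K for both annotation (caption-to-video) and retrieval (video-to-caption) directions. All names and dimensions (JointEmbedding, text_dim=300, video_dim=2048, embed_dim=512, the margin value) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a joint language-visual embedding
# scored with Recall@K, assuming precomputed sentence and video features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects sentence and video features into a shared, L2-normalized space."""
    def __init__(self, text_dim=300, video_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feats, video_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t, v

def ranking_loss(t, v, margin=0.2):
    """Pairwise hinge ranking loss over in-batch negatives."""
    sims = t @ v.t()                                  # cosine similarities
    pos = sims.diag().unsqueeze(1)                    # matched caption-video pairs
    cost_t = (margin + sims - pos).clamp(min=0)       # caption vs. wrong videos
    cost_v = (margin + sims - pos.t()).clamp(min=0)   # video vs. wrong captions
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

def recall_at_k(queries, targets, k=10):
    """Fraction of queries whose ground-truth match ranks in the top k."""
    sims = queries @ targets.t()
    ranks = sims.argsort(dim=1, descending=True)
    gt = torch.arange(sims.size(0)).unsqueeze(1)
    return (ranks[:, :k] == gt).any(dim=1).float().mean().item()

# Usage on random features standing in for a 1000-sample evaluation subset.
model = JointEmbedding()
text = torch.randn(1000, 300)
video = torch.randn(1000, 2048)
t, v = model(text, video)
print("Recall@10 (annotation):", recall_at_k(t, v, k=10))
print("Recall@10 (retrieval): ", recall_at_k(v, t, k=10))
```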

URL

https://arxiv.org/abs/1609.08124

PDF

https://arxiv.org/pdf/1609.08124.pdf

