Paper Reading AI Learner

MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

2018-09-19 15:50:18
Oliver Nina, Washington Garcia, Scott Clouse, Alper Yilmaz

Abstract

Learning visual feature representations for video analysis is a daunting task that requires a large number of training samples and a proper generalization framework. Many current state-of-the-art methods for video captioning and movie description rely on simple encoding mechanisms through recurrent neural networks to encode temporal visual information extracted from video data. In this paper, we introduce a novel multitask encoder-decoder framework for automatic semantic description and captioning of video sequences. In contrast to current approaches, our method relies on distinct decoders that train a visual encoder in a multitask fashion. Our system does not depend on multiple labels per video and tolerates scarce training data, working even with datasets where only a single annotation is available per video. Our method improves on current state-of-the-art methods in several metrics on multi-caption and single-caption datasets. To the best of our knowledge, ours is the first method to use a multitask approach for encoding video features. Our method demonstrated its robustness at the Large Scale Movie Description Challenge (LSMDC) 2017, where it won the movie description task and its results were ranked, among all competitors, as the most helpful for the visually impaired.
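The core idea, a single visual encoder whose parameters receive gradients from several distinct decoders, can be illustrated with a minimal PyTorch sketch. All module names, dimensions, the choice of a feature-reconstruction auxiliary task, and the loss weight alpha are assumptions made for illustration here; they are not the authors' implementation.

```python
# Hypothetical sketch of a multitask-trained video encoder:
# one shared encoder, two distinct decoders (captioning + reconstruction).
# The auxiliary task and all hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

class SharedVideoEncoder(nn.Module):
    """Encodes a sequence of per-frame CNN features into a single state."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):              # (B, T, feat_dim)
        _, (h, _) = self.rnn(frame_feats)
        return h[-1]                              # (B, hidden_dim)

class CaptionDecoder(nn.Module):
    """Generates caption logits conditioned on the encoded video state."""
    def __init__(self, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_state, captions):     # captions: (B, L)
        h0 = video_state.unsqueeze(0)             # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        x, _ = self.rnn(self.embed(captions), (h0, c0))
        return self.out(x)                        # (B, L, vocab_size)

class ReconstructionDecoder(nn.Module):
    """Assumed auxiliary task: reconstruct the mean frame feature."""
    def __init__(self, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, feat_dim)

    def forward(self, video_state):
        return self.fc(video_state)

def multitask_loss(frame_feats, captions, encoder, cap_dec, rec_dec, alpha=0.5):
    state = encoder(frame_feats)
    # Captioning loss: predict each next token from the previous tokens.
    logits = cap_dec(state, captions[:, :-1])
    cap_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    # Auxiliary loss: gradients from a second decoder also shape the encoder.
    rec_loss = nn.functional.mse_loss(rec_dec(state), frame_feats.mean(dim=1))
    return cap_loss + alpha * rec_loss
```

The point of the second decoder in such a setup is that its gradients also flow into the shared encoder, regularizing the visual representation even when only one caption per clip is available.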

URL

https://arxiv.org/abs/1809.07257

PDF

https://arxiv.org/pdf/1809.07257.pdf

