
End-to-End Video Captioning with Multitask Reinforcement Learning

2018-03-21 14:51:17
Lijun Li, Boqing Gong

Abstract

Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memory) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, these limitations of E2E learning are especially amplified by the fact that both the input videos and the output captions are lengthy sequences. Indeed, state-of-the-art video captioning methods process video frames with convolutional neural networks and generate captions by unrolling recurrent neural networks. If we connect them in an E2E manner, the resulting model is both memory-consuming and data-hungry, making it extremely hard to train. In this paper, we propose a multitask reinforcement learning approach to training an E2E video captioning model. The main idea is to mine and construct as many effective tasks as possible (e.g., attributes, rewards, and the captions themselves) from the human-captioned videos, so that they jointly regulate the search space of the E2E neural network, from which an E2E video captioning model can be found that also generalizes to the testing phase. To the best of our knowledge, this is the first video captioning model trained end-to-end from raw video input to caption output. Experimental results show that such a model outperforms existing ones by a large margin on two benchmark video captioning datasets.
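
The abstract only sketches the model, but the pipeline it describes (a CNN over raw frames feeding an unrolled RNN decoder, plus auxiliary attribute and reward tasks) can be illustrated concretely. Below is a minimal PyTorch-style sketch, not the authors' implementation: the tiny CNN trunk, the mean-pooling over frames, the layer sizes, and the `reinforce_loss` helper with a self-critical (greedy) baseline are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class E2ECaptioner(nn.Module):
    """Minimal CNN-encoder / RNN-decoder captioner with an auxiliary
    attribute head, standing in for the multitask E2E model (sketch)."""

    def __init__(self, vocab_size, num_attributes, hidden=512):
        super().__init__()
        # Frame encoder: a tiny CNN stands in for the large image trunk
        # (e.g., an Inception/ResNet) a real system would use.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab_size)      # captioning task
        self.attr_head = nn.Linear(hidden, num_attributes)  # mined-attribute task

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) raw video; captions: (B, L) word ids.
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        video = feats.mean(dim=1)              # mean-pool over time
        attr_logits = self.attr_head(video)    # predict mined attributes
        h0 = video.unsqueeze(0)                # seed the decoder with the video code
        out, _ = self.decoder(self.embed(captions), (h0, torch.zeros_like(h0)))
        return self.word_head(out), attr_logits


def reinforce_loss(seq_log_probs, sampled_reward, greedy_reward):
    # Self-critical policy-gradient surrogate: the greedy decode serves as a
    # baseline, so sampled captions scoring above it are reinforced. This is
    # one plausible instantiation of the paper's reward-based task (e.g.,
    # with a CIDEr reward); the paper's exact reward design may differ.
    advantage = (sampled_reward - greedy_reward).detach()
    return -(advantage * seq_log_probs).mean()
```

In training, the cross-entropy captioning loss, the attribute-classification loss, and the REINFORCE term would be combined into one multitask objective; the exact task weighting and reward definition are details of the paper and are not reproduced here.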

URL

https://arxiv.org/abs/1803.07950

PDF

https://arxiv.org/pdf/1803.07950.pdf

