Consensus-based Sequence Training for Video Captioning

2017-12-27 09:38:52
Sang Phan, Gustav Eje Henter, Yusuke Miyao, Shin'ichi Satoh

Abstract

Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each step to make training converge. We propose a fast approach to optimize one's objective of interest through the REINFORCE algorithm. First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm. Second, we propose to use the consensus among ground-truth captions of the same video as the baseline reward. This can be computed very efficiently. We call the complete proposal Consensus-based Sequence Training (CST). Applied to the MSRVTT video captioning benchmark, our proposals train significantly faster than comparable methods and establish a new state-of-the-art on the task, improving the CIDEr score from 47.3 to 54.2.
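
To make the two ingredients concrete, here is a minimal Python sketch, assuming a sentence-level reward such as CIDEr. The toy `reward` function is a hypothetical stand-in for the real metric, and the helper names `cst_weights` and `weighted_xent` are illustrative, not from the paper: REINFORCE applied to ground-truth captions instead of model samples scales each caption's log-likelihood by its reward minus a baseline, and the baseline is the consensus (here, the leave-one-out mean reward) among the video's ground-truth captions.

```python
def reward(caption, references):
    """Toy stand-in for a sentence-level metric (the paper uses CIDEr):
    fraction of unique reference tokens that the caption covers."""
    cap = set(caption.split())
    ref = {w for r in references for w in r.split()}
    return len(cap & ref) / max(len(ref), 1)

def cst_weights(gt_captions):
    """Score each ground-truth caption against the others (leave-one-out),
    then subtract the mean reward as the consensus baseline."""
    rewards = [reward(c, gt_captions[:i] + gt_captions[i + 1:])
               for i, c in enumerate(gt_captions)]
    baseline = sum(rewards) / len(rewards)  # consensus among ground truths
    return [r - baseline for r in rewards]

def weighted_xent(log_probs, weights):
    """Weighted cross-entropy: each ground-truth caption's model
    log-likelihood is scaled by its (reward - baseline) weight,
    which is the REINFORCE update with samples replaced by ground truth."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Example with one video's ground-truth captions and dummy per-sentence
# log-likelihoods from a captioning model:
caps = ["a man plays a guitar", "a person is playing guitar", "a dog runs"]
print(weighted_xent([-3.1, -2.8, -4.0], cst_weights(caps)))
```

Because the weights depend only on the ground-truth captions and the chosen metric, they can be precomputed once before training, which is what would make this pre-training phase cheap compared with sampling-based RL.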

URL

https://arxiv.org/abs/1712.09532

PDF

https://arxiv.org/pdf/1712.09532.pdf
