Paper Reading AI Learner

End-to-End Dense Video Captioning with Masked Transformer

2018-04-03 04:11:00
Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong

Abstract

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods for dense video captioning tackle this problem by building two models, i.e. an event proposal model and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description on event proposal generation, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event over the encoded features. This masking network converts the event proposal to a differentiable mask, which ensures consistency between the proposal and the caption during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where we achieve METEOR scores of 10.12 and 6.58, respectively.
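The key mechanism described in the abstract is converting an event proposal into a differentiable temporal mask that gates what the captioning decoder can attend to, so the caption loss can back-propagate into proposal generation. The following is a minimal PyTorch-style sketch of that idea, not the authors' implementation: the class name DifferentiableProposalMask, the (center, length) proposal parameterization, and the sigmoid-product window are assumptions made here for illustration; the paper's actual masking network is learned jointly with the anchor-based proposal decoder.

```python
# Minimal sketch of a differentiable proposal mask (illustrative names, not the paper's code).
import torch
import torch.nn as nn

class DifferentiableProposalMask(nn.Module):
    """Turn a (center, length) event proposal into a soft temporal mask.

    The mask gates the encoder features so the captioning decoder attends
    only inside the proposed event, while gradients from the caption loss
    can still flow back into the proposal boundaries.
    """
    def __init__(self, sharpness: float = 10.0):
        super().__init__()
        self.sharpness = sharpness  # larger values push the soft mask toward a hard 0/1 window

    def forward(self, center, length, num_frames):
        # center, length: (batch,) proposal parameters in relative time [0, 1]
        # positions: (num_frames,) relative temporal positions of the encoder states
        positions = torch.linspace(0.0, 1.0, num_frames, device=center.device)
        start = (center - length / 2).unsqueeze(1)  # (batch, 1)
        end = (center + length / 2).unsqueeze(1)    # (batch, 1)
        # Product of two sigmoids approximates a rectangular window over [start, end]
        mask = torch.sigmoid(self.sharpness * (positions - start)) * \
               torch.sigmoid(self.sharpness * (end - positions))
        return mask  # (batch, num_frames), values in (0, 1)

# Usage: gate encoder outputs before the captioning decoder cross-attends to them.
batch, num_frames, dim = 2, 100, 512
encoder_states = torch.randn(batch, num_frames, dim)      # from the video encoder
center = torch.tensor([0.4, 0.7], requires_grad=True)     # proposal centers
length = torch.tensor([0.2, 0.3], requires_grad=True)     # proposal lengths
mask = DifferentiableProposalMask()(center, length, num_frames)
masked_states = encoder_states * mask.unsqueeze(-1)       # what the caption decoder attends to
```

Because the mask is a smooth function of the proposal parameters, the captioning objective can influence the proposal decoder directly, which is what makes the otherwise two-stage pipeline trainable end to end.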

URL

https://arxiv.org/abs/1804.00819

PDF

https://arxiv.org/pdf/1804.00819.pdf

