Streamlined Dense Video Captioning

2019-04-08 07:17:30
Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han

Abstract

Dense video captioning is an extremely challenging task, since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent, since they fail to consider temporal dependencies between events. To tackle this challenge, we propose a novel dense video captioning framework that explicitly models temporal dependencies across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network that adaptively selects a sequence of event proposals, and 2) feeding the selected sequence to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both the event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
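The two-level reward scheme is the part of the abstract most easily made concrete. Below is a minimal, schematic Python sketch of how an event-level reward (scored per generated sentence) and an episode-level reward (scored on the concatenated paragraph) might be combined into per-event returns for REINFORCE-style training. All names here (`event_reward`, `episode_reward`, `two_level_returns`, the `weight` mixing coefficient, and the token-overlap scorer) are hypothetical stand-ins, not the authors' implementation; in the paper the rewards would come from standard captioning metrics such as METEOR.

```python
from typing import List

def event_reward(sentence: str, reference: str) -> float:
    # Stand-in for a per-event captioning metric (e.g., METEOR).
    # Crude token overlap, for illustration only.
    pred, ref = set(sentence.split()), set(reference.split())
    return len(pred & ref) / max(len(ref), 1)

def episode_reward(sentences: List[str], references: List[str]) -> float:
    # Stand-in for an episode-level score over the whole paragraph,
    # which is what encourages coherent, non-redundant storytelling.
    return event_reward(" ".join(sentences), " ".join(references))

def two_level_returns(sentences: List[str], references: List[str],
                      weight: float = 0.5) -> List[float]:
    # Mix each event's own reward with the shared episode reward; the
    # resulting per-event returns would scale the policy-gradient
    # (REINFORCE) update for that sentence. `weight` is an assumption
    # of this sketch, not a value from the paper.
    ep = episode_reward(sentences, references)
    return [(1.0 - weight) * event_reward(s, r) + weight * ep
            for s, r in zip(sentences, references)]

if __name__ == "__main__":
    preds = ["a man mixes batter in a bowl", "he pours it into a pan"]
    refs = ["a man mixes batter in a bowl", "he pours the batter into a pan"]
    print(two_level_returns(preds, refs))
```

The design point the sketch illustrates: because the episode-level term is shared across all events in the selected sequence, every sentence has a stake in the quality of the whole story, which is what discourages redundant or mutually inconsistent captions.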

URL

https://arxiv.org/abs/1904.03870

PDF

https://arxiv.org/pdf/1904.03870.pdf

