Abstract
Dense video captioning is an extremely challenging task, since accurate and coherent description of events in a video requires holistic understanding of the video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then generating captions for a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent, since they fail to consider temporal dependencies between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependencies across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to adaptively select a sequence of event proposals, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at the event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
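The two-stage pipeline described above can be illustrated with a minimal, purely schematic sketch. All names here are hypothetical and the logic is a deterministic stand-in: the paper's actual event sequence generation network and reinforcement-learned captioner are replaced by a greedy non-overlapping proposal selector and a template captioner that carries forward the context of prior events.

```python
# Hypothetical sketch of the abstract's two stages (not the authors' code):
# (1) adaptively select an ordered subset of event proposals,
# (2) caption them sequentially, conditioning on prior events' context.

from dataclasses import dataclass
from typing import List

@dataclass
class EventProposal:
    start: float   # start time (seconds)
    end: float     # end time (seconds)
    score: float   # proposal confidence

def select_event_sequence(proposals: List[EventProposal],
                          max_events: int = 3,
                          min_gap: float = 0.5) -> List[EventProposal]:
    """Stand-in for the event sequence generation network: greedily keep
    high-scoring proposals that do not overlap already-chosen ones, then
    return them in temporal order."""
    chosen: List[EventProposal] = []
    for p in sorted(proposals, key=lambda p: p.score, reverse=True):
        if len(chosen) >= max_events:
            break
        if all(p.start >= c.end + min_gap or p.end <= c.start - min_gap
               for c in chosen):
            chosen.append(p)
    return sorted(chosen, key=lambda p: p.start)

def caption_sequence(events: List[EventProposal]) -> List[str]:
    """Stand-in for the sequential captioning network: each caption is
    generated with access to the running context of earlier captions,
    which is what discourages redundant or inconsistent sentences."""
    context: List[str] = []
    captions: List[str] = []
    for i, e in enumerate(events):
        prefix = "First," if not context else "Then,"
        cap = f"{prefix} event {i} occurs from {e.start:.1f}s to {e.end:.1f}s."
        captions.append(cap)
        context.append(cap)  # prior captions feed the next step's context
    return captions

proposals = [EventProposal(0.0, 4.0, 0.9),
             EventProposal(3.5, 6.0, 0.6),  # overlaps the first; dropped
             EventProposal(6.0, 9.0, 0.8)]
events = select_event_sequence(proposals)
print(caption_sequence(events))
```

In the paper, stage (2) is additionally trained with reinforcement learning whose reward combines an event-level term (quality of each caption) and an episode-level term (coherence of the whole story); the sketch omits training entirely.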
URL
https://arxiv.org/abs/1904.03870