Abstract
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: first, a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory is of a fixed size; second, a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at this https URL.
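The abstract describes the memory module only at a high level: incoming tokens are clustered so the stored state stays at a fixed size no matter how long the video is. The sketch below is a minimal NumPy illustration of one way such a clustering memory could work; the names (ClusteringMemory, kmeans, num_slots), the slot count, and the choice to seed K-means from the first slots are assumptions for illustration, not the paper's actual implementation.

    import numpy as np

    def kmeans(points, centers, iters=5):
        """Plain K-means over token vectors; returns updated centers."""
        centers = centers.copy()
        for _ in range(iters):
            # Squared distance from every point to every center: (N, K).
            dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for k in range(centers.shape[0]):
                members = points[assign == k]
                if members.shape[0] > 0:
                    centers[k] = members.mean(axis=0)
        return centers

    class ClusteringMemory:
        """Fixed-size token memory: new frame tokens are merged with the
        current memory and re-clustered into num_slots centers, so the
        memory cost is constant in the video length (an assumed design,
        sketching the clustering idea named in the abstract)."""

        def __init__(self, num_slots=64):
            self.num_slots = num_slots
            self.memory = None  # (num_slots, dim) once full

        def update(self, frame_tokens):
            # frame_tokens: (num_tokens, dim) for the current frame(s).
            if self.memory is None:
                pool = frame_tokens
            else:
                pool = np.concatenate([self.memory, frame_tokens], axis=0)
            if pool.shape[0] <= self.num_slots:
                self.memory = pool
            else:
                # Seed with the first num_slots tokens, then re-cluster.
                self.memory = kmeans(pool, pool[: self.num_slots].copy())
            return self.memory

    # Usage: the memory stays at 64 tokens across 100 incoming frames.
    mem = ClusteringMemory(num_slots=64)
    for _ in range(100):
        tokens = np.random.randn(16, 256).astype(np.float32)
        state = mem.update(tokens)
    print(state.shape)  # (64, 256), independent of video length

The point of this design, as the abstract states it, is that the decoder only ever attends to a constant-size memory, which is what makes arbitrarily long inputs and intermediate (streaming) predictions feasible.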
URL
https://arxiv.org/abs/2404.01297