Abstract
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: first, a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory is of a fixed size; second, a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at this https URL.
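The abstract describes the memory module only at a high level: incoming tokens are clustered so the stored state stays at a fixed size no matter how long the video is. The sketch below is a minimal NumPy illustration of one way such a clustering memory could work; the names (ClusteringMemory, kmeans, num_slots), the slot count, and the choice to seed K-means from the first slots are assumptions for illustration, not the paper's actual implementation.

    import numpy as np

    def kmeans(points, centers, iters=5):
        """Plain K-means over token vectors; returns updated centers."""
        centers = centers.copy()
        for _ in range(iters):
            # Squared distance from every point to every center: (N, K).
            dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for k in range(centers.shape[0]):
                members = points[assign == k]
                if members.shape[0] > 0:
                    centers[k] = members.mean(axis=0)
        return centers

    class ClusteringMemory:
        """Fixed-size token memory: new frame tokens are merged with the
        current memory and re-clustered into num_slots centers, so the
        memory cost is constant in the video length (an assumed design,
        sketching the clustering idea named in the abstract)."""

        def __init__(self, num_slots=64):
            self.num_slots = num_slots
            self.memory = None  # (num_slots, dim) once full

        def update(self, frame_tokens):
            # frame_tokens: (num_tokens, dim) for the current frame(s).
            if self.memory is None:
                pool = frame_tokens
            else:
                pool = np.concatenate([self.memory, frame_tokens], axis=0)
            if pool.shape[0] <= self.num_slots:
                self.memory = pool
            else:
                # Seed with the first num_slots tokens, then re-cluster.
                self.memory = kmeans(pool, pool[: self.num_slots].copy())
            return self.memory

    # Usage: the memory stays at 64 tokens across 100 incoming frames.
    mem = ClusteringMemory(num_slots=64)
    for _ in range(100):
        tokens = np.random.randn(16, 256).astype(np.float32)
        state = mem.update(tokens)
    print(state.shape)  # (64, 256), independent of video length

The point of this design, as the abstract states it, is that the decoder only ever attends to a constant-size memory, which is what makes arbitrarily long inputs and intermediate (streaming) predictions feasible.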
URL
https://arxiv.org/abs/2404.01297