Abstract
Video Captioning (VC) is a challenging multi-modal task, since it requires describing a scene in language by understanding diverse and complex videos. For machines, the traditional VC pipeline follows an "imaging-compression-decoding-then-captioning" process, where compression is pivotal for storage and transmission. However, such a pipeline has inevitable shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process for captioning. To address these problems, in this paper we propose a novel VC pipeline, dubbed SnapCap, that generates captions directly from a compressed measurement, which can be captured by a snapshot compressive sensing camera. More specifically, by simulating the sensing signal, we can obtain abundant measurement-video-annotation data pairs for our model. Besides, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from raw videos via a pre-trained CLIP, whose rich language-vision associations guide the learning of our SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, our SnapCap runs at least 3$\times$ faster while achieving better caption results.
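The "compressed measurement" mentioned in the abstract can be illustrated with the standard snapshot compressive imaging (SCI) forward model, in which a sequence of frames is modulated by per-frame binary masks and summed into a single 2-D snapshot. The sketch below is a minimal simulation under that common model; the function name, mask construction, and dimensions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def simulate_snapshot_measurement(frames, masks):
    """Compress T frames of shape (T, H, W) into one 2-D measurement (H, W).

    Standard SCI forward model: y = sum_t (C_t * x_t), where C_t are
    per-frame binary coding masks. Illustrative sketch, not the paper's code.
    """
    assert frames.shape == masks.shape
    return (frames * masks).sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
frames = rng.random((T, H, W))                               # simulated video clip
masks = (rng.random((T, H, W)) > 0.5).astype(frames.dtype)   # random binary masks
y = simulate_snapshot_measurement(frames, masks)
print(y.shape)  # (64, 64)
```

Under this model, a captioning network like the one the abstract describes would take `y` (one snapshot) as input instead of the full `T`-frame clip, which is the source of the storage and speed advantage.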
URL
https://arxiv.org/abs/2401.04903