Abstract
Video Captioning (VC) is a challenging multi-modal task, since it requires describing a scene in language by understanding diverse and complex videos. For machines, the traditional VC pipeline follows an "imaging-compression-decoding-then-captioning" process, where compression is pivotal for storage and transmission. However, such a pipeline has inevitable shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process for captioning. To address these problems, in this paper we propose a novel VC pipeline, dubbed SnapCap, that generates captions directly from a compressed measurement, which can be captured by a snapshot compressive sensing camera. More specifically, by simulating the sensing signal, we can obtain abundant measurement-video-annotation data pairs for our model. Besides, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from raw videos via a pre-trained CLIP, whose rich language-vision associations guide the learning of our SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, our SnapCap runs at least 3$\times$ faster while achieving better caption results.
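The "compressed measurement" mentioned in the abstract can be illustrated with the standard snapshot compressive imaging (SCI) forward model, in which a sequence of frames is modulated by per-frame binary masks and summed into a single 2-D snapshot. The sketch below is a minimal simulation under that common model; the function name, mask construction, and dimensions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def simulate_snapshot_measurement(frames, masks):
    """Compress T frames of shape (T, H, W) into one 2-D measurement (H, W).

    Standard SCI forward model: y = sum_t (C_t * x_t), where C_t are
    per-frame binary coding masks. Illustrative sketch, not the paper's code.
    """
    assert frames.shape == masks.shape
    return (frames * masks).sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
frames = rng.random((T, H, W))                               # simulated video clip
masks = (rng.random((T, H, W)) > 0.5).astype(frames.dtype)   # random binary masks
y = simulate_snapshot_measurement(frames, masks)
print(y.shape)  # (64, 64)
```

Under this model, a captioning network like the one the abstract describes would take `y` (one snapshot) as input instead of the full `T`-frame clip, which is the source of the storage and speed advantage.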
URL
https://arxiv.org/abs/2401.04903