Paper Reading AI Learner

SnapCap: Efficient Snapshot Compressive Video Captioning

2024-01-10 03:11:21
Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Bo Chen, Xin Yuan

Abstract

Video Captioning (VC) is a challenging multi-modal task, since it requires describing scenes in language by understanding diverse and complex videos. For machines, the traditional VC pipeline follows "imaging-compression-decoding-and-then-captioning", where compression is pivotal for storage and transmission. However, such a pipeline has inevitable shortcomings: information redundancy, which results in low efficiency, and information loss during the sampling process, which harms captioning. To address these problems, in this paper we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera; we dub our model SnapCap. More specifically, benefiting from signal simulation, we can obtain abundant measurement-video-annotation data pairs for training our model. Besides, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pre-trained CLIP model, whose plentiful language-vision associations guide the learning of our SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely-used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, our SnapCap runs at least 3$\times$ faster while achieving better caption results.
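The abstract mentions simulating measurement-video pairs from a snapshot compressive sensing camera. A minimal sketch of the standard snapshot compressive imaging (SCI) forward model is shown below: each of B video frames is modulated by a binary mask and the modulated frames are summed into a single 2-D snapshot. The function name `simulate_sci_measurement` and the specific shapes are illustrative assumptions, not the paper's actual simulation code.

```python
import numpy as np

def simulate_sci_measurement(video, masks):
    """Simulate a snapshot compressive measurement.

    Standard SCI forward model: y = sum_t (M_t * x_t), where each frame
    x_t (shape H x W) is element-wise modulated by a binary mask M_t and
    the B modulated frames are summed into one 2-D snapshot y.
    (Illustrative sketch; not the authors' implementation.)
    """
    assert video.shape == masks.shape, "one mask per frame"
    return (video * masks).sum(axis=0)

rng = np.random.default_rng(0)
B, H, W = 8, 4, 4                                   # 8 frames -> 1 snapshot
video = rng.random((B, H, W))                       # synthetic video clip
masks = rng.integers(0, 2, size=(B, H, W)).astype(float)  # binary masks
y = simulate_sci_measurement(video, masks)
print(y.shape)  # (4, 4)
```

Pairing such simulated measurements `y` with the original frames and their caption annotations is what yields the measurement-video-annotation triplets the abstract refers to.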

URL

https://arxiv.org/abs/2401.04903

PDF

https://arxiv.org/pdf/2401.04903.pdf

