Paper Reading AI Learner

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

2023-11-28 22:57:17
Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu

Abstract

To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). A plethora of existing works have focused on leveraging large but cumbersome cross-modal architectures. Despite their effectiveness, such large architectures inevitably prevent the models from being deployed in real-world applications, so building a lightweight Video-Language (VL) architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes continuous visual signals into discrete labels. We show that with our simple MVM task and regular VL pre-training objectives, our E-ViLM, despite its compactness, is able to learn expressive representations from Video-Language corpora and generalize well to a wide range of Video-Language tasks, including video question answering and text-to-video retrieval. In particular, our E-ViLM achieves clear efficiency improvements, reaching competitive performance at a faster inference speed: our model attains $39.3\%$ Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4\%$ of the accuracy of a state-of-the-art, much larger VL architecture with only $15\%$ of its parameters and $94.8\%$ fewer GFLOPs. We also provide extensive ablation studies that validate the effectiveness of our proposed learning schema for E-ViLM.
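
As a concrete illustration of the MVM objective described above, the sketch below shows the general pattern in PyTorch: a frozen vector-quantized tokenizer assigns each video patch a discrete semantic label via nearest-neighbor codebook lookup, and the model is trained with cross-entropy to predict those labels at masked positions. Every name, shape, and hyperparameter here (`VQTokenizer`, `CODEBOOK_SIZE`, `MASK_RATIO`, the toy encoder) is an illustrative assumption, not the authors' actual implementation.

```python
# Minimal sketch of masked video modeling (MVM) with a semantic
# vector-quantized tokenizer. Hypothetical shapes/hyperparameters,
# not the E-ViLM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 1024   # assumed number of discrete semantic labels
EMBED_DIM = 256        # assumed patch-feature dimension
NUM_PATCHES = 196      # assumed patches per video clip
MASK_RATIO = 0.5       # assumed fraction of patches to mask

class VQTokenizer(nn.Module):
    """Discretizes continuous patch features into codebook indices."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)

    @torch.no_grad()
    def forward(self, feats):  # feats: (B, N, D)
        # Nearest codebook vector per patch = its discrete semantic label.
        codes = self.codebook.weight.expand(feats.size(0), -1, -1)
        return torch.cdist(feats, codes).argmin(dim=-1)  # (B, N)

def mvm_loss(encoder, head, tokenizer, patch_feats):
    """Cross-entropy at masked positions against frozen tokenizer labels."""
    B, N, _ = patch_feats.shape
    labels = tokenizer(patch_feats)             # (B, N) discrete targets
    mask = torch.rand(B, N) < MASK_RATIO        # random patch mask
    masked = patch_feats.clone()
    masked[mask] = 0.0                          # zero out masked patches
    logits = head(encoder(masked))              # (B, N, CODEBOOK_SIZE)
    return F.cross_entropy(logits[mask], labels[mask])

# Toy usage: a small MLP stands in for the video-language encoder.
encoder = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.GELU(),
                        nn.Linear(EMBED_DIM, EMBED_DIM))
head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)
tokenizer = VQTokenizer()
loss = mvm_loss(encoder, head, tokenizer,
                torch.randn(2, NUM_PATCHES, EMBED_DIM))
```

In the paper's setting this objective would be combined with the regular video-language pre-training objectives mentioned in the abstract; the tokenizer is pre-trained and kept frozen so that its codebook indices serve as stable semantic targets.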

URL

https://arxiv.org/abs/2311.17267

PDF

https://arxiv.org/pdf/2311.17267.pdf

