Paper Reading AI Learner

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

2023-11-28 22:57:17
Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu

Abstract

To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). A plethora of existing works have focused on leveraging large but cumbersome cross-modal architectures. Despite their effectiveness, such large architectures inevitably prevent the models from being deployed in real-world applications, so building a lightweight Video-Language (VL) architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes continuous visual signals into discrete labels. We show that with our simple MVM task and regular VL pre-training objectives, our E-ViLM, despite its compactness, is able to learn expressive representations from Video-Language corpora and generalize well to a wide range of Video-Language tasks, including video question answering and text-to-video retrieval. In particular, our E-ViLM achieves clear efficiency improvements, reaching competitive performance at a faster inference speed: our model attains $39.3\%$ Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4\%$ of the accuracy of a state-of-the-art, much larger VL architecture with only $15\%$ of its parameters and $94.8\%$ fewer GFLOPs. We also provide extensive ablation studies that validate the effectiveness of our proposed learning schema for E-ViLM.
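
As a concrete illustration of the MVM objective described above, the sketch below shows the general pattern in PyTorch: a frozen vector-quantized tokenizer assigns each video patch a discrete semantic label via nearest-neighbor codebook lookup, and the model is trained with cross-entropy to predict those labels at masked positions. Every name, shape, and hyperparameter here (`VQTokenizer`, `CODEBOOK_SIZE`, `MASK_RATIO`, the toy encoder) is an illustrative assumption, not the authors' actual implementation.

```python
# Minimal sketch of masked video modeling (MVM) with a semantic
# vector-quantized tokenizer. Hypothetical shapes/hyperparameters,
# not the E-ViLM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 1024   # assumed number of discrete semantic labels
EMBED_DIM = 256        # assumed patch-feature dimension
NUM_PATCHES = 196      # assumed patches per video clip
MASK_RATIO = 0.5       # assumed fraction of patches to mask

class VQTokenizer(nn.Module):
    """Discretizes continuous patch features into codebook indices."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)

    @torch.no_grad()
    def forward(self, feats):  # feats: (B, N, D)
        # Nearest codebook vector per patch = its discrete semantic label.
        codes = self.codebook.weight.expand(feats.size(0), -1, -1)
        return torch.cdist(feats, codes).argmin(dim=-1)  # (B, N)

def mvm_loss(encoder, head, tokenizer, patch_feats):
    """Cross-entropy at masked positions against frozen tokenizer labels."""
    B, N, _ = patch_feats.shape
    labels = tokenizer(patch_feats)             # (B, N) discrete targets
    mask = torch.rand(B, N) < MASK_RATIO        # random patch mask
    masked = patch_feats.clone()
    masked[mask] = 0.0                          # zero out masked patches
    logits = head(encoder(masked))              # (B, N, CODEBOOK_SIZE)
    return F.cross_entropy(logits[mask], labels[mask])

# Toy usage: a small MLP stands in for the video-language encoder.
encoder = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.GELU(),
                        nn.Linear(EMBED_DIM, EMBED_DIM))
head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)
tokenizer = VQTokenizer()
loss = mvm_loss(encoder, head, tokenizer,
                torch.randn(2, NUM_PATCHES, EMBED_DIM))
```

In the paper's setting this objective would be combined with the regular video-language pre-training objectives mentioned in the abstract; the tokenizer is pre-trained and kept frozen so that its codebook indices serve as stable semantic targets.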

URL

https://arxiv.org/abs/2311.17267

PDF

https://arxiv.org/pdf/2311.17267.pdf

