TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Abstract

The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it remains challenging for video models to reach this goal because of the increased uncertainty of spatiotemporal signals. To ease training, existing works leverage the prior knowledge of image foundation models and equip them with efficient temporal modules. Despite their satisfactory fine-tuning performance, we empirically find that these models fall short of out-of-the-box usage: their performance under zero-shot/linear protocols is even degraded compared to their baseline counterparts. In this work, we analyze the factor that leads to this degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal, degrading the video representation. To tackle this issue, we propose a degradation-free pre-training strategy that retains the generalization ability of the text encoder by freezing its shallow layers while allowing the tunable deep layers to capture task-related semantics. As for the training objective, we adopt the transcript sorting task from TVTS, combined with masking techniques, to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-art results on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at this https URL.
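
The partial-freezing idea in the abstract (frozen shallow layers, tunable deep layers) can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration: the `Layer` class, the `freeze_shallow_layers` helper, the 12-layer depth, and the choice to freeze half of the layers are hypothetical and are not taken from the paper or its released code. In a PyTorch setup, the analogous step would typically be setting `requires_grad = False` on the parameters of the shallow blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for one transformer block of a text encoder."""
    name: str
    trainable: bool = True


def freeze_shallow_layers(layers: List[Layer], num_frozen: int) -> List[Layer]:
    """Mark the first `num_frozen` layers as frozen and the rest as tunable."""
    if not 0 <= num_frozen <= len(layers):
        raise ValueError("num_frozen must be between 0 and the number of layers")
    for index, layer in enumerate(layers):
        # Shallow layers (low index) keep their pre-trained weights.
        # Deep layers (high index) remain trainable.
        layer.trainable = index >= num_frozen
    return layers


# Assumed configuration: a 12-layer encoder with the shallow half frozen.
encoder = [Layer(f"block_{i}") for i in range(12)]
freeze_shallow_layers(encoder, num_frozen=6)

# Names of the layers that would still receive gradient updates.
tunable = [layer.name for layer in encoder if layer.trainable]
print(tunable)
```

The split point (`num_frozen`) is the key design choice in such a scheme: freezing more layers preserves more of the pre-trained language generalization, while freezing fewer leaves more capacity for task-related semantics. The abstract does not state where the split is placed.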

URL

https://arxiv.org/abs/2305.14173

PDF

https://arxiv.org/pdf/2305.14173.pdf