Abstract
Pretrained vision-language models have proven effective for video understanding. However, recent studies have not sufficiently leveraged the essential temporal information in videos, simply averaging frame-wise representations or referencing only consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal-information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize it into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module leverages context tokens to generate informative prompts in the text modality. We conduct extensive experiments on zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of TC-CLIP. Ablation studies on TC and VP support our design choices. Code is available at this https URL
URL: https://arxiv.org/abs/2404.09490
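The Temporal Contextualization pipeline sketched in the abstract (select informative tokens per frame, summarize them into shared context tokens, and attend to those tokens during encoding) can be illustrated with a minimal NumPy toy. Everything below is an assumption for illustration only: the token-scoring rule (feature norm), the pooling into context tokens (chunked means), and the single-head attention are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_contextualization(frame_tokens, k=2, num_context=2):
    """Toy sketch of the TC idea (scoring/pooling choices are illustrative):
    1) keep the top-k "seed" tokens per frame,
    2) pool seeds gathered across all frames into a few context tokens,
    3) let each frame's tokens attend over [frame tokens | context tokens].
    frame_tokens: (T, N, D) array of per-frame token features.
    Returns contextualized tokens with the same (T, N, D) shape.
    """
    T, N, D = frame_tokens.shape
    # 1) informativeness proxy: L2 norm of each token feature (assumption)
    scores = np.linalg.norm(frame_tokens, axis=-1)                    # (T, N)
    idx = np.argsort(-scores, axis=1)[:, :k]                          # (T, k)
    seeds = np.take_along_axis(frame_tokens, idx[..., None], axis=1)  # (T, k, D)
    # 2) summarize all seeds into context tokens via chunked mean pooling
    seeds = seeds.reshape(T * k, D)
    chunks = np.array_split(seeds, num_context)
    context = np.stack([c.mean(axis=0) for c in chunks])       # (num_context, D)
    # 3) attention: frame tokens attend to their frame plus shared context
    out = np.empty_like(frame_tokens)
    for t in range(T):
        kv = np.concatenate([frame_tokens[t], context])        # (N + C, D)
        attn = softmax(frame_tokens[t] @ kv.T / np.sqrt(D))    # (N, N + C)
        out[t] = attn @ kv
    return out
```

The key structural point the sketch captures is that the context tokens are shared across all frames, so each frame's encoding can draw on information from the entire video rather than only from adjacent frames.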