Abstract
Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves the generalization of pre-trained VLMs by disentangling the learnable temporal modeling from the frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at this https URL
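The core idea described above, a learnable temporal module over frozen per-frame features that modulates frozen text embeddings directly in the shared embedding space, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name, the GRU temporal module, the additive conditioning, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class XMICAdapterSketch(nn.Module):
    """Hypothetical sketch of the X-MIC idea.

    A learnable temporal module aggregates frozen per-frame visual
    features; the resulting video context shifts frozen class-text
    embeddings in the shared embedding space, and classification is
    cosine similarity between the video feature and the adapted texts.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Learnable temporal modeling, kept separate from the (frozen)
        # visual encoder that produces the per-frame features.
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor, text_embeds: torch.Tensor):
        # frame_feats: (B, T, D) features from a frozen visual encoder
        # text_embeds: (C, D) frozen class-text embeddings
        _, h = self.temporal(frame_feats)            # h: (1, B, D)
        video_ctx = self.proj(h.squeeze(0))          # (B, D)
        # Condition each text embedding on the video, then L2-normalize.
        adapted = text_embeds.unsqueeze(0) + video_ctx.unsqueeze(1)  # (B, C, D)
        adapted = F.normalize(adapted, dim=-1)
        video = F.normalize(frame_feats.mean(dim=1), dim=-1)         # (B, D)
        # Cosine-similarity logits, one score per class.
        return torch.einsum("bd,bcd->bc", video, adapted)
```

Because only the adapter is trained while both encoders stay frozen, the pre-trained alignment of the VLM embedding space is preserved, which is what the abstract credits for the improved cross-dataset generalization.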
URL: https://arxiv.org/abs/2403.19811