Abstract
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
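Below is a minimal sketch of the skeleton-induction idea the abstract describes: a skeleton encoder aligned with the vision-language embedding space, trained collaboratively so that skeleton cues are absorbed by the RGB branch and no skeleton input is needed at inference. The module names, dimensions, and exact loss composition are assumptions for illustration only; the paper's actual SKI/SkeletonCLIP architecture and objectives may differ.

```python
# Hypothetical sketch of skeleton-induced collaborative training (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonEncoder(nn.Module):
    """Placeholder skeleton encoder (a GCN in practice); a simple MLP here."""
    def __init__(self, in_dim=75, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, skel):                  # skel: (B, T, in_dim), joints already flattened
        return self.net(skel.mean(dim=1))     # temporal average pooling -> (B, embed_dim)


def clip_style_contrastive(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings (CLIP-style)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def collaborative_step(video_emb, text_emb, skel_emb, alpha=0.5):
    """Assumed training objective: preserve the VLM's video-text alignment while
    pulling video features toward skeleton features, so skeleton information is
    infused into the RGB branch and skeletons are not required at inference."""
    loss_vt = clip_style_contrastive(video_emb, text_emb)   # standard vision-language term
    loss_vs = clip_style_contrastive(video_emb, skel_emb)   # skeleton-induction term (assumed)
    return loss_vt + alpha * loss_vs


if __name__ == "__main__":
    B, D = 8, 512
    skel_encoder = SkeletonEncoder(embed_dim=D)
    video_emb = torch.randn(B, D, requires_grad=True)        # stand-in for VLM video features
    text_emb = torch.randn(B, D)                              # stand-in for text features
    skel_emb = skel_encoder(torch.randn(B, 16, 75))           # 16 frames, 25 joints x 3 coords
    loss = collaborative_step(video_emb, text_emb, skel_emb)
    loss.backward()
    print(f"collaborative loss: {loss.item():.4f}")
```

At test time only the video (and text) branches would be used, which matches the abstract's claim that SKI models do not require skeleton data during inference.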
URL
https://arxiv.org/abs/2502.03459