Abstract
Building a cross-modal latent space between 3D human motion and language requires large-scale, high-quality human motion data. However, unlike image data, which is abundant, motion data is scarce, and this scarcity has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures and can be treated like the color image patches that ViT takes as input. We find that transfer learning with ViT weights pre-trained on 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance on text-to-motion retrieval benchmarks and on other novel, challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
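The abstract describes motion patches only at a high level, so the following is a minimal sketch of one way such patches could be built, not the authors' implementation. It assumes a motion sequence stored as a (frames, joints, 3) array, a hypothetical grouping of joint indices into five body parts, linear interpolation to a fixed number of points per part, min-max scaling so that the x, y, z coordinates play the role of three color channels, and a patch size of 16. All of these names and values (BODY_PARTS, resample_joints, motion_to_patches, patch_size) are illustrative assumptions.

```python
import numpy as np

# Hypothetical joint indices per body part, ordered along each kinematic chain.
# These indices are illustrative only and do not match any particular dataset.
BODY_PARTS = {
    "torso": [0, 1, 2, 3],
    "left_arm": [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg": [10, 11, 12],
    "right_leg": [13, 14, 15],
}


def resample_joints(part, size):
    """Linearly interpolate along the joint axis: (T, K, 3) -> (T, size, 3)."""
    num_frames, num_joints, _ = part.shape
    src = np.linspace(0.0, 1.0, num_joints)
    dst = np.linspace(0.0, 1.0, size)
    out = np.empty((num_frames, size, 3), dtype=np.float64)
    for t in range(num_frames):
        for c in range(3):
            out[t, :, c] = np.interp(dst, src, part[t, :, c])
    return out


def motion_to_patches(motion, body_parts=BODY_PARTS, patch_size=16):
    """Turn a (T, J, 3) motion into image-like patches.

    Returns an array of shape
    (num_windows, num_parts, patch_size, patch_size, 3),
    where rows index resampled joints, columns index frames,
    and the last axis holds scaled x, y, z as pseudo color channels.
    """
    motion = np.asarray(motion, dtype=np.float64)

    # Min-max scale each coordinate axis to [0, 1] (an assumed normalization).
    lo = motion.min(axis=(0, 1), keepdims=True)
    hi = motion.max(axis=(0, 1), keepdims=True)
    scaled = (motion - lo) / np.maximum(hi - lo, 1e-8)

    # Pad the time axis by repeating the last frame so it divides evenly.
    pad = (-scaled.shape[0]) % patch_size
    if pad:
        tail = np.repeat(scaled[-1:], pad, axis=0)
        scaled = np.concatenate([scaled, tail], axis=0)

    # One fixed-size strip per body part, so differing joint counts do not matter.
    strips = [resample_joints(scaled[:, idx, :], patch_size)
              for idx in body_parts.values()]
    stacked = np.stack(strips, axis=0)  # (parts, T_padded, patch_size, 3)

    num_parts, t_padded = stacked.shape[0], stacked.shape[1]
    windows = stacked.reshape(
        num_parts, t_padded // patch_size, patch_size, patch_size, 3
    )
    # Reorder to (window, part, joint, frame, channel).
    return windows.transpose(1, 0, 3, 2, 4)
```

Because every body part is resampled to the same number of points, skeletons with different joint counts map to patches of the same shape, which is one plausible reading of the robustness to varying skeleton structures mentioned above. The resulting patches could then be flattened and fed to a patch-embedding layer in the same way as image patches.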
URL
https://arxiv.org/abs/2405.04771