Abstract
Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data, particularly for online acquisition, has led to a surge in research on human motion. Prior work has mainly concentrated on dual-modality learning, such as text-motion tasks, while three-modality learning has rarely been explored. Intuitively, an additional modality can enrich a model's application scenarios, and, more importantly, a well-chosen extra modality can act as an intermediary that strengthens the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster alignment and synergy among the text, video, and motion modalities. Empirically, results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance on various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion, and motion-to-video.
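To make the three-modality alignment idea concrete, below is a minimal sketch of a CLIP-style tri-modal contrastive objective, where video serves as the intermediary between text and motion. Everything here is an illustrative assumption, not the authors' implementation: the module and parameter names (TriModalAligner, text_dim, video_dim, motion_dim, embed_dim) are hypothetical, the encoders are stand-in linear projections, and the paper's specially designed attention mechanism is omitted.

```python
# Hypothetical sketch of three-way contrastive alignment in the spirit of
# LAVIMO. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriModalAligner(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, motion_dim=263, embed_dim=256):
        super().__init__()
        # Placeholder projection heads; the paper uses full modality encoders.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        # Learnable temperature, initialized CLIP-style to log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    @staticmethod
    def info_nce(a, b, scale):
        # Symmetric InfoNCE between two batches of matched embeddings:
        # the i-th row of `a` and the i-th row of `b` form a positive pair.
        logits = scale * a @ b.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, text_feat, video_feat, motion_feat):
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        m = F.normalize(self.motion_proj(motion_feat), dim=-1)
        scale = self.logit_scale.exp()
        # Video as intermediary: aligning text-video and video-motion pulls
        # text and motion together; the direct text-motion term closes the loop.
        return (self.info_nce(t, m, scale) +
                self.info_nce(t, v, scale) +
                self.info_nce(v, m, scale))

# Usage with random features standing in for encoder outputs.
aligner = TriModalAligner()
loss = aligner(torch.randn(8, 768), torch.randn(8, 1024), torch.randn(8, 263))
loss.backward()
```

Once trained, retrieval in any direction (e.g. text-to-motion or motion-to-video) reduces to nearest-neighbor search in the shared embedding space.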
URL
https://arxiv.org/abs/2403.00691