Abstract
Recent advances in language models have demonstrated their ability to conduct multi-turn dialogues and retain conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversation into the control of continuous virtual human movements, generative human motion models can support an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, and other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous, long-term human motion from multimodal prompts. Specifically, MotionChain consists of multimodal tokenizers that transform various data types, such as text, images, and motion, into discrete tokens, coupled with a Vision-Motion-aware language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation as well as a more intuitive manner of controlling and interacting with virtual humans.
URL
https://arxiv.org/abs/2404.01700
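The abstract describes a pipeline in which continuous motion is quantized into discrete tokens that a language model can consume alongside text in a multi-turn conversation. The sketch below illustrates that general idea only; the function names, sentinel markers, and VQ-style nearest-codebook quantization are illustrative assumptions, not the paper's actual tokenizers or model.

```python
# Illustrative sketch (hypothetical names): a VQ-style codebook maps
# continuous motion features to discrete ids, which are then interleaved
# with conversation turns in a single text prompt.

def quantize_motion(frames, codebook):
    """Map each motion frame (a feature vector) to its nearest codebook id."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in frames]

def build_prompt(turns, motion_tokens):
    """Interleave multi-turn dialogue with discrete motion tokens.

    Motion tokens are wrapped in sentinel markers (<som>/<eom>, an assumed
    convention) so a language model can treat them as part of one shared
    vocabulary.
    """
    motion_str = " ".join(f"<motion_{t}>" for t in motion_tokens)
    history = "\n".join(f"{role}: {text}" for role, text in turns)
    return f"{history}\n<som> {motion_str} <eom>"

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy 3-entry codebook
frames = [(0.1, 0.1), (0.9, 0.1), (0.2, 0.8)]     # toy motion features
tokens = quantize_motion(frames, codebook)
prompt = build_prompt(
    [("user", "Walk forward."), ("user", "Now turn left.")], tokens)
```

In a real system the codebook would be learned (e.g. by a VQ-VAE) and the prompt fed to the language model, whose generated motion tokens would be decoded back into continuous poses.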