Abstract
Recent advances in language models have demonstrated their ability to conduct multi-turn dialogues and retain conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversation into the control of continuous virtual human movements, generative human motion models can support an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, and other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous, long-term human motion from multimodal prompts. Specifically, MotionChain consists of multimodal tokenizers that transform various data types, such as text, images, and motion, into discrete tokens, coupled with a Vision-Motion-aware language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation as well as a more intuitive manner of controlling and interacting with virtual humans.
URL
https://arxiv.org/abs/2404.01700
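The abstract describes a pipeline in which continuous motion is quantized into discrete tokens that a language model can consume alongside text in a multi-turn conversation. The sketch below illustrates that general idea only; the function names, sentinel markers, and VQ-style nearest-codebook quantization are illustrative assumptions, not the paper's actual tokenizers or model.

```python
# Illustrative sketch (hypothetical names): a VQ-style codebook maps
# continuous motion features to discrete ids, which are then interleaved
# with conversation turns in a single text prompt.

def quantize_motion(frames, codebook):
    """Map each motion frame (a feature vector) to its nearest codebook id."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in frames]

def build_prompt(turns, motion_tokens):
    """Interleave multi-turn dialogue with discrete motion tokens.

    Motion tokens are wrapped in sentinel markers (<som>/<eom>, an assumed
    convention) so a language model can treat them as part of one shared
    vocabulary.
    """
    motion_str = " ".join(f"<motion_{t}>" for t in motion_tokens)
    history = "\n".join(f"{role}: {text}" for role, text in turns)
    return f"{history}\n<som> {motion_str} <eom>"

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy 3-entry codebook
frames = [(0.1, 0.1), (0.9, 0.1), (0.2, 0.8)]     # toy motion features
tokens = quantize_motion(frames, codebook)
prompt = build_prompt(
    [("user", "Walk forward."), ("user", "Now turn left.")], tokens)
```

In a real system the codebook would be learned (e.g. by a VQ-VAE) and the prompt fed to the language model, whose generated motion tokens would be decoded back into continuous poses.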