Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

2024-04-17 13:33:09
Xinghan Wang, Zixi Kang, Yadong Mu

Abstract

Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks such as text-based motion generation, editing, and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), which aims to precisely localize the temporal segments corresponding to a given textual description within an untrimmed motion sequence. Capturing global temporal information is crucial for THMG; however, transformer-based models that rely on global temporal self-attention struggle with long untrimmed sequences due to their quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates global temporal context, language query control, and spatial graph topology at only linear memory cost. The core of the model is a text-controlled selection mechanism that dynamically incorporates global temporal information based on the text query. The model is further made topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that pairs detailed textual descriptions of human actions with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
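To make the abstract's "text-controlled selection mechanism" concrete, below is a minimal PyTorch sketch of a selective state-space (Mamba-style) recurrence whose selection parameters are conditioned on a text-query embedding. Everything here is an assumption for illustration, not the authors' implementation: the module name TextControlledSelectiveScan, the concatenation-based fusion of frame and text features, and all dimensions are hypothetical.

```python
# Hypothetical sketch of a text-controlled selective scan in the spirit of
# TM-Mamba's abstract. Not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextControlledSelectiveScan(nn.Module):
    """Selective state-space scan whose selection parameters (Delta, B, C)
    depend on both the motion frame and a text-query embedding, so the text
    query controls which global temporal information is propagated."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # State matrix A is parameterized in log space (common Mamba practice).
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
        # Selection parameters are predicted from the frame fused with the query.
        self.to_delta = nn.Linear(2 * d_model, d_model)
        self.to_B = nn.Linear(2 * d_model, d_state)
        self.to_C = nn.Linear(2 * d_model, d_state)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model) motion features; text: (batch, d_model) query.
        bsz, T, D = x.shape
        A = -torch.exp(self.A_log)                   # (D, N), negative real part
        h = x.new_zeros(bsz, D, self.A_log.shape[1]) # recurrent state, O(1) in T
        ys = []
        for t in range(T):                           # linear-time recurrent scan
            u = torch.cat([x[:, t], text], dim=-1)   # fuse frame with text query
            delta = F.softplus(self.to_delta(u))     # (B, D) input-dependent step
            Bt = self.to_B(u).unsqueeze(1)           # (B, 1, N) input gate
            Ct = self.to_C(u).unsqueeze(1)           # (B, 1, N) output gate
            # Zero-order-hold discretization of the continuous SSM.
            dA = torch.exp(delta.unsqueeze(-1) * A)  # (B, D, N)
            dB = delta.unsqueeze(-1) * Bt            # (B, D, N)
            h = dA * h + dB * x[:, t].unsqueeze(-1)  # state update
            ys.append((h * Ct).sum(-1))              # (B, D) readout
        return torch.stack(ys, dim=1)                # (B, T, d_model)

# Usage: per-frame features that a grounding head could score for segment
# boundaries matching the textual description.
model = TextControlledSelectiveScan(d_model=64)
motion = torch.randn(2, 120, 64)   # untrimmed motion sequence, 120 frames
query = torch.randn(2, 64)         # pooled text embedding of the description
out = model(motion, query)         # (2, 120, 64)
```

Because the recurrent state h has a fixed size regardless of sequence length, the scan needs only linear memory in the number of frames, in contrast to the quadratic cost of global self-attention; this is the property the abstract highlights for long untrimmed sequences.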

URL

https://arxiv.org/abs/2404.11375

PDF

https://arxiv.org/pdf/2404.11375.pdf

