Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

2024-04-17 13:33:09
Xinghan Wang, Zixi Kang, Yadong Mu

Abstract

Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks such as text-based motion generation, editing, and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), which aims to precisely localize the temporal segments corresponding to a given textual description within an untrimmed motion sequence. Capturing global temporal information is crucial for THMG; however, transformer-based models that rely on global temporal self-attention struggle with long untrimmed sequences due to their quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates global temporal context, language query control, and spatial graph topology at only linear memory cost. The core of the model is a text-controlled selection mechanism that dynamically incorporates global temporal information based on the text query. The model is further made topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that pairs detailed textual descriptions of human actions with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
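To make the abstract's "text-controlled selection mechanism" concrete, below is a minimal PyTorch sketch of a selective state-space (Mamba-style) recurrence whose selection parameters are conditioned on a text-query embedding. Everything here is an assumption for illustration, not the authors' implementation: the module name TextControlledSelectiveScan, the concatenation-based fusion of frame and text features, and all dimensions are hypothetical.

```python
# Hypothetical sketch of a text-controlled selective scan in the spirit of
# TM-Mamba's abstract. Not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextControlledSelectiveScan(nn.Module):
    """Selective state-space scan whose selection parameters (Delta, B, C)
    depend on both the motion frame and a text-query embedding, so the text
    query controls which global temporal information is propagated."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # State matrix A is parameterized in log space (common Mamba practice).
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
        # Selection parameters are predicted from the frame fused with the query.
        self.to_delta = nn.Linear(2 * d_model, d_model)
        self.to_B = nn.Linear(2 * d_model, d_state)
        self.to_C = nn.Linear(2 * d_model, d_state)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model) motion features; text: (batch, d_model) query.
        bsz, T, D = x.shape
        A = -torch.exp(self.A_log)                   # (D, N), negative real part
        h = x.new_zeros(bsz, D, self.A_log.shape[1]) # recurrent state, O(1) in T
        ys = []
        for t in range(T):                           # linear-time recurrent scan
            u = torch.cat([x[:, t], text], dim=-1)   # fuse frame with text query
            delta = F.softplus(self.to_delta(u))     # (B, D) input-dependent step
            Bt = self.to_B(u).unsqueeze(1)           # (B, 1, N) input gate
            Ct = self.to_C(u).unsqueeze(1)           # (B, 1, N) output gate
            # Zero-order-hold discretization of the continuous SSM.
            dA = torch.exp(delta.unsqueeze(-1) * A)  # (B, D, N)
            dB = delta.unsqueeze(-1) * Bt            # (B, D, N)
            h = dA * h + dB * x[:, t].unsqueeze(-1)  # state update
            ys.append((h * Ct).sum(-1))              # (B, D) readout
        return torch.stack(ys, dim=1)                # (B, T, d_model)

# Usage: per-frame features that a grounding head could score for segment
# boundaries matching the textual description.
model = TextControlledSelectiveScan(d_model=64)
motion = torch.randn(2, 120, 64)   # untrimmed motion sequence, 120 frames
query = torch.randn(2, 64)         # pooled text embedding of the description
out = model(motion, query)         # (2, 120, 64)
```

Because the recurrent state h has a fixed size regardless of sequence length, the scan needs only linear memory in the number of frames, in contrast to the quadratic cost of global self-attention; this is the property the abstract highlights for long untrimmed sequences.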

URL

https://arxiv.org/abs/2404.11375

PDF

https://arxiv.org/pdf/2404.11375.pdf

