HumMUSS: Human Motion Understanding using State Space Models

2024-04-16 19:59:21
Arnab Kumar Mondal, Stefano Alletto, Denis Tome

Abstract

Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real time, and do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits like adaptability to different video frame rates and enhanced training speed when working with longer sequences of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy.

URL

https://arxiv.org/abs/2404.10880

PDF

https://arxiv.org/pdf/2404.10880.pdf
