
RopeTP: Global Human Motion Recovery via Integrating Robust Pose Estimation with Diffusion Trajectory Prior

2024-10-27 07:19:39
Mingjiang Liang, Yongkang Cheng, Hualin Liang, Shaoli Huang, Wei Liu

Abstract

We present RopeTP, a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos. At the heart of RopeTP is a hierarchical attention mechanism that significantly improves context awareness, which is essential for accurately inferring the posture of occluded body parts. This is achieved by exploiting the relationships with visible anatomical structures, enhancing the accuracy of local pose estimations. The improved robustness of these local estimations allows for the reconstruction of precise and stable global trajectories. Additionally, RopeTP incorporates a diffusion trajectory model that predicts realistic human motion from local pose sequences. This model ensures that the generated trajectories are not only consistent with observed local actions but also unfold naturally over time, thereby improving the realism and stability of 3D human motion reconstruction. Extensive experimental validation shows that RopeTP surpasses current methods on two benchmark datasets, particularly excelling in scenarios with occlusions. It also outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.
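
To make the two-stage design concrete, below is a minimal, illustrative sketch (not the authors' released code) of the two components the abstract describes: a hierarchical attention block in which occluded joint queries attend to visible joints and then exchange context at the body-part level, and a diffusion denoiser that predicts the noise added to a global root trajectory conditioned on the local pose sequence. All module names, feature sizes, the joint-to-part grouping, and the tensor shapes are assumptions made purely for illustration.

# Illustrative sketch only; module names, dimensions, and joint grouping are assumed.
import torch
import torch.nn as nn


class HierarchicalJointAttention(nn.Module):
    """Joint-level attention over visible joints, followed by part-level attention."""

    def __init__(self, dim=256, heads=4, joints=24, parts=5):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.part_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Hypothetical assignment of the 24 joints to 5 body parts.
        self.register_buffer("part_id", torch.randint(0, parts, (joints,)))
        self.parts = parts

    def forward(self, joint_feats, visible_mask):
        # joint_feats: (B, J, dim); visible_mask: (B, J), True where the joint is visible.
        # Occluded joints may only query; keys/values are restricted to visible joints.
        pad = ~visible_mask  # MultiheadAttention ignores key positions marked True
        x, _ = self.joint_attn(joint_feats, joint_feats, joint_feats,
                               key_padding_mask=pad)
        # Pool joints into body parts and let parts exchange global context.
        B, J, D = x.shape
        part_feats = torch.zeros(B, self.parts, D, device=x.device)
        part_feats.index_add_(1, self.part_id, x)
        p, _ = self.part_attn(part_feats, part_feats, part_feats)
        # Broadcast each part's context back to the joints it contains.
        return x + p[:, self.part_id]


class TrajectoryDenoiser(nn.Module):
    """Predicts the noise added to a root trajectory, conditioned on local poses."""

    def __init__(self, pose_dim=24 * 6, traj_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + pose_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, noisy_traj, poses, t):
        # noisy_traj: (B, T, 3), poses: (B, T, pose_dim), t: (B,) diffusion step index.
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_traj.shape[1], 1)
        return self.net(torch.cat([noisy_traj, poses, t_emb], dim=-1))


if __name__ == "__main__":
    B, T, J, D = 2, 16, 24, 256
    attn = HierarchicalJointAttention(dim=D, joints=J)
    feats = torch.randn(B, J, D)
    visible = torch.rand(B, J) > 0.3  # simulate roughly 30% occluded joints
    refined = attn(feats, visible)
    denoiser = TrajectoryDenoiser()
    eps_hat = denoiser(torch.randn(B, T, 3), torch.randn(B, T, 24 * 6),
                       torch.randint(0, 1000, (B,)))
    print(refined.shape, eps_hat.shape)  # (2, 24, 256) (2, 16, 3)

The intent of this split, as described in the abstract, is that robust local poses are recovered first even under occlusion, and the diffusion prior then supplies a plausible, temporally smooth global trajectory consistent with those poses, rather than relying on SLAM-derived camera estimates and heavy post-hoc optimization.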

URL

https://arxiv.org/abs/2410.20358

PDF

https://arxiv.org/pdf/2410.20358.pdf

