Paper Reading AI Learner

DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video

2023-03-23 16:15:18
Ce Zheng, Guo-Jun Qi, Chen Chen

Abstract

Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. Compared to single image-based methods, video-based methods can exploit temporal information to further improve performance by incorporating human motion priors. However, many-to-many approaches such as VIBE suffer from poor motion smoothness and temporal inconsistency, while many-to-one approaches such as TCMR and MPS-Net rely on future frames, making them non-causal and time-inefficient during inference. To address these challenges, a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is presented. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. As a many-to-many approach, DDT outputs the human meshes of all input frames from its decoder, making it more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments on widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW) demonstrate the effectiveness and efficiency of DDT.
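The efficiency argument above can be made concrete with a toy count of forward passes. This is an illustrative sketch, not the DDT implementation: the window size and the pass-counting functions are assumptions introduced here to show why a many-to-one model (one output frame per windowed pass, consuming future frames) costs more at inference time than a many-to-many model (meshes for every frame in the window per pass).

```python
def many_to_one_passes(num_frames: int, window: int = 16) -> int:
    """Sliding-window many-to-one inference (TCMR/MPS-Net style):
    each forward pass consumes a full window, including future
    frames, but yields a mesh for only one target frame."""
    return num_frames  # one pass per output frame


def many_to_many_passes(num_frames: int, window: int = 16) -> int:
    """Many-to-many inference (DDT/VIBE style): each forward pass
    decodes meshes for every frame in the window."""
    return -(-num_frames // window)  # ceil(num_frames / window)


if __name__ == "__main__":
    n = 64
    print(many_to_one_passes(n))   # 64 passes
    print(many_to_many_passes(n))  # 4 passes
```

For a 64-frame clip with a 16-frame window, the many-to-one scheme needs 64 forward passes while the many-to-many scheme needs only 4, which is the time-efficiency advantage the abstract claims for DDT.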

URL

https://arxiv.org/abs/2303.13397

PDF

https://arxiv.org/pdf/2303.13397.pdf

