Paper Reading AI Learner

X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

2025-08-12 22:47:20
Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Linjie Luo

Abstract

We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens -- one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion--identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.
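The abstract's central mechanism, extracting a compact set of four disentangled motion tokens (face, body, and each hand) from a single image, can be sketched as a cross-attention pooling step in which one learnable query per token attends over image features. The paper's actual encoder architecture is not described in this abstract, so every function name, dimension, and the pooling scheme below are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_motion(image_feats, queries):
    """Hypothetical cross-attention pooling: each of the 4 learnable
    queries attends over the flattened image feature sequence and
    yields one motion token.

    image_feats: (N, C) flattened feature grid from some image backbone
    queries:     (4, C) learnable queries, one per motion token
    returns:     (4, C) tokens [face, body, left hand, right hand]
    """
    scale = np.sqrt(queries.shape[1])
    attn = softmax(queries @ image_feats.T / scale)  # (4, N) attention weights
    return attn @ image_feats                        # (4, C) pooled tokens

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 256))    # e.g. an 8x8 grid of 256-d features
queries = rng.standard_normal((4, 256))   # one query per motion token
tokens = encode_motion(feats, queries)
print(tokens.shape)  # (4, 256)
```

In a full system along the lines the abstract describes, these four tokens would condition a DiT-based video generator, and the whole pipeline would be trained end to end with the disentanglement augmentations and auxiliary decoders mentioned above.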

URL

https://arxiv.org/abs/2508.09383

PDF

https://arxiv.org/pdf/2508.09383.pdf

