Paper Reading AI Learner

PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

2025-12-15 16:03:26
Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi, Lizhuang Ma

Abstract

Pose-guided video generation refers to controlling the motion of subjects in a generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, and thus generalize poorly to the poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters and supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce the Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
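The abstract does not give the exact guidance formula, so the sketch below is only one plausible reading of "separately injecting subject and camera motion control information into the positive and negative anchors of CFG": a three-anchor classifier-free-guidance combination in which the camera-motion and subject-pose guidance strengths can be tuned independently. All function and parameter names here are hypothetical.

```python
def decoupled_cfg(eps_pose_cam, eps_cam_only, eps_uncond,
                  w_subject, w_camera):
    """Hypothetical sketch of Subject and Camera Motion Decoupled CFG.

    Standard CFG uses two anchors:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    Here we assume an intermediate anchor conditioned on camera motion
    only, so camera guidance (w_camera) and subject-pose guidance
    (w_subject) are applied as two separate guidance terms.
    """
    return (eps_uncond
            + w_camera * (eps_cam_only - eps_uncond)      # camera-motion term
            + w_subject * (eps_pose_cam - eps_cam_only))  # subject-pose term
```

With both weights set to 1 the combination collapses to the fully conditioned prediction, matching the behavior of ordinary CFG at guidance scale 1.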

Abstract (translated)

Pose-guided video generation refers to controlling the motion of subjects in a generated video through a sequence of poses. This technique enables precise control over subject motion and has important applications in animation production. However, current pose-guided video generation methods can only accept human poses as input, which leads to poor generalization when handling the poses of other creatures or non-human characters. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework, capable of handling both human and non-human characters and supporting inputs with arbitrary skeletal structures. To strengthen consistency preservation during motion, we introduce the Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes correspondences between the parts, and computes cross-attention between corresponding parts across frames, achieving fine-grained part-level coherence. In addition, we propose a new guidance strategy, Subject and Camera Motion Decoupled CFG, which for the first time enables independent control of camera movement in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Finally, we release XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, together with an automated annotation and filtering pipeline. Extensive experiments show that PoseAnything significantly outperforms existing state-of-the-art methods in both effectiveness and generalization.
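The Part-aware Temporal Coherence Module described above restricts attention so that each subject part attends only to its corresponding part in other frames. The paper's actual architecture is not given here, so the following is a minimal sketch under that assumption, with hypothetical names throughout: each part's query features attend to the matching part's key/value features from a reference frame via standard scaled dot-product cross-attention.

```python
import numpy as np

def part_aware_cross_attention(q_parts, kv_parts):
    """Hypothetical sketch of part-level temporal cross-attention.

    q_parts:  dict mapping part name -> (n_q, d) features, current frame
    kv_parts: dict mapping part name -> (n_kv, d) features, reference frame
    Each part attends ONLY to its corresponding part across frames,
    which is the fine-grained part-level consistency idea.
    """
    out = {}
    for name, q in q_parts.items():
        kv = kv_parts[name]                    # corresponding part's features
        d = q.shape[-1]
        scores = q @ kv.T / np.sqrt(d)         # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)      # softmax over kv tokens
        out[name] = attn @ kv                  # attended part features
    return out
```

Restricting each query to its own part's keys is what distinguishes this from full-frame cross-attention, at the cost of requiring part correspondences to be established first.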

URL

https://arxiv.org/abs/2512.13465

PDF

https://arxiv.org/pdf/2512.13465.pdf
