Abstract
Pose-guided video generation refers to controlling the motion of subjects in a generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods accept only human poses as input and therefore generalize poorly to the poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce the Part-aware Temporal Coherence Module, which divides the subject into parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation by injecting subject and camera motion control information separately into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
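One plausible reading of the decoupled CFG described above (a speculative sketch, not the paper's actual implementation): the negative anchor is conditioned on camera motion only, the positive anchor on camera plus subject pose, so the standard CFG update amplifies only the subject-motion signal while camera control is present in both branches. The function name `decoupled_cfg` and the toy tensors below are illustrative assumptions.

```python
import numpy as np

def decoupled_cfg(eps_neg, eps_pos, guidance_scale=5.0):
    """Standard CFG combination: move from the negative anchor toward the
    positive anchor. Here eps_neg is assumed to carry camera-motion
    conditioning only, and eps_pos carries camera + subject-pose conditioning,
    so the amplified difference isolates subject motion."""
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

# Toy stand-ins for denoiser outputs on a (frames, channels, H, W) latent.
rng = np.random.default_rng(0)
eps_camera_only = rng.normal(size=(2, 4, 8, 8))      # negative anchor
eps_camera_subject = eps_camera_only + 0.1           # positive anchor (hypothetical)

eps = decoupled_cfg(eps_camera_only, eps_camera_subject, guidance_scale=5.0)
```

With `guidance_scale=1.0` the combination reduces to the fully conditioned prediction, and with `0.0` to the camera-only prediction, which is the usual sanity check for a CFG formula.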
URL
https://arxiv.org/abs/2512.13465