Abstract
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but it remains difficult under the frequent occlusions of real-world environments. Regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets; (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses; and (iii) a video-conditioned masked transformer that fuses the motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
URL
https://arxiv.org/abs/2601.16079