Abstract
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than matching detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets, leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at this https URL.
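The abstract describes an online loop: a per-frame detector initializes tracks, and a learned update then refreshes each tracked pose directly from the next image, rather than detecting independently and matching detections across frames. A minimal sketch of that control flow, with `detect_poses`, `update_poses`, and `Track` as hypothetical stand-ins (the paper's actual modules are learned networks, not these placeholders):

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Track:
    """Hypothetical track record: a stable identity plus the current 3D pose."""
    track_id: int
    pose: list  # placeholder for 3D pose parameters

_ids = count()

def detect_poses(frame):
    """Hypothetical per-frame detector stand-in: one pose per 'person' in the frame."""
    return [[float(v)] for v in frame]

def update_poses(tracks, frame):
    """Hypothetical learned pose update: each existing track's pose is refreshed
    directly from the new image, so identities persist without any
    detection-matching step (the property the abstract claims enables
    online tracking through occlusion)."""
    for t in tracks:
        if t.track_id < len(frame):
            t.pose = [float(frame[t.track_id])]
    return tracks

def track_stream(frames):
    """Online loop: detect to initialize, then update from each new frame."""
    tracks = []
    for i, frame in enumerate(frames):
        if i == 0:
            tracks = [Track(next(_ids), p) for p in detect_poses(frame)]
        else:
            tracks = update_poses(tracks, frame)
    return tracks
```

The point of the sketch is only the data flow: track identities are assigned once and carried forward by the update step, so no cross-frame association is ever performed.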
URL
https://arxiv.org/abs/2504.12186