Abstract
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, and person re-identification) play a key role in industrial applications of visual models. While each human-centric task has its own relevant semantic aspects to focus on, these tasks also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which handles a wide range of human-centric tasks in a simplified end-to-end manner with a plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP outperforms strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.
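The abstract describes a single plain ViT backbone shared across all human-centric tasks, trained jointly end-to-end. Below is a minimal sketch of that idea: one shared transformer encoder over image patches, with lightweight task-specific query heads reading out per-task predictions. The class names, dimensions, and the query/cross-attention head design are illustrative assumptions for exposition, not the actual UniHCP implementation (see the paper linked below for the real architecture).

```python
# Minimal sketch (assumption-laden): a shared plain-ViT encoder plus
# per-task learnable queries, illustrating the "one backbone, many
# human-centric tasks" setup described in the abstract.
import torch
import torch.nn as nn


class SharedViTEncoder(nn.Module):
    """Plain ViT-style backbone shared by all tasks."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos_embed)


class TaskQueryHead(nn.Module):
    """Task-specific learnable queries cross-attend to the shared tokens."""
    def __init__(self, dim=384, num_queries=8, out_dim=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=6, batch_first=True)
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, tokens):
        q = self.queries.expand(tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, tokens, tokens)
        return self.proj(attended)  # (B, num_queries, out_dim)


if __name__ == "__main__":
    backbone = SharedViTEncoder()
    heads = {  # one lightweight head per task; output sizes are placeholders
        "parsing": TaskQueryHead(out_dim=20),     # hypothetical part-class logits
        "attributes": TaskQueryHead(out_dim=26),  # hypothetical attribute logits
        "reid": TaskQueryHead(out_dim=256),       # hypothetical embedding dim
    }
    imgs = torch.randn(2, 3, 224, 224)
    tokens = backbone(imgs)
    outputs = {task: head(tokens) for task, head in heads.items()}
    for task, out in outputs.items():
        print(task, tuple(out.shape))
```

In this kind of setup, joint training over many datasets would backpropagate every task's loss through the same encoder, which is what lets the shared human-body structure benefit all tasks at once.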
URL
https://arxiv.org/abs/2303.02936