Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
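Since the abstract only outlines the two-stage design, the minimal data-flow sketch below may help; the stage modules (`low_res_tryon`, `residual_refiner`) are hypothetical single-layer stand-ins for the conditional diffusion stages, and the exact residual formulation is an assumption rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the two diffusion stages; the real models are
# conditional diffusion networks, not single conv layers.
low_res_tryon = nn.Conv2d(6, 3, 3, padding=1)     # stage 1: person + garment -> coarse try-on
residual_refiner = nn.Conv2d(6, 3, 3, padding=1)  # stage 2: coarse + garment -> residual

person  = torch.rand(1, 3, 1024, 768)   # target person image
garment = torch.rand(1, 3, 1024, 768)   # garment image

# Stage 1: work at a reduced scale so structural alignment dominates.
lr = F.interpolate(torch.cat([person, garment], dim=1), scale_factor=0.25, mode="bilinear")
coarse = low_res_tryon(lr)                                   # (1, 3, 256, 192)

# Stage 2: refine the residual between the two scales for texture fidelity.
coarse_up = F.interpolate(coarse, size=person.shape[-2:], mode="bilinear")
residual = residual_refiner(torch.cat([coarse_up, garment], dim=1))
result = coarse_up + residual                                # high-resolution try-on
print(result.shape)  # torch.Size([1, 3, 1024, 768])
```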
https://arxiv.org/abs/2506.00908
Human-centric perception is the core of diverse computer vision tasks and has been a long-standing research focus. However, previous research studied these human-centric tasks individually, so performance is largely limited by the size of the public task-specific datasets. Recent human-centric methods leverage additional modalities, e.g., depth, to learn fine-grained semantic information, but this limits the benefit of pretraining models due to their sensitivity to camera views and the scarcity of RGB-D data on the Internet. This paper improves the data scalability of human-centric pretraining methods by discarding depth information and exploring the semantic information of RGB images in the frequency space via the Discrete Cosine Transform (DCT). We further propose new annotation-denoising auxiliary tasks with keypoints and DCT maps to enforce the RGB image extractor to learn fine-grained semantic information of human bodies. Our extensive experiments show that when pretrained on large-scale datasets (COCO and AIC) without depth annotation, our model outperforms state-of-the-art methods by +0.5 mAP on COCO, +1.4 PCKh on MPII and -0.51 EPE on Human3.6M for pose estimation, by +4.50 mIoU on Human3.6M for human parsing, by -3.14 MAE on SHA and -0.07 MAE on SHB for crowd counting, by +1.1 F1 score on SHA and +0.8 F1 score on SHB for crowd localization, and by +0.1 mAP on Market1501 and +0.8 mAP on MSMT for person ReID. We also validate the effectiveness of our method on the MPII+NTURGBD datasets.
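A small sketch of turning an RGB image into block-wise DCT maps, the frequency-space signal the paper exploits instead of depth; the 8x8 block size and per-channel handling are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct_map(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Per-channel 2-D DCT over non-overlapping blocks (JPEG-style layout)."""
    h, w, c = img.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    out = np.empty((h, w, c), dtype=np.float32)
    for ch in range(c):
        for y in range(0, h, block):
            for x in range(0, w, block):
                out[y:y+block, x:x+block, ch] = dctn(
                    img[y:y+block, x:x+block, ch].astype(np.float32), norm="ortho")
    return out

rgb = np.random.rand(256, 192, 3).astype(np.float32)  # placeholder for an RGB crop
dct_map = blockwise_dct_map(rgb)
print(dct_map.shape)  # (256, 192, 3)
```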
https://arxiv.org/abs/2504.20800
Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
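The Text-Visual Consistency Regularizer is described only at a high level; below is a hedged sketch of one plausible form, a symmetric contrastive alignment between body-shape text embeddings and visual body-shape features. The InfoNCE-style formulation and temperature are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def text_visual_consistency(shape_text_emb: torch.Tensor,
                            shape_vis_feat: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (text, visual) body-shape pairs together, push others apart."""
    t = F.normalize(shape_text_emb, dim=-1)          # (B, D), e.g. CLIP text embeddings
    v = F.normalize(shape_vis_feat, dim=-1)          # (B, D), visual body-shape features
    logits = t @ v.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = text_visual_consistency(torch.randn(8, 512), torch.randn(8, 512))
```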
https://arxiv.org/abs/2504.18025
Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training Model (CLIP) have shown promising performance for segmentation and detection tasks. However, although SAM excels at fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. While CLIP exhibits strong semantic understanding by aligning the global features of language and vision, it falls short on fine-grained segmentation tasks. Human parsing requires segmenting human bodies into constituent parts and involves both accurate fine-grained segmentation and a high semantic understanding of each part. Based on the traits of SAM and CLIP, we formulate highly efficient modules that effectively integrate their features to benefit human parsing. We propose a Semantic-Refinement Module that integrates CLIP's semantic features with SAM features to benefit parsing. Moreover, we formulate a highly efficient Fine-tuning Module to adapt the pretrained SAM to human parsing, which needs high semantic information and simultaneously demands spatial details; this significantly reduces training time compared with full training while achieving notable performance. Extensive experiments demonstrate the effectiveness of our method on the LIP, PPP, and CIHP databases.
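A minimal sketch of how a semantic-refinement module could inject CLIP's global semantics into SAM's dense features; the channel sizes and the FiLM-style modulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticRefinement(nn.Module):
    """Modulate dense SAM features with a global CLIP embedding (FiLM-style)."""
    def __init__(self, sam_dim: int = 256, clip_dim: int = 512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, sam_dim)
        self.to_shift = nn.Linear(clip_dim, sam_dim)
        self.refine = nn.Conv2d(sam_dim, sam_dim, 3, padding=1)

    def forward(self, sam_feat: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # sam_feat: (B, C, H, W) dense features; clip_emb: (B, D) global semantics
        scale = self.to_scale(clip_emb)[..., None, None]
        shift = self.to_shift(clip_emb)[..., None, None]
        return self.refine(sam_feat * (1 + scale) + shift)

fused = SemanticRefinement()(torch.randn(2, 256, 64, 64), torch.randn(2, 512))
print(fused.shape)  # torch.Size([2, 256, 64, 64])
```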
https://arxiv.org/abs/2503.22237
Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively. The code is available at this https URL.
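Pixel-level clothing warping of the kind MCW performs can be expressed with a dense displacement field and `grid_sample`; the flow field below is a zero placeholder, since the paper estimates it progressively from multiple clothing attributes.

```python
import torch
import torch.nn.functional as F

def warp_clothing(cloth: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a garment image with per-pixel displacements (in pixels)."""
    b, _, h, w = cloth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float().expand(b, h, w, 2)
    coords = base + flow.permute(0, 2, 3, 1)              # displaced pixel coordinates
    # normalise to [-1, 1] for grid_sample
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    return F.grid_sample(cloth, coords, mode="bilinear", align_corners=True)

cloth = torch.rand(1, 3, 256, 192)
flow = torch.zeros(1, 2, 256, 192)       # placeholder: MCW would predict this field
print(warp_clothing(cloth, flow).shape)  # torch.Size([1, 3, 256, 192])
```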
https://arxiv.org/abs/2503.12588
Gait recognition has emerged as a robust biometric modality due to its non-intrusive nature and resilience to occlusion. Conventional gait recognition methods typically rely on silhouettes or skeletons. Despite their success in controlled laboratory environments, they usually fail in real-world scenarios because of the limited information entropy of their gait representations. To achieve accurate gait recognition in the wild, we propose a novel gait representation, named Parsing Skeleton. This representation introduces a skeleton-guided human parsing method to capture fine-grained body dynamics, so it carries much higher information entropy for encoding the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the parsing skeleton representation, we propose a novel parsing skeleton-based gait recognition framework, named PSGait, which takes parsing skeletons and silhouettes as input. By fusing these two modalities, the resulting image sequences are fed into gait recognition models for enhanced individual differentiation. We conduct comprehensive benchmarks on various datasets to evaluate our model. PSGait outperforms existing state-of-the-art multimodal methods. Furthermore, as a plug-and-play method, PSGait leads to a maximum improvement of 10.9% in Rank-1 accuracy across various gait recognition models. These results demonstrate the effectiveness and versatility of parsing skeletons for gait recognition in the wild, establishing PSGait as a new state-of-the-art approach for multimodal gait recognition.
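Below is a hedged sketch of the modality fusion PSGait describes, packing the parsing-skeleton map and the binary silhouette into one frame that a silhouette-style gait backbone can consume; the exact packing (separate channels here) and the frame size are assumptions.

```python
import numpy as np

def fuse_frame(parsing_skeleton: np.ndarray, silhouette: np.ndarray) -> np.ndarray:
    """Stack a part-labelled parsing-skeleton map and a binary silhouette.

    parsing_skeleton: (H, W) uint8 with per-part labels (0 = background)
    silhouette:       (H, W) uint8 binary mask
    Returns an (H, W, 2) frame usable as a 2-channel gait input.
    """
    return np.stack([parsing_skeleton, silhouette * 255], axis=-1).astype(np.uint8)

seq = [fuse_frame(np.random.randint(0, 20, (64, 44), dtype=np.uint8),
                  np.random.randint(0, 2, (64, 44), dtype=np.uint8))
       for _ in range(30)]                      # a 30-frame fused gait sequence
print(np.stack(seq).shape)  # (30, 64, 44, 2)
```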
https://arxiv.org/abs/2503.12047
Virtual try-on methods based on diffusion models achieve realistic try-on effects. They use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, bringing long inference times. In this work, with the development of the diffusion transformer (DiT), we rethink the necessity of the reference network or image encoder and propose MC-VTON, which enables DiT to integrate minimal conditional try-on inputs by utilizing its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder. We also remove unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; we require only the masked person image and the garment image. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply distillation diffusion to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
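As a quick consistency check on the reported parameter budgets, backing out the implied backbone size from each (parameters, percentage) pair gives roughly the same number, so both figures refer to the same backbone; no external parameter count is assumed here.

```python
# Implied backbone size from each pair of figures quoted in the abstract.
print(39.7e6 / 0.0033 / 1e9)   # ~12.0 -> implies a ~12B-parameter backbone
print(86.8e6 / 0.0072 / 1e9)   # ~12.1 -> consistent with the same backbone
```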
https://arxiv.org/abs/2501.03630
The gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, making it a promising technique for unrestrained human identification. By largely excluding gait-unrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have long been two of the most prevalent gait modalities. Recently, several attempts have been made to introduce more informative data forms, such as human parsing and optical flow images, to capture gait characteristics, along with multi-branch architectures. However, due to the inconsistency within model designs and experiment settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, covering representational capacity and fusion-strategy exploration, is still lacking. From the perspectives of fine- vs. coarse-grained shape and whole- vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a C$^2$Fusion strategy and consequently build our new framework, MultiGait++. C$^2$Fusion preserves commonalities while highlighting differences to enrich the learning of gait features. To verify our findings and conclusions, extensive experiments are conducted on Gait3D, GREW, CCPG, and SUSTech1K. The code is available at this https URL.
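The abstract says C$^2$Fusion "preserves commonalities while highlighting differences"; one way to read that literally is sketched below (elementwise mean for the shared part, absolute difference for the distinctive part), purely as an illustration rather than the paper's actual operator.

```python
import torch

def c2fusion_sketch(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Toy commonality/difference fusion of two modality features (B, C, H, W)."""
    common = 0.5 * (feat_a + feat_b)        # shared structure across modalities
    distinct = (feat_a - feat_b).abs()      # modality-specific cues kept explicit
    return torch.cat([common, distinct], dim=1)

sil_feat = torch.randn(4, 128, 16, 11)      # e.g. silhouette-branch features
par_feat = torch.randn(4, 128, 16, 11)      # e.g. parsing-branch features
print(c2fusion_sketch(sil_feat, par_feat).shape)  # torch.Size([4, 256, 16, 11])
```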
https://arxiv.org/abs/2412.11495
This paper studies a combined person re-identification (re-id) method that uses human parsing, analytical feature extraction, and similarity estimation schemes. One of its prominent features is its low computational requirements, so it can be implemented on edge devices. The method allows direct comparison of specific image regions using interpretable features consisting of color and texture channels. Colors are analyzed and compared in the CIE-Lab color space, with histogram smoothing for noise reduction. A novel pre-configured latent-space (LS) supervised autoencoder (SAE) is proposed for texture analysis, which encodes input textures as LS points. This yields more accurate similarity measures than simplistic label comparison. The proposed method also does not rely on photos or other re-id data for training, which makes it completely re-id dataset-agnostic. The viability of the proposed method is verified by computing rank-1, rank-10, and mAP re-id metrics on the Market1501 dataset. The results are comparable to those of conventional deep learning methods, and potential ways to further improve the method are discussed.
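The color part of the pipeline is straightforward to sketch: convert a body-part crop to CIE-Lab, build per-channel histograms, smooth them, and compare. Bin counts, the smoothing kernel, and the intersection metric below are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from skimage.color import rgb2lab

def lab_histogram(crop: np.ndarray, bins: int = 32, kernel: int = 5) -> np.ndarray:
    """Smoothed, concatenated L/a/b histograms of an RGB crop (values in [0, 1])."""
    lab = rgb2lab(crop)
    ranges = [(0, 100), (-128, 127), (-128, 127)]
    hists = []
    for ch, rng in enumerate(ranges):
        h, _ = np.histogram(lab[..., ch], bins=bins, range=rng, density=True)
        h = np.convolve(h, np.ones(kernel) / kernel, mode="same")   # noise reduction
        hists.append(h)
    return np.concatenate(hists)

def color_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Histogram intersection between two region descriptors."""
    return float(np.minimum(a, b).sum() / max(a.sum(), 1e-8))

crop_a = np.random.rand(64, 32, 3)
crop_b = np.random.rand(64, 32, 3)
print(color_similarity(lab_histogram(crop_a), lab_histogram(crop_b)))
```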
https://arxiv.org/abs/2412.05076
Existing studies of gait recognition primarily utilize sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To exploit the advantages of silhouette and parsing while overcoming their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders that map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracies of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions such as occlusions and cloth changes.
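As a rough illustration of the Global Cross-granularity Module idea, the sketch below uses a global silhouette descriptor to re-weight parsing-feature channels; the squeeze-and-excitation form is an assumption, not XGait's published module.

```python
import torch
import torch.nn as nn

class GlobalCrossGranularity(nn.Module):
    """Use a global silhouette feature to gate parsing-feature channels."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, parsing_feat: torch.Tensor, sil_feat: torch.Tensor) -> torch.Tensor:
        # parsing_feat: (B, C, H, W) fine-grained but noisier features
        # sil_feat:     (B, C, H, W) coarser but robust silhouette features
        g = self.gate(sil_feat.mean(dim=(2, 3)))          # global silhouette descriptor
        return parsing_feat * g[..., None, None]          # quality-aware re-weighting

out = GlobalCrossGranularity()(torch.randn(2, 256, 16, 11), torch.randn(2, 256, 16, 11))
print(out.shape)  # torch.Size([2, 256, 16, 11])
```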
https://arxiv.org/abs/2411.10742
Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON facilitates the seamless transfer of in-shop or worn garments of any category to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: Only the original diffusion modules are used, without additional network modules. The text encoder and cross-attentions for text injection in the backbone are removed, reducing the parameters by 167.02M. (2) Parameter-efficient training: We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters, approximately 5.51 percent of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, target person image, and mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.
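The core input trick is spatial concatenation; here is a hedged sketch of how the person and garment latents might be tiled side by side before entering the diffusion backbone (the latent sizes and the concatenation axis are assumptions).

```python
import torch

# Hypothetical VAE latents at 1/8 resolution for a 1024x768 image pair.
person_latent  = torch.randn(1, 4, 128, 96)   # masked target-person latent
garment_latent = torch.randn(1, 4, 128, 96)   # in-shop or worn garment latent

# CatVTON-style conditioning: no ReferenceNet, no extra encoder -- the garment is
# simply placed next to the person along a spatial axis and denoised jointly.
diffusion_input = torch.cat([person_latent, garment_latent], dim=-1)  # (1, 4, 128, 192)
print(diffusion_input.shape)
```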
https://arxiv.org/abs/2407.15886
Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2407.10694
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing. Website: this https URL.
https://arxiv.org/abs/2404.18630
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limit model performance. In our research, we introduce a new framework called PAB-ReID, a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce human parsing labels to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we also design a part triplet loss to supervise the learning of human local features, which optimizes intra-/inter-class distances. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showing that our approach outperforms existing state-of-the-art methods.
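A sketch of a part-level triplet loss of the kind described, applied independently to each body-part feature and averaged; the margin value and the averaging over parts are assumptions.

```python
import torch
import torch.nn.functional as F

def part_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                      negative: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Triplet loss per body part, averaged over parts.

    Each tensor is (B, P, D): batch, number of human parts, feature dim.
    """
    losses = [F.triplet_margin_loss(anchor[:, p], positive[:, p], negative[:, p],
                                    margin=margin)
              for p in range(anchor.size(1))]
    return torch.stack(losses).mean()

loss = part_triplet_loss(torch.randn(8, 4, 256), torch.randn(8, 4, 256),
                         torch.randn(8, 4, 256))
```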
https://arxiv.org/abs/2404.03443
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.
https://arxiv.org/abs/2403.08650
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizontal division, which suffer from misalignment due to various human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, thus affecting the effectiveness of part-based methods. The two issues mentioned above result in the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local contexts for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omissions and spatial confusions issues in the previous method. Then, we propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory and utilize two weighting strategies. These strategies address hard sample mining for global features and enhance noise resistance for part features, which enables better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.
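The weighted memory can be sketched as a momentum update of cluster centroids in which each sample's contribution is scaled by a weight (here derived from its similarity to the current centroid); the weighting rule and momentum value are assumptions, not the paper's two published strategies.

```python
import torch
import torch.nn.functional as F

def weighted_memory_update(memory: torch.Tensor, feats: torch.Tensor,
                           labels: torch.Tensor, momentum: float = 0.2) -> torch.Tensor:
    """Update per-cluster centroids with similarity-derived sample weights."""
    memory = F.normalize(memory, dim=1).clone()
    feats = F.normalize(feats, dim=1)
    for k in labels.unique():
        f = feats[labels == k]
        w = torch.softmax(f @ memory[k], dim=0)          # weight easy/hard samples differently
        update = (w[:, None] * f).sum(dim=0)
        memory[k] = F.normalize((1 - momentum) * memory[k] + momentum * update, dim=0)
    return memory

mem = weighted_memory_update(torch.randn(10, 128), torch.randn(32, 128),
                             torch.randint(0, 10, (32,)))
```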
https://arxiv.org/abs/2403.00261
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
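The abstract mentions detail-preserving upsampling that combines feature maps of varying resolution; one generic reading, upsampling the coarse map and fusing it with the high-resolution one, is sketched below as an assumption rather than DROP's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailPreservingUpsample(nn.Module):
    """Upsample coarse semantic features and fuse them with a high-res map."""
    def __init__(self, coarse_dim: int = 512, fine_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(coarse_dim + fine_dim, out_dim, 3, padding=1)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([up, fine], dim=1))

out = DetailPreservingUpsample()(torch.randn(2, 512, 16, 8), torch.randn(2, 256, 64, 32))
print(out.shape)  # torch.Size([2, 256, 64, 32])
```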
https://arxiv.org/abs/2401.18032
Multimodal-based action recognition methods have achieved high success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: this https URL.
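The late-fusion step itself is simple to illustrate: class scores from the pose (GCN) branch and the parsing (CNN) branch are combined with a weighted sum over the same action classes; the weight value below is an assumption.

```python
import torch

def late_fusion(pose_logits: torch.Tensor, parsing_logits: torch.Tensor,
                alpha: float = 0.6) -> torch.Tensor:
    """Ensemble two branch predictions over the same action classes."""
    return alpha * pose_logits + (1.0 - alpha) * parsing_logits

fused = late_fusion(torch.randn(16, 60), torch.randn(16, 60))  # e.g. 60 NTU RGB+D classes
pred = fused.argmax(dim=-1)
```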
https://arxiv.org/abs/2401.02138
Occluded person re-identification (ReID) is a very challenging task due to occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has proven very effective for occluded person ReID. Meanwhile, recent Transformer structures have a strong ability for long-range modeling. Considering the above facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes the Transformer decoder with the help of human parsing. More specifically, our proposed TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict the Transformer's attention and imparts this information to SSD through feature distillation. Thereby, SSD can learn from PTD to aggregate information from body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks use occluded samples as queries, which amplifies the role of alleviating occlusion interference and underestimates the impact of the feature-absence issue. In contrast, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmark. Extensive experiments demonstrate that our proposed method is superior and the new benchmark is essential. The source codes are available at this https URL.
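Two ingredients of TSD are easy to sketch in isolation: restricting attention with a parsing-derived mask, and distilling the teacher's decoded features into the student. The additive-mask formulation and the MSE distillation loss are assumptions, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def parsing_masked_attention(q, k, v, parsing_mask):
    """Scaled dot-product attention restricted to parsed body regions.

    q, k, v:       (B, N, D) token features
    parsing_mask:  (B, N) bool, True where a token falls on a human part
    """
    scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5          # (B, N, N)
    scores = scores.masked_fill(~parsing_mask[:, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def distillation_loss(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    """Feature distillation from the parsing-aware teacher to the student decoder."""
    return F.mse_loss(student_feat, teacher_feat.detach())

q = k = v = torch.randn(2, 49, 256)
mask = torch.rand(2, 49) > 0.3            # placeholder for a parsing-derived token mask
teacher_out = parsing_masked_attention(q, k, v, mask)
loss = distillation_loss(teacher_out, torch.randn(2, 49, 256))
```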
https://arxiv.org/abs/2312.09797