Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or rely on additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and redesign the garment-person interaction by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON transfers in-shop or worn garments of any category to target persons by simply concatenating the garment and person images along the spatial dimension as the input. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: only the original diffusion modules are used, with no additional network modules; the text encoder and the cross-attention layers for text injection are removed from the backbone, cutting 167.02M parameters. (2) Parameter-efficient training: we identify the try-on-relevant modules through experiments and achieve high-quality try-on effects by training only 49.57M parameters, approximately 5.51% of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, a target person image, and a mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON generalizes well to in-the-wild scenarios despite being trained on open-source datasets with only 73K samples.
https://arxiv.org/abs/2407.15886
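A minimal sketch of the concatenation idea described above, not the authors' code: a masked person latent and a garment latent are joined side by side along the width dimension, with the try-on mask stacked as an extra channel, before a text-free denoiser step. The ToyDenoiser, latent shapes, and mask layout are illustrative assumptions.

import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a text-free diffusion U-Net (no cross-attention)."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

person_latent  = torch.randn(1, 4, 64, 48)   # VAE latent of the target person
garment_latent = torch.randn(1, 4, 64, 48)   # VAE latent of the garment reference
tryon_mask     = torch.ones(1, 1, 64, 48)    # region to be re-synthesized
garment_mask   = torch.zeros(1, 1, 64, 48)   # garment side is never inpainted

# Concatenate person and garment along the spatial (width) dimension,
# then stack the corresponding mask as an extra input channel.
x    = torch.cat([person_latent, garment_latent], dim=3)     # (1, 4, 64, 96)
mask = torch.cat([tryon_mask, garment_mask], dim=3)          # (1, 1, 64, 96)
noise_pred = ToyDenoiser()(torch.cat([x, mask], dim=1))      # (1, 4, 64, 96)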
Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in cloth-changing scenarios. Its main challenge is to disentangle clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. We further propose a far away attention (FAA) mechanism for clothing-unrelated features and a person contour attention (PCA) mechanism for pedestrian contour features to improve feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which yields a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2407.10694
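A schematic sketch of the disentangle-then-discard idea above, under stated assumptions rather than the paper's implementation: a backbone feature is split into clothing-unrelated and clothing-related parts, a decoder reconstructs a coarse human parsing mask as the supervision target during training, and only the clothing-unrelated part is kept at inference. FeatureSplitter, the feature dimensions, and the label grid are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSplitter(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.unrelated = nn.Linear(dim, dim // 2)   # identity / contour cues
        self.related   = nn.Linear(dim, dim // 2)   # clothing appearance cues
        # decoder maps the concatenated parts back to a coarse parsing map
        self.decoder   = nn.Linear(dim, 8 * 16 * 8)

    def forward(self, feat):
        u, r = self.unrelated(feat), self.related(feat)
        parsing_logits = self.decoder(torch.cat([u, r], dim=1)).view(-1, 8, 16, 8)
        return u, r, parsing_logits

model = FeatureSplitter()
backbone_feat = torch.randn(4, 512)                  # features from any ReID backbone
parsing_gt = torch.randint(0, 8, (4, 16, 8))         # downsampled parsing labels
u, r, logits = model(backbone_feat)
recon_loss = F.cross_entropy(logits, parsing_gt)     # parsing mask as reconstruction ground truth
test_embedding = F.normalize(u, dim=1)               # clothing-related part discarded at test time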
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
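An illustrative stand-in for the cross-task block described above, not the MTMamba code: each task keeps its own feature map while a shared view of all tasks gates the per-task refinement. The real CTM block builds on Mamba state-space layers; the gated fusion here is an assumption used only to show the data flow across tasks.

import torch
import torch.nn as nn

class SimpleCrossTaskBlock(nn.Module):
    def __init__(self, channels, num_tasks):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Conv2d(channels * num_tasks, channels, kernel_size=1)
             for _ in range(num_tasks)]
        )

    def forward(self, task_feats):            # list of (B, C, H, W), one per task
        shared = torch.cat(task_feats, dim=1)  # pooled view of all tasks
        # each task refines its own features with a gate computed from all tasks
        return [f + torch.sigmoid(g(shared)) * f for f, g in zip(task_feats, self.gates)]

feats = [torch.randn(2, 32, 56, 56) for _ in range(3)]   # e.g. segmentation, parsing, boundary
out = SimpleCrossTaskBlock(32, 3)(feats)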
Studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. To address this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline that efficiently combines a human-in-the-loop process with automation to accurately label 4D scans across diverse garments and body movements. Leveraging the precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complement synthetic sources, paving the way for advances in research on lifelike human clothing. Website: this https URL.
https://arxiv.org/abs/2404.18630
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limit model performance. In this work, we introduce PAB-ReID, a novel ReID framework that incorporates part-attention mechanisms to tackle these issues effectively. First, we introduce human parsing labels to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser that generates fine-grained human local feature representations while suppressing background interference. Moreover, we design a part triplet loss to supervise the learning of human local features, optimizing intra- and inter-class distances. Extensive experiments on specialized occlusion and regular ReID datasets show that our approach outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2404.03443
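A hedged sketch of a part-level triplet objective in the spirit of the part triplet loss described above; the mining strategy, margin, and part count are assumptions. Each body-part embedding of an anchor image is pulled toward the same part of a same-identity image and pushed away from the same part of a different identity.

import torch
import torch.nn.functional as F

def part_triplet_loss(anchor, positive, negative, margin=0.3):
    """anchor/positive/negative: (B, P, D) part embeddings for B images, P body parts."""
    B, P, D = anchor.shape
    return F.triplet_margin_loss(
        anchor.reshape(B * P, D),
        positive.reshape(B * P, D),
        negative.reshape(B * P, D),
        margin=margin,
    )

a, p, n = (torch.randn(8, 4, 256) for _ in range(3))   # 4 parts, 256-d part features
loss = part_triplet_loss(a, p, n)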
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.
https://arxiv.org/abs/2403.08650
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context and are therefore referred to as part-based methods. However, most part-based methods obtain local context through horizontal division, which suffers from misalignment caused by varied human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, reducing the effectiveness of part-based methods. These two issues lead to the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local context for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omission and spatial confusion issues in previous methods. We then propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory with two weighting strategies: one addresses hard sample mining for global features, and the other enhances noise resistance for part features, enabling better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.
https://arxiv.org/abs/2403.00261
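A simplified sketch of a weighted memory update; the exact weighting rules and momentum are assumptions, not the SCWM implementation. Cluster centroids are updated with per-sample weights: one strategy up-weights hard (low-similarity) samples for global features, the other down-weights likely-noisy samples for part features.

import torch
import torch.nn.functional as F

def weighted_memory_update(memory, feats, labels, momentum=0.2, hard_mining=True):
    """memory: (K, D) cluster centroids; feats: (B, D); labels: (B,) cluster ids."""
    feats = F.normalize(feats, dim=1)
    sims = (feats * F.normalize(memory[labels], dim=1)).sum(dim=1)   # cosine to own centroid
    w = (1.0 - sims) if hard_mining else sims     # hard mining vs. noise suppression
    w = w.clamp(min=1e-4)
    for k in labels.unique():
        idx = labels == k
        wk = (w[idx] / w[idx].sum()).unsqueeze(1)
        update = (wk * feats[idx]).sum(dim=0)     # weighted average of assigned samples
        memory[k] = F.normalize((1 - momentum) * memory[k] + momentum * update, dim=0)
    return memory

mem = F.normalize(torch.randn(10, 128), dim=1)
mem = weighted_memory_update(mem, torch.randn(32, 128), torch.randint(0, 10, (32,)))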
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
https://arxiv.org/abs/2401.18032
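A minimal sketch of combining feature maps of different resolutions, in the spirit of the detail-preserving upsampling mentioned above; the actual DROP module may differ. A coarse, semantically strong map is upsampled to the resolution of a detail-rich map and the two are fused with a 1x1 convolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailPreservingUpsample(nn.Module):
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(coarse_ch + fine_ch, out_ch, kernel_size=1)

    def forward(self, coarse, fine):
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode='bilinear',
                                  align_corners=False)
        return self.fuse(torch.cat([coarse_up, fine], dim=1))

coarse = torch.randn(2, 256, 16, 8)    # low-resolution, semantically strong features
fine   = torch.randn(2, 64, 64, 32)    # high-resolution, detail-rich features
out = DetailPreservingUpsample(256, 64, 128)(coarse, fine)   # (2, 128, 64, 32)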
Multimodal action recognition methods have achieved great success using pose and RGB modalities. However, skeleton sequences lack appearance information, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it selectively retains effective semantic features of body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), the first to leverage both skeletons and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch feeds depictive parsing feature maps into convolutional backbones to model parsing features. The two high-level features are then combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: this https URL.
https://arxiv.org/abs/2401.02138
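A hedged sketch of the late fusion step described above; the fusion weight and class count are illustrative assumptions. Each branch produces its own class scores, and the final prediction blends their softmax outputs.

import torch
import torch.nn.functional as F

def late_fusion(pose_logits, parsing_logits, alpha=0.5):
    """Blend softmax scores from the two branches; alpha weights the pose branch."""
    return alpha * F.softmax(pose_logits, dim=1) + (1 - alpha) * F.softmax(parsing_logits, dim=1)

pose_logits    = torch.randn(4, 60)     # e.g. 60 NTU RGB+D action classes
parsing_logits = torch.randn(4, 60)
pred = late_fusion(pose_logits, parsing_logits).argmax(dim=1)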
Occluded person re-identification (ReID) is a very challenging task due to occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has proven very effective for occluded person ReID. Meanwhile, recent Transformer structures have strong long-range modeling ability. Considering these facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes a Transformer decoder with the help of human parsing. More specifically, our TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict the Transformer's attention and imparts this information to SSD through feature distillation, so that SSD learns from PTD to aggregate information from body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks use occluded samples as queries, which amplifies the role of alleviating occlusion interference and underestimates the impact of the feature-absence issue. In contrast, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmarks. Extensive experiments demonstrate that our proposed method is superior and that the new benchmark is essential. The source codes are available at this https URL.
https://arxiv.org/abs/2312.09797
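A schematic sketch of the teacher-student idea above; shapes, the masking rule, and the distillation loss are assumptions, not the TSD implementation. The teacher's attention over image tokens is restricted to foreground tokens given by a human parsing mask, and the student (which receives no parsing cue) is trained to mimic the teacher's output features.

import torch
import torch.nn.functional as F

def parsing_masked_attention(q, k, v, fg_mask):
    """q: (B, Nq, D); k, v: (B, Nt, D); fg_mask: (B, Nt) with 1 = body token."""
    scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5            # (B, Nq, Nt)
    scores = scores.masked_fill(fg_mask.unsqueeze(1) == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

B, Nq, Nt, D = 2, 4, 48, 256
q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nt, D), torch.randn(B, Nt, D)
fg = (torch.rand(B, Nt) > 0.3).long()                              # parsing-derived foreground

teacher_out = parsing_masked_attention(q, k, v, fg)                # parsing-restricted attention
student_out = F.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1) @ v   # plain attention
distill_loss = F.mse_loss(student_out, teacher_out.detach())       # feature distillation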
We propose 360° Volumetric Portrait (3VP) Avatar, a novel method for reconstructing 360° photo-realistic portrait avatars of human subjects solely from monocular video inputs. State-of-the-art monocular avatar reconstruction methods rely on stable facial performance capture. However, the common use of 3DMM-based facial tracking has its limits: side views can hardly be captured, and it fails especially for back views, since required inputs such as facial landmarks or human parsing masks are missing. This results in incomplete avatar reconstructions that only cover the frontal hemisphere. In contrast, we propose template-based tracking of the torso, head, and facial expressions, which allows us to cover the appearance of a human subject from all sides. Thus, given a sequence of a subject rotating in front of a single camera, we train a neural volumetric representation based on neural radiance fields. A key challenge in constructing this representation is modeling appearance changes, especially in the mouth region (i.e., lips and teeth). We therefore propose a deformation-field-based blend basis, which allows us to interpolate between different appearance states. We evaluate our approach on captured real-world data and compare against state-of-the-art monocular reconstruction methods. In contrast to those, our method is the first monocular technique that reconstructs an entire 360° avatar.
https://arxiv.org/abs/2312.05311
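An assumption-level illustration of a deformation-field blend basis, not the paper's code: per-point offsets are expressed as a weighted combination of a small set of learned basis deformation fields, so different appearance or expression states can be interpolated by changing the blend weights. The MLP size and basis count are hypothetical.

import torch
import torch.nn as nn

class BlendDeformation(nn.Module):
    def __init__(self, num_basis=8, hidden=64):
        super().__init__()
        self.num_basis = num_basis
        # one small MLP predicts all basis offsets for a query point at once
        self.basis = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * num_basis),
        )

    def forward(self, points, weights):
        """points: (N, 3) canonical positions; weights: (num_basis,) blend weights."""
        offsets = self.basis(points).view(-1, self.num_basis, 3)   # (N, K, 3)
        return points + (weights.view(1, -1, 1) * offsets).sum(dim=1)

pts = torch.rand(1024, 3)
w = torch.softmax(torch.randn(8), dim=0)       # per-frame appearance/expression weights
deformed = BlendDeformation()(pts, w)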
In this paper, we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior work constrained to specific input types, our method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions. To address the entanglement challenge when full garment images are used as conditions, we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage, we generate a human parsing map reflecting the desired style, conditioned on the input. In the second stage, we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have not been achieved in previous fashion editing works, we propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate the superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.
https://arxiv.org/abs/2312.04534
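A toy sketch of the second stage's idea, region-wise texture compositing onto a parsing map; the function, labels, and compositing rule are illustrative assumptions, not the ucVTON pipeline.

import torch

def composite_textures(parsing_map, textures):
    """parsing_map: (H, W) integer labels; textures: dict label -> (3, H, W) texture."""
    H, W = parsing_map.shape
    out = torch.zeros(3, H, W)
    for label, tex in textures.items():
        mask = (parsing_map == label).float()    # region predicted in the first stage
        out = out + mask.unsqueeze(0) * tex      # paint this garment region
    return out

parsing = torch.randint(0, 3, (256, 192))         # 0: background, 1: top, 2: bottom
tex = {1: torch.rand(3, 256, 192), 2: torch.rand(3, 256, 192)}
image = composite_textures(parsing, tex)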
Multi-human parsing is an image segmentation task that requires both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the output form of each module as pixel-level segmentation results while supervising instance and category features with a homogeneous label and an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level outputs, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
https://arxiv.org/abs/2310.08984
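A hedged sketch of learning correlations in cosine space: pixel embeddings and category (or instance) prototypes are L2-normalized, and their cosine similarities are read directly as pixel-level segmentation logits. How prototypes are obtained and scaled here is an assumption for illustration.

import torch
import torch.nn.functional as F

def cosine_segmentation_logits(pixel_feats, prototypes, scale=10.0):
    """pixel_feats: (B, D, H, W); prototypes: (C, D) category or instance vectors."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    # cosine similarity per pixel and class, scaled to behave like logits
    return scale * torch.einsum('bdhw,cd->bchw', pixel_feats, prototypes)

feats = torch.randn(2, 64, 32, 32)
protos = torch.randn(20, 64)                         # e.g. 20 part categories
logits = cosine_segmentation_logits(feats, protos)   # (2, 20, 32, 32)
loss = F.cross_entropy(logits, torch.randint(0, 20, (2, 32, 32)))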
Human parsing aims to segment each pixel of a human image into fine-grained semantic categories. However, current human parsers trained on clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to help evaluate the risk tolerance of human parsing models. Inspired by data augmentation strategies, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentation from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated sequentially to adapt to unforeseen image corruptions. The image-aware augmentation enriches the diversity of training images through common image operations, while the model-aware augmentation strategy increases the diversity of input data by exploiting the model's randomness. The proposed method is model-agnostic and can be plugged into arbitrary state-of-the-art human parsing frameworks. Experimental results show that the proposed method exhibits good universality, improving the robustness of human parsing models, and even semantic segmentation models, under various common image corruptions, while still achieving comparable performance on clean data.
https://arxiv.org/abs/2309.00938
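A loose sketch of chaining the two augmentation views; both operators are assumptions, not the paper's. A conventional image-level operation (here a brightness jitter as a stand-in for blur/color/noise ops) is followed by a model-aware image-to-image transform whose randomness comes from the network itself (a tiny conv net with dropout kept active).

import torch
import torch.nn as nn

def image_aware(x):
    # image-aware stage: random brightness jitter as a stand-in for common image ops
    return (x * (0.8 + 0.4 * torch.rand(x.shape[0], 1, 1, 1))).clamp(0, 1)

class ModelAwareTransform(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(0.5),                        # model randomness
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x):
        return (x + self.net(x)).clamp(0, 1)          # residual image-to-image perturbation

def heterogeneous_augment(x):
    x = image_aware(x)                                # image-aware stage
    return ModelAwareTransform().train()(x)           # model-aware stage, applied sequentially

aug = heterogeneous_augment(torch.rand(2, 3, 128, 128))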
Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades because they are easy to extract from video frames. Despite their success in gait recognition in laboratory environments, they usually fail in real-world scenarios due to the low information entropy of their gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy for encoding the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively exploit the GPS representation, we propose a novel human parsing-based gait recognition framework named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two lightweight heads. The first head extracts global semantic features from GPSs, while the other learns mutual information among part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. The experimental results show a significant accuracy improvement brought by the GPS representation and the superiority of ParsingGait. The code and dataset are available at this https URL.
https://arxiv.org/abs/2308.16739
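An illustrative sketch of the two-head idea; the backbone, the body-part graph, and all layer sizes are assumptions. One head pools a global feature from part features extracted from parsing sequences, the other propagates part-level features over a fixed body-part graph with a single graph-convolution step.

import torch
import torch.nn as nn

class TwoHeadGait(nn.Module):
    def __init__(self, in_dim=128, num_parts=6, out_dim=64):
        super().__init__()
        self.global_head = nn.Linear(in_dim, out_dim)
        self.part_proj = nn.Linear(in_dim, out_dim)
        # fixed, row-normalized adjacency over body parts (illustrative)
        self.register_buffer('adj', torch.ones(num_parts, num_parts) / num_parts)

    def forward(self, part_feats):
        """part_feats: (B, P, D) per-part features pooled from a CNN backbone."""
        global_feat = self.global_head(part_feats.mean(dim=1))          # global semantic head
        part_feat = torch.relu(self.adj @ self.part_proj(part_feats))   # graph message passing
        return torch.cat([global_feat, part_feat.flatten(1)], dim=1)

emb = TwoHeadGait()(torch.randn(4, 6, 128))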
The fashion e-commerce industry has witnessed significant growth in recent years, prompting the exploration of image-based virtual try-on techniques that bring Augmented Reality (AR) experiences to online shopping platforms. However, existing research has largely overlooked a crucial aspect: the runtime of the underlying machine-learning model. While existing methods prioritize enhancing output quality, they often disregard execution time, which restricts their application to a limited range of devices. To address this gap, we propose Distilled Mobile Real-time Virtual Try-On (DM-VTON), a novel virtual try-on framework designed for simplicity and efficiency. Our approach is based on a knowledge distillation scheme that leverages a strong Teacher network as supervision to guide a Student network without relying on human parsing. Notably, we introduce an efficient Mobile Generative Module within the Student network, significantly reducing the runtime while ensuring high-quality output. Additionally, we propose Virtual Try-on-guided Pose for Data Synthesis to address the limited pose variation observed in training images. Experimental results show that the proposed method achieves 40 frames per second on a single Nvidia Tesla T4 GPU and takes up only 37 MB of memory while producing almost the same output quality as other state-of-the-art methods. DM-VTON is poised to facilitate the advancement of real-time AR applications as well as the generation of lifelike, attired human figures tailored for diverse specialized training tasks. this https URL
https://arxiv.org/abs/2308.13798
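A schematic training step for the parser-free distillation idea above; both networks are toy stand-ins, and the real Teacher/Student architectures and losses differ. The student sees only the person and clothing images and is supervised by the output of a teacher that additionally receives human parsing.

import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_gen(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1))

teacher = tiny_gen(in_ch=3 + 3 + 1)     # person + clothing + parsing channel
student = tiny_gen(in_ch=3 + 3)         # parser-free: person + clothing only

person, cloth = torch.rand(2, 3, 256, 192), torch.rand(2, 3, 256, 192)
parsing = torch.rand(2, 1, 256, 192)

with torch.no_grad():
    target = teacher(torch.cat([person, cloth, parsing], dim=1))   # teacher guidance
pred = student(torch.cat([person, cloth], dim=1))
distill_loss = F.l1_loss(pred, target)
distill_loss.backward()                  # only the student is updated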
Existing methods for multiple human parsing (MHP) apply statistical models to acquire underlying associations between images and labeled body parts. However, the acquired associations often contain many spurious correlations that degrade model generalization, leaving statistical models vulnerable to visually contextual variations in images (e.g., unseen image styles or external interventions). To tackle this, we present a causality-inspired parsing paradigm termed CIParsing, which follows fundamental causal principles involving two causal properties for human parsing: causal diversity and causal invariance. Specifically, we assume that an input image is constructed from a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former cause the generation process of human parsing. Since causal and non-causal factors are unobservable, a human parser under the proposed CIParsing is required to construct latent representations of causal factors and learns to enforce these representations to satisfy the causal properties. In this way, the human parser relies on causal factors supported by relevant evidence rather than on non-causal factors tied to spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, CIParsing is designed in a plug-and-play fashion and can be integrated into any existing MHP model. Extensive experiments conducted on two widely used benchmarks demonstrate the effectiveness and generalizability of our method.
https://arxiv.org/abs/2308.12218
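A hedged sketch of the causal-invariance property only; the diversity property and the full training scheme are omitted, and the encoder and context perturbation are assumptions. The latent "causal" representations of two context-perturbed views of the same image are encouraged to stay the same.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))

def context_intervention(x):
    # stand-in for an external-context change (style / illumination perturbation)
    return (x * (0.7 + 0.6 * torch.rand(x.shape[0], 1, 1, 1))).clamp(0, 1)

x = torch.rand(4, 3, 64, 64)
z1 = encoder(context_intervention(x))
z2 = encoder(context_intervention(x))
invariance_loss = 1 - F.cosine_similarity(z1, z2, dim=1).mean()   # causal factors should not change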
The neural rendering of humans is a topic of great research significance. However, previous works mostly focus on achieving photorealistic details, neglecting the exploration of human parsing. Additionally, classical semantic methods are limited in their ability to efficiently represent fine-grained results under complex motions. Human parsing is inherently related to radiance reconstruction, as similar appearance and geometry often correspond to similar semantic parts. Furthermore, previous works often design a motion field that maps from the observation space to the canonical space, which tends to exhibit either underfitting or overfitting, resulting in limited generalization. In this paper, we present Semantic-Human, a novel method that achieves both photorealistic details and viewpoint-consistent human parsing for the neural rendering of humans. Specifically, we extend neural radiance fields (NeRF) to jointly encode semantics, appearance, and geometry, producing accurate 2D semantic labels from noisy pseudo-label supervision. Leveraging the inherent consistency and smoothness properties of NeRF, Semantic-Human achieves consistent human parsing in both continuous and novel views. We also introduce constraints derived from the SMPL surface for the motion field and regularization for the recovered volumetric geometry. We evaluate the model on the ZJU-MoCap dataset, and the highly competitive results demonstrate the effectiveness of the proposed Semantic-Human. We also showcase various compelling applications, including label denoising, label synthesis, and image editing, and empirically validate its advantageous properties.
https://arxiv.org/abs/2308.09894
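A minimal sketch of a radiance field that jointly predicts density, color, and per-point semantic logits; the layer sizes and the extra semantics head are assumptions, not the Semantic-Human architecture.

import torch
import torch.nn as nn

class SemanticNeRF(nn.Module):
    def __init__(self, num_classes=8, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)                  # volume density
        self.rgb = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())
        self.semantic = nn.Linear(hidden, num_classes)     # parsing logits per point

    def forward(self, xyz):
        h = self.trunk(xyz)
        return self.sigma(h), self.rgb(h), self.semantic(h)

sigma, rgb, sem = SemanticNeRF()(torch.rand(4096, 3))
# sem can be alpha-composited along each ray like rgb, then supervised with
# noisy pseudo-labels (e.g. cross-entropy on the rendered semantic map).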
Recently, visual-language learning has shown great potential in enhancing visual-based person re-identification (ReID). Existing visual-language learning-based ReID methods often focus on whole-body image-text feature alignment while neglecting supervision on fine-grained part features. This choice simplifies the learning process but cannot guarantee within-part feature semantic consistency, thus hindering the final performance. Therefore, we propose to enhance fine-grained visual features with part-informed language supervision for ReID tasks. The proposed method, named Part-Informed Visual-language Learning ($\pi$-VL), suggests that (i) a human parsing-guided prompt tuning strategy and (ii) a hierarchical fusion-based visual-language alignment paradigm play essential roles in ensuring within-part feature semantic consistency. Specifically, we combine both identity labels and parsing maps to constitute pixel-level text prompts, and we fuse multi-stage visual features with a lightweight auxiliary head to perform fine-grained image-text alignment. As a plug-and-play and inference-free solution, our $\pi$-VL achieves substantial improvements over previous state-of-the-art methods on four commonly used ReID benchmarks, notably reporting 90.3% Rank-1 and 76.5% mAP on the most challenging MSMT17 database without bells and whistles.
https://arxiv.org/abs/2308.02738
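A rough sketch of pixel-level prompt supervision; how prompts are built and fused here is an assumption, not the $\pi$-VL implementation. Each pixel is assigned the text embedding of its body part, and pixel visual features are pulled toward those embeddings with a cosine alignment loss.

import torch
import torch.nn.functional as F

def pixel_text_alignment(visual_feats, parsing_map, part_text_emb):
    """visual_feats: (B, D, H, W); parsing_map: (B, H, W) part ids;
    part_text_emb: (P, D) text embeddings, one per body-part prompt."""
    target = part_text_emb[parsing_map]                  # (B, H, W, D) per-pixel text target
    v = F.normalize(visual_feats.permute(0, 2, 3, 1), dim=-1)
    t = F.normalize(target, dim=-1)
    return 1 - (v * t).sum(dim=-1).mean()                # cosine alignment loss

feats = torch.randn(2, 512, 24, 12)
parsing = torch.randint(0, 5, (2, 24, 12))
text_emb = torch.randn(5, 512)                           # e.g. from a frozen text encoder
loss = pixel_text_alignment(feats, parsing, text_emb)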
Human skeletons and RGB sequences are both widely adopted input modalities for human action recognition. However, skeletons lack appearance features, and color data suffer from a large amount of irrelevant depiction. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain the spatiotemporal features of body parts while filtering out noise from outfits, backgrounds, etc. We propose an Integrating Human Parsing and Pose Network (IPP-Net) for action recognition, which is the first to leverage both skeletons and human parsing feature maps in a dual-branch approach. The human pose branch feeds compact skeletal representations of different modalities into a graph convolutional network to model pose features. In the human parsing branch, multi-frame body-part parsing features are extracted with a human detector and parser, and then learned using a convolutional backbone. A late ensemble of the two branches is adopted to obtain the final predictions, considering both robust keypoints and rich semantic body-part features. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed IPP-Net, which outperforms existing action recognition methods. Our code is publicly available at this https URL.
https://arxiv.org/abs/2307.07977
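A toy sketch of the parsing branch's input construction; the frame count, label layout, and backbone are assumptions. Per-frame body-part parsing maps are stacked along the channel dimension and fed to a convolutional backbone, whose logits are later ensembled with the pose branch (as in the late fusion shown earlier for EPP-Net).

import torch
import torch.nn as nn

T, num_parts = 8, 20                                          # frames, parsing classes
parsing_seq = torch.randint(0, num_parts, (4, T, 112, 112))   # detector + parser output
x = parsing_seq.float() / (num_parts - 1)                     # normalize label ids to [0, 1]

parsing_backbone = nn.Sequential(
    nn.Conv2d(T, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 60),                                        # action logits for this branch
)
parsing_logits = parsing_backbone(x)                          # ensembled with pose logits at the end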