This work studies the multi-human parsing problem. Existing methods, which follow either top-down or bottom-up two-stage paradigms, usually incur expensive computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and its parts. SMP leverages the point features at the barycenter positions to obtain their segmentation and then generates a series of offsets from the barycenter of the human body to the barycenters of its parts, thus matching the human body and parts without a grouping process. Within the SMP architecture, we propose a Refined Feature Retain module to extract the global features of instances through generated mask attention, and a Mask of Interest Reclassify module, a trainable plug-in that refines the classification results with the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the effectiveness and efficiency of the proposed method, which surpasses the state-of-the-art method by 2.1% in AP50p, 1.0% in APvolp, and 1.2% in PCP50. In particular, the proposed method requires fewer training epochs and a less complex model architecture. We will release our source code, pretrained models, and online demos to facilitate further studies.
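Below is a minimal sketch of the barycenter-offset matching idea described in this abstract; the tensor names, the offset direction (written here as pointing from each part back to its body barycenter), and the toy values are illustrative assumptions, not SMP's actual heads or losses.

```python
import torch

def match_parts_to_bodies(body_centers, part_centers, part_to_body_offsets):
    """Assign each predicted part to a human instance via barycenter offsets.

    body_centers:         (B, 2) predicted body barycenters (x, y)
    part_centers:         (P, 2) predicted part barycenters (x, y)
    part_to_body_offsets: (P, 2) predicted offsets pointing from each part
                          barycenter back to its body barycenter
    Returns a (P,) tensor of body indices, one per part.
    """
    # Where each part "votes" its body barycenter to be.
    voted_body = part_centers + part_to_body_offsets      # (P, 2)
    # Distance from every vote to every predicted body barycenter.
    dists = torch.cdist(voted_body, body_centers)         # (P, B)
    # Each part is matched to the nearest body -- no grouping step needed.
    return dists.argmin(dim=1)

# Toy usage with two people and three parts.
bodies = torch.tensor([[50., 80.], [200., 90.]])
parts = torch.tensor([[48., 40.], [205., 45.], [52., 120.]])
offsets = torch.tensor([[2., 40.], [-5., 45.], [-2., -40.]])
print(match_parts_to_bodies(bodies, parts, offsets))  # tensor([0, 1, 0])
```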
https://arxiv.org/abs/2304.11356
This paper presents Scalable Semantic Transfer (SST), a novel training paradigm that explores how to leverage the mutual benefits of data from different label domains (i.e., various levels of label granularity) to train a powerful human parsing network. In practice, two common application scenarios are addressed, termed universal parsing and dedicated parsing: the former aims to learn homogeneous human representations from multiple label domains and switch predictions by only using different segmentation heads, while the latter aims to learn a specific domain prediction while distilling semantic knowledge from other domains. The proposed SST has the following appealing benefits: (1) it can serve as an effective training scheme to embed semantic associations of human body parts from multiple label domains into the human representation learning process; (2) it is an extensible semantic transfer framework that does not require predetermining the overall relations of multiple label domains, which allows continuously adding human parsing datasets to promote training; (3) the relevant modules are only used for auxiliary training and can be removed during inference, eliminating the extra reasoning cost. Experimental results demonstrate that SST can effectively achieve promising universal human parsing performance as well as impressive improvements compared to its counterparts on three human parsing benchmarks (i.e., PASCAL-Person-Part, ATR, and CIHP). Code is available at this https URL.
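A minimal sketch of the universal-parsing idea from this abstract (one shared encoder, one segmentation head per label domain, switched at prediction time); the class counts, module names, and stand-in backbone are assumptions, not the SST implementation.

```python
import torch
import torch.nn as nn

class UniversalParser(nn.Module):
    """Shared dense encoder with a separate segmentation head per label domain."""

    def __init__(self, backbone, feat_dim, classes_per_domain):
        super().__init__()
        self.backbone = backbone  # any dense feature extractor
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(feat_dim, n_cls, kernel_size=1)
            for name, n_cls in classes_per_domain.items()
        })

    def forward(self, images, domain):
        feats = self.backbone(images)          # (N, feat_dim, H, W)
        return self.heads[domain](feats)       # (N, n_cls_domain, H, W)

# Hypothetical domains with different label granularity.
classes = {"pascal_person_part": 7, "atr": 18, "cihp": 20}
backbone = nn.Conv2d(3, 64, 3, padding=1)      # stand-in for a real encoder
model = UniversalParser(backbone, 64, classes)
logits = model(torch.randn(2, 3, 64, 64), domain="cihp")  # (2, 20, 64, 64)
```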
https://arxiv.org/abs/2304.04140
Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike existing self-supervised learning methods, SOLIDER utilizes prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purposes. So a single learned representation cannot fit all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit the different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms the state of the art and establishes new baselines for these tasks. The code is released at this https URL.
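A hedged sketch of the controller idea from this abstract: a scalar ratio is mapped to channel-wise modulation of the backbone features (FiLM-style conditioning). The module name, the MLP, and the modulation form are assumptions; the real SOLIDER conditioning may differ.

```python
import torch
import torch.nn as nn

class SemanticController(nn.Module):
    """Maps a semantic ratio in [0, 1] to channel-wise scales/shifts."""

    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),
        )

    def forward(self, feats, semantic_ratio):
        # feats: (N, C, H, W); semantic_ratio: (N, 1) scalar per sample
        gamma, beta = self.mlp(semantic_ratio).chunk(2, dim=-1)
        gamma = gamma[..., None, None]   # (N, C, 1, 1)
        beta = beta[..., None, None]
        return feats * (1 + gamma) + beta

controller = SemanticController(feat_dim=256)
feats = torch.randn(4, 256, 16, 8)
lam = torch.full((4, 1), 0.8)   # e.g. more semantic information for parsing
out = controller(feats, lam)     # same shape, modulated representation
```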
https://arxiv.org/abs/2303.17602
Human-centric perceptions include a variety of vision tasks, which have widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is desirable to have a general pretrained model for versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmark and pretraining methods. Specifically, we propose HumanBench, based on existing datasets, to comprehensively evaluate, on common ground, the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks, including person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge in human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) to learn diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets. The code will be publicly available at this https URL.
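A rough sketch of what projector-assisted hierarchical pretraining could look like structurally: a shared backbone feeds group-level projectors and task-level heads, used only to shape the pretraining losses and droppable before transfer. All names, the grouping, and the dimensions are assumptions for illustration, not the PATH architecture.

```python
import torch
import torch.nn as nn

class ProjectorAssistedModel(nn.Module):
    """Shared backbone + per-group projectors + per-task heads (pretraining only)."""

    def __init__(self, backbone, feat_dim, task_groups):
        super().__init__()
        self.backbone = backbone
        # One projector per coarse task group (e.g. dense vs. global tasks) ...
        self.group_projectors = nn.ModuleDict({
            g: nn.Linear(feat_dim, feat_dim) for g in task_groups
        })
        # ... and one lightweight head per individual task inside the group.
        self.task_heads = nn.ModuleDict({
            g: nn.ModuleDict({t: nn.Linear(feat_dim, dim) for t, dim in tasks.items()})
            for g, tasks in task_groups.items()
        })

    def forward(self, x, group, task):
        feat = self.backbone(x)                          # (N, feat_dim)
        feat = self.group_projectors[group](feat)        # group-level projection
        return self.task_heads[group][task](feat)        # task-level output

groups = {"dense": {"parsing": 20, "pose": 34}, "global": {"reid": 512}}
model = ProjectorAssistedModel(nn.Flatten(), 3 * 32 * 32, groups)
y = model(torch.randn(2, 3, 32, 32), group="dense", task="parsing")  # (2, 20)
```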
https://arxiv.org/abs/2303.05675
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspects to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.
https://arxiv.org/abs/2303.02936
Human parsing is a key topic in image processing with many applications, such as surveillance analysis, human-robot interaction, person search, and clothing category classification, among many others. Recently, due to the success of deep learning in computer vision, a number of works have aimed at developing human parsing algorithms using deep learning models. As numerous methods have been proposed, a comprehensive survey of this topic is of great importance. In this survey, we provide an analysis of state-of-the-art human parsing methods, covering a broad spectrum of pioneering works for semantic human parsing. We introduce five insightful categories: (1) structure-driven architectures exploit the relationship of different human parts and the inherent hierarchical structure of a human body, (2) graph-based networks capture the global information to achieve an efficient and complete human body analysis, (3) context-aware networks explore useful contexts across all pixels to characterize a pixel of the corresponding class, (4) LSTM-based methods combine short-distance and long-distance spatial dependencies to better exploit abundant local and global contexts, and (5) combined auxiliary information approaches use related tasks or supervision to improve network performance. We also discuss the advantages/disadvantages of the methods in each category and the relationships between methods in different categories, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
https://arxiv.org/abs/2301.12416
Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring to social media to visual special effects, to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page to continuously track recent developments in this fast-advancing field: this https URL.
https://arxiv.org/abs/2301.00394
We propose IntegratedPIFu, a new pixel-aligned implicit model that builds on the foundation set by PIFuHD. IntegratedPIFu shows how depth and human parsing information can be predicted and capitalised upon in a pixel-aligned implicit model. In addition, IntegratedPIFu introduces depth-oriented sampling, a novel training scheme that improves the ability of any pixel-aligned implicit model to reconstruct important human features without noisy artefacts. Lastly, IntegratedPIFu presents a new architecture that, despite using fewer model parameters than PIFuHD, is able to improve the structural correctness of reconstructed meshes. Our results show that IntegratedPIFu significantly outperforms existing state-of-the-art methods on single-view human reconstruction. Our code has been made available online.
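A hedged sketch of one way depth-oriented sampling could be realised: pick random pixels, read the predicted depth there, and jitter query points along the depth axis so training concentrates near the surface instead of far empty space. The function, its parameters, and the jitter scale are assumptions, not IntegratedPIFu's actual recipe.

```python
import numpy as np

def depth_oriented_samples(depth_map, n_points, sigma=0.02, rng=None):
    """Sample 3D query points biased toward the predicted depth surface."""
    rng = rng or np.random.default_rng()
    h, w = depth_map.shape
    ys = rng.integers(0, h, n_points)
    xs = rng.integers(0, w, n_points)
    # Normalised image coordinates in [-1, 1]; depth jittered around prediction.
    x_ndc = 2.0 * xs / (w - 1) - 1.0
    y_ndc = 2.0 * ys / (h - 1) - 1.0
    z = depth_map[ys, xs] + rng.normal(0.0, sigma, n_points)
    return np.stack([x_ndc, y_ndc, z], axis=1)   # (n_points, 3) query points

points = depth_oriented_samples(np.full((512, 512), 0.5), n_points=1024)
```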
https://arxiv.org/abs/2211.07955
Person reidentification (re-ID) is becoming one of the most significant application areas of computer vision due to its importance for science and social security. Due to the enormous size and scale of camera systems, it is beneficial to develop edge computing re-ID applications where at least part of the analysis can be performed by the cameras. However, conventional re-ID relies heavily on computationally demanding deep learning (DL) models that are not readily applicable to edge computing. In this paper we adapt a recently proposed re-ID method, which combines DL human parsing with analytical feature extraction and ranking schemes, to be more suitable for edge computing re-ID. First, we compare parsers that use ResNet101, ResNet18, MobileNetV2, and OSNet backbones and show that parsing can be performed using compact backbones with sufficient accuracy. Second, we transfer the parsers to the tensor processing unit (TPU) of the Google Coral Dev Board and show that it can act as a portable edge computing re-ID station. We also implement the analytical part of the re-ID method on the Coral CPU to ensure that it can perform a complete re-ID cycle. For quantitative analysis we compare inference speed, parsing masks, and re-ID accuracy on a GPU and the Coral TPU depending on the parser backbone. We also discuss possible application scenarios of edge computing in re-ID, taking into account known limitations mainly related to the memory and storage space of portable devices.
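For context, a generic full-integer post-training quantization recipe of the kind the Coral Edge TPU requires when porting a parsing model; the paths, input size, and representative data below are placeholders, not the authors' actual conversion pipeline.

```python
import tensorflow as tf

def representative_data():
    # Replace with real preprocessed pedestrian crops of the parser's input size.
    for _ in range(100):
        yield [tf.random.uniform([1, 256, 128, 3], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("parser_saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
open("parser_int8.tflite", "wb").write(converter.convert())
# Then: `edgetpu_compiler parser_int8.tflite` produces the Edge-TPU model.
```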
https://arxiv.org/abs/2209.11024
Existing methods of multiple human parsing usually adopt a two-stage strategy (typically top-down or bottom-up), which suffers from either strong dependence on prior detection or high computational redundancy during post-grouping. In this work, we present an end-to-end multiple human parsing framework using representative parts, termed RepParser. Different from mainstream methods, RepParser solves multiple human parsing in a new single-stage manner without resorting to person detection or post-grouping. To this end, RepParser decouples the parsing pipeline into instance-aware kernel generation and part-aware human parsing, which are responsible for instance separation and instance-specific part segmentation, respectively. In particular, we empower the parsing pipeline with representative parts, since they are characterized by instance-aware keypoints and can be utilized to dynamically parse each person instance. Specifically, representative parts are obtained by jointly localizing centers of instances and estimating keypoints of body part regions. After that, we dynamically predict instance-aware convolution kernels through representative parts, thus encoding person-part context into each kernel responsible for casting an image feature as an instance-specific representation. Furthermore, a multi-branch structure is adopted to divide each instance-specific representation into several part-aware representations for separate part segmentation. In this way, RepParser accordingly focuses on person instances with the guidance of representative parts and directly outputs parsing results for each person instance, thus eliminating the requirement of prior detection or post-grouping. Extensive experiments on two challenging benchmarks demonstrate that our proposed RepParser is a simple yet effective framework and achieves very competitive performance.
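A minimal sketch of dynamic, instance-aware convolution in the spirit described above (each instance gets its own kernel that casts the shared feature map into an instance-specific map); the shapes and 1x1 kernel form are assumptions, not the exact RepParser heads.

```python
import torch
import torch.nn.functional as F

def apply_instance_kernels(feats, kernels):
    """Apply one predicted 1x1 kernel per person instance to shared features.

    feats:   (C, H, W) shared image features
    kernels: (N, C)    one predicted kernel per person instance
    returns: (N, H, W) instance-specific response maps
    """
    c, h, w = feats.shape
    weight = kernels.view(-1, c, 1, 1)            # (N, C, 1, 1)
    out = F.conv2d(feats.unsqueeze(0), weight)    # (1, N, H, W)
    return out.squeeze(0)

feats = torch.randn(64, 96, 96)
kernels = torch.randn(3, 64)                      # e.g. three person instances
maps = apply_instance_kernels(feats, kernels)     # (3, 96, 96)
```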
https://arxiv.org/abs/2208.12908
Cloth-changing person re-identification (CC-ReID), which aims to match person identities under clothing changes, has become a rising research topic in recent years. However, typical biometrics-based CC-ReID methods often require cumbersome pose or body part estimators to learn cloth-irrelevant features from human biometric traits, which comes with high computational costs. Besides, the performance is significantly limited due to the resolution degradation of surveillance images. To address the above limitations, we propose an effective Identity-Sensitive Knowledge Propagation framework (DeSKPro) for CC-ReID. Specifically, a Cloth-irrelevant Spatial Attention module is introduced to eliminate the distraction of clothing appearance by acquiring knowledge from the human parsing module. To mitigate the resolution degradation issue and mine identity-sensitive cues from human faces, we propose to restore the missing facial details using prior facial knowledge, which is then propagated to a smaller network. After training, the extra computations for human parsing or face restoration are no longer required. Extensive experiments show that our framework outperforms state-of-the-art methods by a large margin. Our code is available at this https URL.
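A hedged sketch of using parsing output to suppress clothing regions in the re-ID features; the class ids treated as clothing and the exact attention form are assumptions, not DeSKPro's module.

```python
import torch

def cloth_irrelevant_attention(feats, parsing_probs, clothing_ids):
    """Down-weight clothing pixels using probabilities from a parsing network.

    feats:         (N, C, H, W) re-ID backbone features
    parsing_probs: (N, K, H, W) softmax output of a human parsing network
    clothing_ids:  list of parsing classes treated as clothing
    """
    cloth_prob = parsing_probs[:, clothing_ids].sum(dim=1, keepdim=True)  # (N,1,H,W)
    attention = 1.0 - cloth_prob          # emphasise cloth-irrelevant pixels
    return feats * attention

feats = torch.randn(2, 256, 24, 12)
probs = torch.softmax(torch.randn(2, 20, 24, 12), dim=1)
out = cloth_irrelevant_attention(feats, probs, clothing_ids=[5, 6, 7])
```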
https://arxiv.org/abs/2208.12023
Person reidentification (re-ID) has been receiving increasing attention in recent years due to its importance for both science and society. Machine learning, and particularly deep learning (DL), has become the main re-ID tool that has allowed researchers to achieve unprecedented accuracy levels on benchmark datasets. However, there is a known problem of poor generalization of DL models: models trained to achieve high accuracy on one dataset perform poorly on other ones and require re-training. To address this issue, we present a model without trainable parameters which shows great potential for high generalization. It combines a fully analytical feature extraction and similarity ranking scheme with DL-based human parsing used to obtain the initial subregion classification. We show that such a combination largely eliminates the drawbacks of existing analytical methods. We use interpretable color and texture features which have human-readable similarity measures associated with them. To verify the proposed method we conduct experiments on the Market1501 and CUHK03 datasets, achieving competitive rank-1 accuracy comparable with that of DL models. Most importantly, we show that our method achieves 63.9% and 93.5% rank-1 cross-domain accuracy when applied to transfer learning tasks, significantly higher than the previously reported 30-50% transfer accuracy. We discuss potential ways of adding new features to further improve the model. We also show the advantage of interpretable features for constructing human-generated queries from verbal descriptions to conduct a search without a query image.
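A small sketch of the interpretable-feature idea: per-body-part colour histograms computed from a parsing mask, compared with a human-readable similarity; the bin count, part count, and intersection measure are illustrative choices, not the paper's exact descriptor.

```python
import numpy as np

def part_color_histograms(image, parsing, n_parts, bins=8):
    """Per-part, per-channel colour histograms from an RGB image and a parsing mask."""
    hists = np.zeros((n_parts, 3, bins))
    for part in range(n_parts):
        pixels = image[parsing == part]                  # (M, 3) RGB values
        if pixels.size == 0:
            continue
        for ch in range(3):
            h, _ = np.histogram(pixels[:, ch], bins=bins, range=(0, 255))
            hists[part, ch] = h / max(h.sum(), 1)        # normalised, readable
    return hists

def similarity(h1, h2):
    """Histogram-intersection similarity, a human-readable matching score in [0, 1]."""
    return np.minimum(h1, h2).sum() / (h1.shape[0] * h1.shape[1])

img = np.random.randint(0, 256, (128, 64, 3))
mask = np.random.randint(0, 6, (128, 64))
score = similarity(part_color_histograms(img, mask, 6),
                   part_color_histograms(img, mask, 6))
```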
https://arxiv.org/abs/2207.14243
Most state-of-the-art instance-level human parsing models adopt two-stage anchor-based detectors and, therefore, cannot avoid the heuristic anchor box design and the lack of analysis on a pixel level. To address these two issues, we have designed an instance-level human parsing network which is anchor-free and solvable on a pixel level. It consists of two simple sub-networks: an anchor-free detection head for bounding box predictions and an edge-guided parsing head for human segmentation. The anchor-free detection head inherits the pixel-level merits and effectively avoids the sensitivity to hyper-parameters, as proved in object detection applications. By introducing the part-aware boundary clue, the edge-guided parsing head is capable of distinguishing adjacent human parts from each other, up to 58 parts in a single human instance, even in overlapping instances. Meanwhile, a refinement head integrating box-level score and part-level parsing quality is exploited to improve the quality of the parsing results. Experiments on two multiple human parsing datasets (i.e., CIHP and LV-MHP-v2.0) and one video instance-level human parsing dataset (i.e., VIP) show that our method achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.
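One simple way to derive a part-aware boundary target from parsing labels, as an illustration of what an "edge clue" can be; the 4-neighbour rule is an assumption, the paper's exact construction may differ.

```python
import torch
import torch.nn.functional as F

def part_boundary_map(parsing_labels):
    """Mark a pixel as boundary if any 4-neighbour carries a different part label.

    parsing_labels: (N, H, W) integer part labels
    returns:        (N, H, W) float {0, 1} boundary map
    """
    lbl = parsing_labels.float().unsqueeze(1)                 # (N, 1, H, W)
    padded = F.pad(lbl, (1, 1, 1, 1), mode="replicate")
    diffs = torch.stack([
        padded[:, :, 1:-1, 1:-1] != padded[:, :, :-2, 1:-1],  # up
        padded[:, :, 1:-1, 1:-1] != padded[:, :, 2:, 1:-1],   # down
        padded[:, :, 1:-1, 1:-1] != padded[:, :, 1:-1, :-2],  # left
        padded[:, :, 1:-1, 1:-1] != padded[:, :, 1:-1, 2:],   # right
    ], dim=0)
    return diffs.any(dim=0).squeeze(1).float()

labels = torch.randint(0, 59, (2, 64, 64))   # up to 58 parts plus background
edges = part_boundary_map(labels)             # (2, 64, 64)
```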
https://arxiv.org/abs/2207.06854
Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for layman users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose with two dedicated steps. 1) With some texts describing the shapes of clothes, the given human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with more attributes about the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level includes the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture of experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated to human images by a decoder learned together with the hierarchical codebooks. The use of mixture-of-experts allows the generated image to be conditioned on the fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework can generate more diverse and realistic human images compared to state-of-the-art methods.
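The basic vector-quantization step behind a hierarchical texture codebook, for readers unfamiliar with VQ: nearest-neighbour lookup into a coarse and a fine codebook. The codebook sizes and feature dimension are illustrative, not Text2Human's configuration.

```python
import torch

def quantize(features, codebook):
    """Nearest-neighbour codebook lookup.

    features: (N, D) continuous texture features
    codebook: (K, D) learned codebook entries
    returns indices (N,) and quantized vectors (N, D)
    """
    dists = torch.cdist(features, codebook)   # (N, K)
    idx = dists.argmin(dim=1)
    return idx, codebook[idx]

coarse_book = torch.randn(512, 256)    # structural texture representations
fine_book = torch.randn(2048, 256)     # detailed texture representations
coarse_feat = torch.randn(16, 256)
idx_c, q_c = quantize(coarse_feat, coarse_book)
# A sampler would first predict coarse indices, then the fine-book indices.
```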
https://arxiv.org/abs/2205.15996
Recently, much progress has been made in self-supervised action recognition. Most existing approaches emphasize the contrastive relations among videos, including appearance and motion consistency. However, two main issues remain for existing pre-training methods: 1) the learned representation is neutral and not informative for a specific task; 2) multi-task learning-based pre-training sometimes leads to sub-optimal solutions due to the inconsistent domains of different tasks. To address the above issues, we propose a novel action recognition pre-training framework which exploits human-centered prior knowledge to generate more informative representations and avoids the conflict between multiple tasks by using task-dependent representations. Specifically, we distill knowledge from a human parsing model to enrich the semantic capability of the representation. In addition, we combine knowledge distillation with contrastive learning to constitute a task-dependent multi-task framework. We achieve state-of-the-art performance on two popular benchmarks for the action recognition task, i.e., UCF101 and HMDB51, verifying the effectiveness of our method.
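A hedged sketch of how a parsing-distillation term and a contrastive term can be combined into one objective; the MSE/InfoNCE forms and the weight alpha are illustrative choices, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def distill_and_contrast(student_feat, parsing_feat, query, key, queue,
                         tau=0.07, alpha=0.5):
    """Weighted sum of a parsing-distillation loss and an InfoNCE contrastive loss."""
    # Distillation: pull the video features toward the frozen parsing features.
    distill = F.mse_loss(student_feat, parsing_feat.detach())
    # InfoNCE: the query should match its positive key against a negative queue.
    q = F.normalize(query, dim=1)
    k = F.normalize(key, dim=1)
    logits = torch.cat([(q * k).sum(1, keepdim=True), q @ queue.t()], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)
    contrast = F.cross_entropy(logits, labels)
    return alpha * distill + (1 - alpha) * contrast

loss = distill_and_contrast(
    torch.randn(8, 128), torch.randn(8, 128),
    torch.randn(8, 128), torch.randn(8, 128), torch.randn(1024, 128),
)
```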
https://arxiv.org/abs/2204.12729
Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. Therefore, it is desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework HCMoCo, which leverages the multi-modal nature of human data (e.g. RGB, depth, 2D keypoints) for effective representation learning. The objective comes with two main challenges: dense pre-training for multi-modality data and efficient usage of sparse human priors. To tackle these challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets by hierarchically learning a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient usage of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
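A hedged sketch of a dense (pixel-level) cross-modal contrastive loss in the spirit of the intra-sample objective above; the real HCMoCo targets are more elaborate (ordinal and structure-aware terms are omitted here), so this is only a baseline form.

```python
import torch
import torch.nn.functional as F

def dense_intra_sample_nce(rgb_feats, depth_feats, tau=0.1):
    """Pixel-level InfoNCE between RGB and depth features of one sample.

    rgb_feats, depth_feats: (P, D) features of the same P pixel locations
    extracted from the RGB and depth branches.
    """
    r = F.normalize(rgb_feats, dim=1)
    d = F.normalize(depth_feats, dim=1)
    logits = r @ d.t() / tau                  # (P, P) all pixel pairs
    targets = torch.arange(r.size(0))         # the matching pixel is the positive
    return F.cross_entropy(logits, targets)

loss = dense_intra_sample_nce(torch.randn(256, 64), torch.randn(256, 64))
```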
https://arxiv.org/abs/2203.13815
Several computer vision applications, such as person search or online fashion, rely on human description. The use of instance-level human parsing (HP) is therefore relevant since it localizes semantic attributes and body parts within a person. But how should these attributes be characterized? To our knowledge, only some single-HP datasets describe attributes with color, size, and/or pattern characteristics, and there is a lack of multi-HP datasets in the wild with such characteristics. In this article, we propose the CCIHP dataset, based on the multi-HP dataset CIHP, with 20 new labels covering these 3 kinds of characteristics. In addition, we propose HPTR, a new bottom-up multi-task method based on transformers, as a fast and scalable baseline. It is the fastest state-of-the-art multi-HP method while having precision comparable to the most precise bottom-up method. We hope this will encourage research on fast and accurate methods for precise human description.
https://arxiv.org/abs/2201.09594
The objective of human parsing is to partition a human in an image into constituent parts. This task involves labeling each pixel of the human image according to the classes. Since the human body comprises hierarchically structured parts, each body part in an image has its own position distribution characteristics: a human head is unlikely to be below the feet, and arms are more likely to be near the torso. Inspired by this observation, we build instance class distributions by accumulating the original human parsing label in the horizontal and vertical directions, which can be utilized as supervision signals. Using these horizontal and vertical class distribution labels, the network is guided to exploit the intrinsic position distribution of each class. We combine the two guided features to form a spatial guidance map, which is then superimposed onto the baseline network by multiplication and concatenation to distinguish the human parts precisely. We conducted extensive experiments to demonstrate the effectiveness and superiority of our method on three well-known benchmarks: the LIP, ATR, and CIHP databases.
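A minimal sketch of building the horizontal/vertical supervision described above: one-hot the parsing label and accumulate it along each image axis to get a per-class position distribution. The normalisation step is an assumed detail, not necessarily the paper's exact label construction.

```python
import torch
import torch.nn.functional as F

def class_distribution_labels(parsing, n_classes):
    """Accumulate a parsing label map along each axis into class distributions.

    parsing: (N, H, W) integer class labels
    returns: distribution along the width (N, K, W) and along the height (N, K, H)
    """
    onehot = F.one_hot(parsing, n_classes).permute(0, 3, 1, 2).float()  # (N,K,H,W)
    along_width = onehot.sum(dim=2)     # accumulate over rows    -> (N, K, W)
    along_height = onehot.sum(dim=3)    # accumulate over columns -> (N, K, H)
    along_width = along_width / along_width.sum(dim=2, keepdim=True).clamp(min=1)
    along_height = along_height / along_height.sum(dim=2, keepdim=True).clamp(min=1)
    return along_width, along_height

labels = torch.randint(0, 20, (2, 48, 32))
w_dist, h_dist = class_distribution_labels(labels, n_classes=20)
```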
https://arxiv.org/abs/2111.14173
In this work, we focus on Interactive Human Parsing (IHP), which aims to segment a human image into multiple human body parts with guidance from users' interactions. This new task inherits the class-aware property of human parsing, which cannot be well solved by traditional interactive image segmentation approaches that are generally class-agnostic. To tackle this new task, we first exploit user clicks to identify different human parts in the given image. These clicks are subsequently transformed into semantic-aware localization maps, which are concatenated with the RGB image to form the input of the segmentation network and generate the initial parsing result. To enable the network to better perceive the user's purpose during the correction process, we investigate several principal ways for the refinement, and reveal that random-sampling-based click augmentation is the best way to promote correction effectiveness. Furthermore, we also propose a semantic-perceiving loss (SP-loss) to augment the training, which can effectively exploit the semantic relationships of clicks for better optimization. To the best of our knowledge, this work is the first attempt to tackle the human parsing task under the interactive setting. Our IHP solution achieves 85% mIoU on the benchmark LIP, 80% mIoU on PASCAL-Person-Part and CIHP, and 75% mIoU on Helen, with only 1.95, 3.02, 2.84, and 1.09 clicks per class, respectively. These results demonstrate that high-quality human parsing masks can be obtained with only a little human effort. We hope this work can motivate more researchers to develop data-efficient solutions to IHP in the future.
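A minimal sketch of turning user clicks into semantic-aware localization maps that are concatenated with the RGB input, as described above; the Gaussian encoding and its sigma are common choices assumed here, not necessarily the paper's exact transform.

```python
import torch

def click_localization_maps(clicks, n_classes, h, w, sigma=10.0):
    """Encode user clicks as per-class Gaussian localization maps.

    clicks: list of (class_id, y, x) user clicks
    returns: (n_classes, h, w) map, one channel per human-part class
    """
    ys = torch.arange(h).view(h, 1).float()
    xs = torch.arange(w).view(1, w).float()
    maps = torch.zeros(n_classes, h, w)
    for cls, cy, cx in clicks:
        g = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        maps[cls] = torch.maximum(maps[cls], g)   # keep the strongest response
    return maps

maps = click_localization_maps([(3, 40, 60), (7, 120, 64)], n_classes=20, h=256, w=128)
image = torch.randn(3, 256, 128)
net_input = torch.cat([image, maps], dim=0)       # (3 + 20, 256, 128)
```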
https://arxiv.org/abs/2111.06162
Transferring human motion from a source to a target person holds great potential in computer vision and graphics applications. A crucial step is to manipulate sequential future motion while retaining the appearance characteristics. Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person, which is not scalable in practice. This work studies a more general setting, in which we aim to learn a single model that parsimoniously transfers motion from a source video to any target person given only one image of that person, named the Collaborative Parsing-Flow Network (CPF-Net). The paucity of information regarding the target person makes the task particularly challenging to faithfully preserve the appearance in varying designated poses. To address this issue, CPF-Net integrates structured human parsing and appearance flow to guide the realistic foreground synthesis, which is merged into the background by a spatio-temporal fusion module. In particular, CPF-Net decouples the problem into stages of human parsing sequence generation, foreground sequence generation, and final video generation. The human parsing generation stage captures both the pose and the body structure of the target. The appearance flow is beneficial for keeping details in synthesized frames. The integration of human parsing and appearance flow effectively guides the generation of video frames with realistic appearance. Finally, the dedicatedly designed fusion network ensures temporal coherence. We further collect a large set of human dancing videos to push forward this research field. Both quantitative and qualitative results show our method substantially improves over previous approaches and is able to generate appealing and photo-realistic target videos given any input person image. All source code and the dataset will be released at this https URL.
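For readers unfamiliar with appearance flow, a small sketch of the basic warping operation behind flow-guided foreground synthesis; the flow is assumed to be a per-pixel offset field in pixels, and CPF-Net's full pipeline is not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp_by_appearance_flow(source, flow):
    """Warp a source frame by a predicted appearance flow.

    source: (N, C, H, W) source-person image or features
    flow:   (N, 2, H, W) per-pixel (dx, dy) sampling offsets
    """
    n, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # (N, 2, H, W)
    # Normalise to [-1, 1] for grid_sample (x channel first, then y).
    x_norm = 2 * coords[:, 0] / (w - 1) - 1
    y_norm = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack([x_norm, y_norm], dim=-1)               # (N, H, W, 2)
    return F.grid_sample(source, grid, align_corners=True)

# Zero flow is the identity warp.
warped = warp_by_appearance_flow(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```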
https://arxiv.org/abs/2110.14147