With the comprehensive research conducted on various face analysis tasks, there is growing interest among researchers in developing a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent newly introduced semantics. This design enhances the unification of the model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, matching or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at this https URL.
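To make the Layer-Attention idea concrete, below is a minimal PyTorch sketch of one plausible realization, assuming a per-task softmax over the encoder's layer outputs; the class name, tensor shapes, and layer count are illustrative assumptions rather than the paper's verified design.

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Softmax-weighted pooling over the encoder's intermediate layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per encoder layer; softmax turns them into weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of [batch, tokens, dim] tensors, one per layer.
        stacked = torch.stack(layer_feats, dim=0)           # [L, B, T, D]
        weights = torch.softmax(self.layer_logits, dim=0)   # [L]
        return (weights[:, None, None, None] * stacked).sum(dim=0)

# One instance per task lets each task pick its own mix of encoder depths.
feats = [torch.randn(2, 197, 768) for _ in range(12)]
print(LayerAttention(num_layers=12)(feats).shape)  # torch.Size([2, 197, 768])
```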
https://arxiv.org/abs/2403.09500
In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. First, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, namely that the face saliency area is critical for face recognition, we utilize patches localized by extracted facial landmarks rather than randomly cropped image blocks to construct augmentations in pretraining. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations that introduce more diversity of landmark information to further regularize the learning. With the learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state of the art on multiple face recognition benchmarks, especially in the more challenging few-shot scenarios.
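As a rough illustration of landmark-localized cropping in place of random crops, here is a minimal sketch assuming landmark coordinates come from an external detector; the small center jitter loosely stands in for the paper's landmark-specific augmentations and is not its actual recipe.

```python
import numpy as np

def landmark_patches(image, landmarks, size=32, jitter=4, rng=None):
    """Crop one patch per landmark, centered with slight random jitter."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    patches = []
    for (x, y) in landmarks:
        # Jitter the center slightly so repeated views differ.
        cx = int(x) + rng.integers(-jitter, jitter + 1)
        cy = int(y) + rng.integers(-jitter, jitter + 1)
        x0 = np.clip(cx - size // 2, 0, w - size)
        y0 = np.clip(cy - size // 2, 0, h - size)
        patches.append(image[y0:y0 + size, x0:x0 + size])
    return np.stack(patches)  # [num_landmarks, size, size, channels]

img = np.zeros((112, 112, 3), dtype=np.uint8)
lms = np.array([[38, 52], [74, 52], [56, 72], [42, 92], [70, 92]])  # 5-point layout
print(landmark_patches(img, lms).shape)  # (5, 32, 32, 3)
```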
https://arxiv.org/abs/2403.08161
Multimodal deep learning methods capture synergistic features from multiple modalities and have the potential to improve the accuracy of stress detection compared to unimodal methods. However, this accuracy gain typically comes at a high computational cost due to the high-dimensional feature spaces, especially for intermediate fusion. Dimensionality reduction is one way to optimize multimodal learning by simplifying the data and making the features more amenable to processing and analysis, thereby reducing computational complexity. This paper introduces an intermediate multimodal fusion network with manifold-learning-based dimensionality reduction. The multimodal network generates independent representations from biometric signals and facial landmarks through a 1D-CNN and a 2D-CNN. Finally, these features are fused and fed to another 1D-CNN layer, followed by a fully connected dense layer. We compared various dimensionality reduction techniques for different variations of unimodal and multimodal networks. We observe that intermediate-level fusion with the Multi-Dimensional Scaling (MDS) manifold method showed promising results, with an accuracy of 96.00% in a Leave-One-Subject-Out Cross-Validation (LOSO-CV) paradigm, over other dimensionality reduction methods. MDS had the highest computational cost among the manifold learning methods; however, while outperforming other networks, it reduced the computational cost of the proposed networks by 25% compared to six well-known conventional feature selection methods used in the preprocessing step.
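As a sketch of the dimensionality-reduction step, the snippet below applies scikit-learn's MDS to fused features before classification; the feature dimensions and target dimensionality are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import MDS

features = np.random.rand(200, 512)      # e.g., fused biosignal + landmark features
mds = MDS(n_components=16, dissimilarity="euclidean", random_state=0)
reduced = mds.fit_transform(features)    # lower-dimensional embedding for the classifier
print(reduced.shape)                     # (200, 16)
```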
https://arxiv.org/abs/2403.08077
In this paper, we consider a novel and practical case for talking face video generation. Specifically, we focus on scenarios involving multi-person interactions, where the talking context, such as an audience or surroundings, is present. In these situations, video generation should take the context into consideration in order to generate content naturally aligned with the driving audio and spatially coherent with the context. To achieve this, we provide a two-stage, cross-modal, controllable video generation pipeline, taking facial landmarks as an explicit and compact control signal to bridge the driving audio, talking context, and generated videos. Inside this pipeline, we devise a 3D video diffusion model, allowing for efficient control of both the spatial conditions (landmarks and context video) and the audio condition for temporally coherent generation. The experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity, and frame consistency.
https://arxiv.org/abs/2402.18092
In this work, we tackle the challenge of enhancing the realism and expressiveness of talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism.
https://arxiv.org/abs/2402.17485
Video conferencing has attracted much more attention recently. High fidelity and low bandwidth are the two major objectives of video compression for video conferencing applications. Most pioneering methods rely on classic video compression codecs without high-level feature embedding and thus cannot reach extremely low bandwidth. Recent works instead employ model-based neural compression to achieve ultra-low bitrates using sparse representations of each frame, such as facial landmark information; however, these approaches cannot maintain high fidelity due to 2D image-based warping. In this paper, we propose a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing that uses implicit radiance fields to achieve both objectives. We leverage dynamic neural radiance fields to reconstruct a high-fidelity talking head from expression features, which are transmitted in place of full frames. The overall system employs a deep model to encode expression features at the sender and reconstructs the portrait at the receiver with volume rendering as the decoder, for ultra-low bandwidth. In particular, owing to the characteristics of the neural-radiance-fields-based model, our compression approach is resolution-agnostic: the low bandwidth achieved by our approach is independent of video resolution, while fidelity is maintained for higher-resolution reconstruction. Experimental results demonstrate that our novel framework can (1) enable ultra-low-bandwidth video conferencing, (2) maintain high-fidelity portraits, and (3) outperform previous works on high-resolution video compression.
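The receiver-side decoder relies on standard volume rendering; the sketch below shows the generic NeRF-style alpha-compositing step along a single ray, assuming per-sample densities and colors from some radiance field. The expression-feature encoding and the dynamic field itself are not reproduced here.

```python
import torch

def composite_ray(densities, colors, deltas):
    """Standard NeRF volume rendering for one ray.

    densities: [samples], colors: [samples, 3], deltas: [samples] step sizes.
    """
    alphas = 1.0 - torch.exp(-densities * deltas)          # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])         # accumulated transmittance
    weights = alphas * trans
    return (weights[:, None] * colors).sum(dim=0)          # rendered RGB

rgb = composite_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.05))
print(rgb)  # one rendered pixel color
```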
https://arxiv.org/abs/2402.16599
In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose a novel emotion-aware AVSE system that leverages both auditory and visual information. It extracts emotional features from the facial landmarks of the speaker and fuses them with corresponding audio and visual modalities. This enriched data serves as input to a deep UNet-based encoder-decoder network, specifically designed to orchestrate the fusion of multimodal information enhanced with emotion. The network iteratively refines the enhanced speech representation through an encoder-decoder architecture, guided by perceptually-inspired loss functions for joint learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of audio-visual recordings with annotated emotions. Our comprehensive evaluation demonstrates the effectiveness of emotion as a contextual cue for AVSE. By integrating emotional features, the proposed system achieves significant improvements in both objective and subjective assessments of speech quality and intelligibility, especially in challenging noise environments. Compared to baseline AVSE and audio-only speech enhancement systems, our approach exhibits a noticeable increase in PESQ and STOI, indicating higher perceptual quality and intelligibility. Large-scale listening tests corroborate these findings, suggesting improved human understanding of enhanced speech.
https://arxiv.org/abs/2402.16394
Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continues to be difficult due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese-architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone of the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
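To illustrate a DCCA-style objective, here is a minimal sketch that whitens the two branches' covariances and maximizes the sum of canonical correlations (the singular values of the whitened cross-covariance); the regularization constant and feature shapes are assumptions, and the paper's exact formulation may differ in detail.

```python
import torch

def _inv_sqrtm(s):
    # Inverse matrix square root of an SPD matrix via eigendecomposition.
    e, v = torch.linalg.eigh(s)
    return v @ torch.diag(e.clamp_min(1e-12).rsqrt()) @ v.T

def dcca_loss(h1, h2, eps=1e-4):
    # h1, h2: [batch, dim] features from the two Siamese branches (two views).
    n = h1.size(0)
    h1 = h1 - h1.mean(dim=0)
    h2 = h2 - h2.mean(dim=0)
    s11 = (h1.T @ h1) / (n - 1) + eps * torch.eye(h1.size(1))
    s22 = (h2.T @ h2) / (n - 1) + eps * torch.eye(h2.size(1))
    s12 = (h1.T @ h2) / (n - 1)
    t = _inv_sqrtm(s11) @ s12 @ _inv_sqrtm(s22)
    # Singular values of T are the canonical correlations; maximize their sum.
    return -torch.linalg.svdvals(t).sum()

loss = dcca_loss(torch.randn(128, 32), torch.randn(128, 32))
print(loss)
```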
https://arxiv.org/abs/2402.15044
Facial video inpainting plays a crucial role in a wide range of applications, including but not limited to the removal of obstructions in video conferencing and telemedicine, enhancement of facial expression analysis, privacy protection, integration of graphical overlays, and virtual makeup. This domain presents serious challenges due to the intricate nature of facial features and the inherent human familiarity with faces, heightening the need for accurate and persuasive completions. In addressing challenges specifically related to occlusion removal in this context, we focus on the progressive task of generating complete images from facial data covered by masks, ensuring both spatial and temporal coherence. Our study introduces a network designed for expression-based video inpainting, employing generative adversarial networks (GANs) to handle static and moving occlusions across all frames. By utilizing facial landmarks and an occlusion-free reference image, our model maintains the user's identity consistently across frames. We further enhance emotional preservation through a customized facial expression recognition (FER) loss function, ensuring detailed inpainted outputs. Our proposed framework adaptively eliminates occlusions from facial videos, whether they appear static or dynamic across frames, while providing realistic and coherent results.
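As a sketch of a customized FER loss, the snippet below penalizes differences under a frozen expression classifier; the stand-in classifier at the bottom is purely illustrative, and comparing logits with an L1 distance is one plausible choice rather than the paper's confirmed design.

```python
import torch
import torch.nn as nn

class FERLoss(nn.Module):
    """Expression-preservation loss under a frozen FER network."""
    def __init__(self, fer_net: nn.Module):
        super().__init__()
        self.fer = fer_net.eval()
        for p in self.fer.parameters():      # keep the FER network frozen
            p.requires_grad_(False)

    def forward(self, inpainted, target):
        # Compare FER outputs of inpainted vs. ground-truth frames.
        return nn.functional.l1_loss(self.fer(inpainted), self.fer(target))

# Usage with a stand-in classifier (7 basic emotions):
fer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 7))
loss = FERLoss(fer)(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
print(loss)
```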
https://arxiv.org/abs/2402.09100
Criminal and suspicious activity detection has become a popular research topic in recent years. The rapid growth of computer vision technologies has had a crucial impact on solving this issue. However, physical stalking detection is still a less explored area despite the evolution of modern technology. Nowadays, stalking in public places has become a common occurrence, with women being the most affected. Stalking is a visible action that usually occurs before any criminal activity begins, as the stalker starts to follow, loiter, and stare at the victim before committing crimes such as assault, kidnapping, and rape. Therefore, detecting stalking has become a necessity, as these criminal activities can be stopped at the outset through stalking detection. In this research, we propose a novel deep learning-based hybrid fusion model to detect potential stalkers from a single video with a minimal number of frames. We extract multiple relevant features, such as facial landmarks, head pose estimation, and relative distance, as numerical values from video frames. These data are fed into a multilayer perceptron (MLP) to classify between stalking and non-stalking scenarios. Simultaneously, the video frames are fed into a combination of convolutional and LSTM models to extract spatio-temporal features. We use a fusion of these numerical and spatio-temporal features to build a classifier for detecting stalking incidents. Additionally, we introduce a dataset consisting of stalking and non-stalking videos gathered from various feature films and television series, which is also used to train the model. The experimental results show the efficiency and dynamism of our proposed stalker detection system, which achieves 89.58% testing accuracy, a significant improvement compared to state-of-the-art approaches.
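A minimal sketch of the described hybrid fusion follows: an MLP over per-frame numeric features (landmarks, head pose, distances) alongside a CNN+LSTM over raw frames, with the two feature vectors concatenated for binary classification. All layer sizes and the 140-dimensional numeric input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StalkerNet(nn.Module):
    def __init__(self, numeric_dim=140, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(numeric_dim, hidden), nn.ReLU())
        self.cnn = nn.Sequential(                       # per-frame spatial features
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())      # -> 16 * 4 * 4 = 256
        self.lstm = nn.LSTM(256, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)            # stalking vs. non-stalking

    def forward(self, numeric, frames):
        # numeric: [B, numeric_dim]; frames: [B, T, 3, H, W]
        b, t = frames.shape[:2]
        spatial = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(spatial)                # temporal summary
        fused = torch.cat([self.mlp(numeric), h_n[-1]], dim=-1)
        return self.head(fused)

logits = StalkerNet()(torch.rand(2, 140), torch.rand(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 2])
```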
https://arxiv.org/abs/2402.03417
Human face generation and editing represent an essential task in the era of computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance, using face segmentation to guide image generation. However, it may be challenging for some users to create these conditioning modalities manually. Thus, we introduce M3Face, a unified multi-modal multilingual framework for controllable face generation and editing. This framework enables users to utilize only text input to generate the controlling modalities automatically, for instance, semantic segmentation or facial landmarks, and subsequently generate face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and multiple captions for each image in multiple languages. The code and the dataset will be released upon publication.
https://arxiv.org/abs/2402.02369
While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook associating a set of spatial action units with a facial expression. In this paper, we follow the same expert footsteps and propose a learning strategy that allows us to explicitly incorporate spatial action unit (AU) cues into the classifier's training to build a deep interpretable model. In particular, using this AU codebook, the input image's expression label, and facial landmarks, a single action-unit heatmap is built to indicate the most discriminative regions of interest in the image w.r.t. the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of the classifier to be correlated with the AU map. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with the AU maps, simulating the experts' decision process. This is achieved using only the image's expression class as supervision and without any extra manual annotations. Moreover, our method is generic. It can be applied to any CNN- or transformer-based deep classifier without requiring architectural changes or significant additional training time. Our extensive evaluation on two public benchmark datasets, RAF-DB and AffectNet, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on Class-Activation Mapping (CAM) methods, and we show that our training technique improves CAM interpretability.
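To illustrate how an action-unit heatmap can be built from landmarks, here is a minimal sketch placing Gaussians at the landmark locations tied to an expression's AUs; the AU-to-landmark association shown is a stand-in for the paper's codebook, and the classifier-correlation constraint itself is not shown.

```python
import numpy as np

def au_heatmap(landmarks, size=56, sigma=3.0):
    """Sum of Gaussians centered on the AU-relevant landmark locations."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size), dtype=np.float32)
    for (x, y) in landmarks:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat / max(heat.max(), 1e-8)   # normalize to [0, 1]

# e.g., AU12 (lip-corner puller) for "happiness" -> mouth-corner landmarks
mouth_corners = np.array([[18, 40], [38, 40]])
print(au_heatmap(mouth_corners).shape)  # (56, 56)
```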
https://arxiv.org/abs/2402.00281
In this paper, we propose a novel approach for conducting face morphing attacks, which utilizes optimal-landmark-guided image blending. Current face morphing attacks can be categorized into landmark-based and generation-based approaches. Landmark-based methods use geometric transformations to warp facial regions according to averaged landmarks but often produce morphed images with poor visual quality. Generation-based methods, which employ generative models to blend multiple face images, can achieve better visual quality but often fail to generate morphed images that can effectively evade state-of-the-art face recognition systems (FRSs). Our proposed method overcomes the limitations of previous approaches by optimizing the morphing landmarks and using Graph Convolutional Networks (GCNs) to combine landmark and appearance features. We model facial landmarks as nodes in a fully connected bipartite graph and utilize GCNs to simulate their spatial and structural relationships. The aim is to capture variations in facial shape and enable accurate manipulation of facial appearance features during the warping process, resulting in morphed facial images that are highly realistic and visually faithful. Experiments on two public datasets prove that our method inherits the advantages of previous landmark-based and generation-based methods and generates morphed images of higher quality, posing a more significant threat to state-of-the-art FRSs.
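As a sketch of message passing over the landmark graph, the snippet below implements a basic graph-convolution layer with mean aggregation over a dense bipartite adjacency linking the two images' landmark sets; the feature dimensions and normalization scheme are assumptions rather than the paper's exact GCN.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: [nodes, in_dim]; adj: [nodes, nodes] with self-loops.
        deg = adj.sum(dim=-1, keepdim=True).clamp_min(1.0)
        return torch.relu(self.lin((adj / deg) @ x))   # mean-aggregate, then project

# 68 landmarks per source image; bipartite edges connect the two landmark sets.
n = 68
adj = torch.zeros(2 * n, 2 * n)
adj[:n, n:] = 1.0
adj[n:, :n] = 1.0
adj += torch.eye(2 * n)                                # self-loops
feats = GCNLayer(in_dim=34, out_dim=64)(torch.rand(2 * n, 34), adj)
print(feats.shape)  # torch.Size([136, 64])
```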
https://arxiv.org/abs/2401.16722
Head-mounted displays (HMDs) serve as indispensable devices for observing extended reality (XR) environments and virtual content. However, HMDs present an obstacle to external recording techniques as they block the upper face of the user. This limitation significantly affects social XR applications, specifically teleconferencing, where facial features and eye gaze information play a vital role in creating an immersive user experience. In this study, we propose a new network for expression-aware video inpainting for HMD removal (EVI-HRnet) based on generative adversarial networks (GANs). Our model effectively fills in the missing information using facial landmarks and a single occlusion-free reference image of the user. The framework and its components ensure the preservation of the user's identity across frames using the reference frame. To further improve the realism of the inpainted output, we introduce a novel facial expression recognition (FER) loss function for emotion preservation. Our results demonstrate the remarkable capability of the proposed framework to remove HMDs from facial videos while maintaining the subject's facial expression and identity. Moreover, the outputs exhibit temporal consistency across the inpainted frames. This lightweight framework presents a practical approach for HMD occlusion removal, with the potential to enhance various collaborative XR applications without the need for additional hardware.
https://arxiv.org/abs/2401.14136
Recently, deep learning-based facial landmark detection for in-the-wild faces has achieved significant improvement. However, face landmark detection in other domains (e.g., cartoons and caricatures) remains challenging due to the scarcity of extensively annotated training data. To tackle this concern, we design a two-stage training approach that effectively leverages limited datasets and a pre-trained diffusion model to obtain aligned pairs of landmarks and faces in multiple domains. In the first stage, we train a landmark-conditioned face generation model on a large dataset of real faces. In the second stage, we fine-tune the above model on a small dataset of image-landmark pairs with text prompts for controlling the domain. Our new designs enable our method to generate high-quality synthetic paired datasets in multiple domains while preserving the alignment between landmarks and facial features. Finally, we fine-tune a pre-trained face landmark detection model on the synthetic dataset to achieve multi-domain face landmark detection. Our qualitative and quantitative results demonstrate that our method outperforms existing methods on multi-domain face landmark detection.
https://arxiv.org/abs/2401.13191
Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
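The final fitting step reduces to ordinary least squares over the 8-dimensional descriptors; the sketch below shows that step with NumPy, stubbing the descriptors and calibration targets with random data since the descriptor definitions themselves (derived from 3D face and iris landmarks) are paper-specific.

```python
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 8))          # 8-D geometric descriptors per frame
gaze = rng.normal(size=(500, 2))                 # calibration targets (yaw, pitch)

X = np.hstack([descriptors, np.ones((500, 1))])  # add a bias column
W, *_ = np.linalg.lstsq(X, gaze, rcond=None)     # [9, 2] linear gaze model

predicted = X @ W                                # gaze estimates
print(predicted.shape)                           # (500, 2)
```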
https://arxiv.org/abs/2401.00406
High-fidelity and efficient audio-driven talking head generation has been a key research topic in computer graphics and computer vision. In this work, we study vector-image-based audio-driven talking head generation. Compared with directly animating the raster images most widely used in existing works, vector images enjoy excellent scalability across many applications. There are two main challenges for vector-image-based talking head generation: high-quality vector image reconstruction w.r.t. the source portrait image and vivid animation w.r.t. the audio signal. To address these, we propose a novel scalable vector graphic reconstruction and animation method, dubbed VectorTalker. Specifically, for high-fidelity reconstruction, VectorTalker hierarchically reconstructs the vector image in a coarse-to-fine manner. For vivid audio-driven facial animation, we propose to use facial landmarks as an intermediate motion representation and propose an efficient landmark-driven vector image deformation module. Our approach can handle various styles of portrait images within a unified framework, including Japanese manga, cartoons, and photorealistic images. We conduct extensive quantitative and qualitative evaluations, and the experimental results demonstrate the superiority of VectorTalker in both vector graphic reconstruction and audio-driven animation.
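As a rough stand-in for the landmark-driven vector image deformation module, the sketch below interpolates landmark displacements into a smooth field with SciPy's RBFInterpolator and applies it to the paths' control points; the actual module in the paper is learned and likely differs.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

src_landmarks = np.random.rand(68, 2)                          # source portrait landmarks
dst_landmarks = src_landmarks + 0.02 * np.random.randn(68, 2)  # audio-predicted targets

# Fit a smooth displacement field from the landmark motion.
field = RBFInterpolator(src_landmarks, dst_landmarks - src_landmarks, smoothing=1e-3)

control_points = np.random.rand(500, 2)        # Bezier control points of the SVG paths
deformed = control_points + field(control_points)
print(deformed.shape)  # (500, 2)
```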
https://arxiv.org/abs/2312.11568
Face swapping has gained significant traction, driven by the plethora of human face synthesis techniques facilitated by deep learning methods. However, previous face swapping methods that used generative adversarial networks (GANs) as backbones have faced challenges such as inconsistency in blending, distortions, artifacts, and issues with training stability. To address these limitations, we propose an innovative end-to-end framework for high-fidelity face swapping. First, we introduce a StyleGAN-based facial attribute encoder that extracts essential features from faces and inverts them into a latent style code, encapsulating the facial attributes indispensable for successful face swapping. Second, we introduce an attention-based style blending module to effectively transfer face IDs from source to target. To ensure accurate and high-quality transfer, a series of constraint measures, including contrastive face ID learning, facial landmark alignment, and dual swap consistency, is implemented. Finally, the blended style code is translated back to the image space via the style decoder, which has high training stability and generative capability. Extensive experiments on the CelebA-HQ dataset highlight the superior visual quality of images generated by our face-swapping methodology compared to other state-of-the-art methods, as well as the effectiveness of each proposed module. Source code and weights will be publicly available.
https://arxiv.org/abs/2312.10843
Dynamic NeRFs have recently garnered growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges persist in enhancing efficiency and effectiveness. We present R2-Talker, an efficient and effective framework enabling realistic real-time talking head synthesis. Specifically, using multi-resolution hash grids, we introduce a novel approach for encoding facial landmarks as conditional features. This approach losslessly encodes landmark structures as conditional features, decoupling input diversity and conditional spaces by mapping arbitrary landmarks to a unified feature space. We further propose a scheme of progressive multilayer conditioning in the NeRF rendering pipeline for effective conditional feature fusion. As demonstrated by extensive experiments, our new approach has the following advantages over state-of-the-art works: 1) the lossless input encoding enables acquiring more precise features, yielding superior visual quality, and the decoupling of inputs and conditional spaces improves generalizability; 2) fusing conditional features with the MLP outputs at each MLP layer enhances the conditional impact, resulting in more accurate lip synthesis and better visual quality; 3) it compactly structures the fusion of conditional features, significantly enhancing computational efficiency.
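To make the hash-grid landmark encoding concrete, here is a minimal instant-NGP-style sketch for 2D points: each level hashes cell corners into a small learnable table and bilinearly interpolates the stored features. The level count, table size, and prime-based hash follow common practice and are not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn

class HashGridEncoder2D(nn.Module):
    """Instant-NGP-style multi-resolution hash encoding for 2D points."""
    def __init__(self, levels=4, table_size=2 ** 14, feat_dim=2):
        super().__init__()
        self.res = [16 * 2 ** i for i in range(levels)]     # per-level grid resolution
        self.tables = nn.ParameterList(
            nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in self.res)
        self.table_size = table_size

    def _hash(self, ij):
        # XOR of coordinate-prime products, as in instant-NGP.
        return (ij[..., 0] * 1 ^ ij[..., 1] * 2654435761) % self.table_size

    def forward(self, xy):
        # xy: [N, 2] landmark coordinates normalized to [0, 1].
        outs = []
        for res, table in zip(self.res, self.tables):
            pos = xy * (res - 1)
            i0, f = pos.floor().long(), pos - pos.floor()   # cell corner + weights
            feats = 0.0
            for dx in (0, 1):
                for dy in (0, 1):
                    w = (f[:, 0] if dx else 1 - f[:, 0]) * (f[:, 1] if dy else 1 - f[:, 1])
                    corner = i0 + torch.tensor([dx, dy])
                    feats = feats + w[:, None] * table[self._hash(corner)]
            outs.append(feats)
        return torch.cat(outs, dim=-1)                      # [N, levels * feat_dim]

codes = HashGridEncoder2D()(torch.rand(68, 2))  # 68 landmarks -> conditional features
print(codes.shape)                              # torch.Size([68, 8])
```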
https://arxiv.org/abs/2312.05572
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model (i.e., S2D) for DFER by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, and TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a fraction of trainable parameters (less than +10%) to the original image model. Moreover, we present a novel Emotion-Anchor-based Self-Distillation Loss (using reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve state-of-the-art results.
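As an illustration of what a Temporal-Modeling Adapter might look like, below is a minimal bottleneck adapter that mixes information across frames with a depthwise temporal convolution and adds the result back residually; the dimensions and the choice of temporal operator are assumptions, not the paper's verified TMA.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                  padding=1, groups=bottleneck)  # depthwise over time
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity: adapter adds nothing
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # x: [batch, frames, tokens, dim] token features from the image model.
        b, t, n, d = x.shape
        h = self.down(x)                                   # [B, T, N, bottleneck]
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)    # conv over the time axis
        h = self.temporal(h).reshape(b, n, -1, t).permute(0, 3, 1, 2)
        return x + self.up(h)                              # residual update

out = TemporalAdapter()(torch.rand(2, 8, 197, 768))
print(out.shape)  # torch.Size([2, 8, 197, 768])
```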
https://arxiv.org/abs/2312.05447