Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as under head pose changes, exaggerated expressions, or uneven illumination, remains difficult due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese-architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
https://arxiv.org/abs/2402.15044
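The abstract does not spell out the loss, so below is a minimal sketch of a DCCA-style correlation objective between the two Siamese branches' feature batches, following the classic formulation (Andrew et al., 2013) rather than the paper's exact one; the function name, the regularization constant, and the use of the full correlation sum are assumptions.

```python
import torch

def dcca_loss(h1, h2, eps=1e-4):
    """Negative total canonical correlation between two views.

    h1, h2: (batch, dim) features from the two Siamese branches.
    A sketch of the classic DCCA objective; the paper's exact
    formulation may differ."""
    n, d = h1.shape
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)

    s11 = h1.t() @ h1 / (n - 1) + eps * torch.eye(d, device=h1.device)
    s22 = h2.t() @ h2 / (n - 1) + eps * torch.eye(d, device=h2.device)
    s12 = h1.t() @ h2 / (n - 1)

    def inv_sqrt(s):
        # Inverse matrix square root via eigendecomposition (s is SPD).
        w, v = torch.linalg.eigh(s)
        return v @ torch.diag(w.clamp_min(eps).rsqrt()) @ v.t()

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The singular values of T are the canonical correlations.
    return -torch.linalg.svdvals(t).sum()  # minimize to maximize correlation
```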
Facial video inpainting plays a crucial role in a wide range of applications, including but not limited to the removal of obstructions in video conferencing and telemedicine, enhancement of facial expression analysis, privacy protection, integration of graphical overlays, and virtual makeup. This domain presents serious challenges due to the intricate nature of facial features and the inherent human familiarity with faces, heightening the need for accurate and persuasive completions. To address occlusion removal in this context, we focus on the progressive task of generating complete images from facial data covered by masks, ensuring both spatial and temporal coherence. Our study introduces a network designed for expression-based video inpainting, employing generative adversarial networks (GANs) to handle static and moving occlusions across all frames. By utilizing facial landmarks and an occlusion-free reference image, our model maintains the user's identity consistently across frames. We further enhance emotional preservation through a customized facial expression recognition (FER) loss function, ensuring detailed inpainted outputs. Our proposed framework adaptively removes occlusions from facial videos, whether they appear static or dynamic across frames, while providing realistic and coherent results.
https://arxiv.org/abs/2402.09100
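The customized FER loss is described only at a high level. One plausible reading, sketched below, is a perceptual-style loss computed by a frozen expression classifier on inpainted versus ground-truth frames; `fer_net` is a hypothetical pretrained FER network, and the KL form is an assumption.

```python
import torch
import torch.nn.functional as F

class FERLoss(torch.nn.Module):
    """Expression-preservation loss: compare a frozen FER classifier's
    predicted expression distributions on inpainted vs. target frames."""

    def __init__(self, fer_net):
        super().__init__()
        self.fer_net = fer_net.eval()
        for p in self.fer_net.parameters():
            p.requires_grad_(False)          # the FER net is a fixed critic

    def forward(self, inpainted, target):
        log_fake = F.log_softmax(self.fer_net(inpainted), dim=1)
        real = F.softmax(self.fer_net(target), dim=1).detach()
        return F.kl_div(log_fake, real, reduction="batchmean")
```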
Criminal and suspicious activity detection has become a popular research topic in recent years. The rapid growth of computer vision technologies has had a crucial impact on solving this issue. However, physical stalking detection remains a less explored area despite the evolution of modern technology. Nowadays, stalking in public places has become a common occurrence, with women being the most affected. Stalking is a visible action that usually precedes a crime: the stalker follows, loiters around, and stares at the victim before committing offenses such as assault, kidnapping, or rape. Detecting stalking is therefore essential, as these crimes can be prevented before they occur. In this research, we propose a novel deep learning-based hybrid fusion model to detect potential stalkers from a single video with a minimal number of frames. We extract multiple relevant features, such as facial landmarks, head pose estimation, and relative distance, as numerical values from video frames. This data is fed into a multilayer perceptron (MLP) to classify between stalking and non-stalking scenarios. Simultaneously, the video frames are fed into a combination of convolutional and LSTM models to extract spatio-temporal features. We fuse these numerical and spatio-temporal features to build a classifier that detects stalking incidents. Additionally, we introduce a dataset consisting of stalking and non-stalking videos gathered from various feature films and television series, which is also used to train the model. The experimental results show the efficiency and effectiveness of our proposed stalker detection system, which achieves 89.58% test accuracy, a significant improvement over state-of-the-art approaches.
https://arxiv.org/abs/2402.03417
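As a rough illustration of the described two-stream design, here is a minimal PyTorch sketch that fuses per-video numeric features (e.g., 68 landmark coordinates, head-pose angles, relative distance) from an MLP with CNN + LSTM spatio-temporal features; all layer sizes and the feature dimension are guesses, not the paper's.

```python
import torch
import torch.nn as nn

class StalkerFusionNet(nn.Module):
    """Two-stream sketch: numeric features -> MLP; frames -> CNN + LSTM;
    the two embeddings are concatenated for binary classification."""

    def __init__(self, num_feat_dim=140, hidden=128):
        # 140 = 68 landmarks x 2 + 3 head-pose angles + 1 distance (a guess)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.cnn = nn.Sequential(                 # per-frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, frames, numeric):
        # frames: (B, T, 3, H, W); numeric: (B, num_feat_dim)
        b, t = frames.shape[:2]
        per_frame = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(per_frame)          # spatio-temporal stream
        fused = torch.cat([h[-1], self.mlp(numeric)], dim=1)
        return self.head(fused)                   # {stalking, non-stalking}
```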
Human face generation and editing represent an essential task in the era of computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance, using face segmentation to guide image generation. However, it may be challenging for some users to create these conditioning modalities manually. Thus, we introduce M3Face, a unified multi-modal multilingual framework for controllable face generation and editing. This framework enables users to generate controlling modalities, such as semantic segmentation or facial landmarks, automatically from text input alone, and subsequently generate face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and captions in multiple languages for each image. The code and the dataset will be released upon publication.
https://arxiv.org/abs/2402.02369
While state-of-the-art facial expression recognition (FER) classifiers achieve a high level of accuracy, they lack interpretability, an important aspect for end-users. To recognize basic facial expressions, experts resort to a codebook associating a set of spatial action units (AUs) to a facial expression. In this paper, we follow the same expert footsteps and propose a learning strategy that allows us to explicitly incorporate spatial AU cues into the classifier's training to build a deep interpretable model. In particular, using this AU codebook, the input image's expression label, and facial landmarks, a single action-unit heatmap is built to indicate the most discriminative regions of interest in the image w.r.t. the facial expression. We leverage this valuable spatial cue to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU maps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the experts' decision process. This is achieved using only the image-level expression label as supervision and without any extra manual annotations. Moreover, our method is generic: it can be applied to any CNN- or transformer-based deep classifier without architectural changes or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on Class-Activation Mapping (CAM) methods, and we show that our training technique improves CAM interpretability.
https://arxiv.org/abs/2402.00281
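The abstract constrains spatial layer features to be correlated with the AU map. One simple way to realize such a constraint, sketched below under that assumption, is a Pearson-correlation penalty between a layer's channel-mean attention and the AU heatmap; the exact composite loss in the paper likely differs.

```python
import torch
import torch.nn.functional as F

def au_alignment_loss(feat, au_map, eps=1e-8):
    """Pearson-correlation penalty between a layer's spatial attention
    (channel mean of its features) and the AU heatmap.

    feat: (B, C, H, W) layer features; au_map: (B, h, w) AU heatmap.
    Returns 0 when attention and AU map are perfectly correlated."""
    attn = feat.mean(dim=1, keepdim=True)                       # (B, 1, H, W)
    au = F.interpolate(au_map.unsqueeze(1), size=attn.shape[-2:],
                       mode="bilinear", align_corners=False)
    a = attn.flatten(1) - attn.flatten(1).mean(dim=1, keepdim=True)
    m = au.flatten(1) - au.flatten(1).mean(dim=1, keepdim=True)
    corr = (a * m).sum(dim=1) / (a.norm(dim=1) * m.norm(dim=1) + eps)
    return (1 - corr).mean()
```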
In this paper, we propose a novel approach for conducting face morphing attacks, which utilizes optimal-landmark-guided image blending. Current face morphing attacks can be categorized into landmark-based and generation-based approaches. Landmark-based methods use geometric transformations to warp facial regions according to averaged landmarks but often produce morphed images with poor visual quality. Generation-based methods, which employ generation models to blend multiple face images, can achieve better visual quality but are often unsuccessful in generating morphed images that can effectively evade state-of-the-art face recognition systems (FRSs). Our proposed method overcomes the limitations of previous approaches by optimizing the morphing landmarks and using Graph Convolutional Networks (GCNs) to combine landmark and appearance features. We model facial landmarks as nodes in a fully connected bipartite graph and utilize GCNs to simulate their spatial and structural relationships. The aim is to capture variations in facial shape and enable accurate manipulation of facial appearance features during the warping process, resulting in morphed facial images that are highly realistic and visually faithful. Experiments on two public datasets show that our method inherits the advantages of previous landmark-based and generation-based methods and generates morphed images with higher quality, posing a more significant threat to state-of-the-art FRSs.
https://arxiv.org/abs/2401.16722
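As a hedged illustration of the graph step, the layer below applies the standard GCN propagation rule over a dense landmark graph with symmetric normalization; how the paper builds its fully connected bipartite adjacency and mixes appearance features is not specified, so only the basic propagation is shown.

```python
import torch
import torch.nn as nn

class LandmarkGCNLayer(nn.Module):
    """One GCN layer, X' = relu(D^{-1/2} A D^{-1/2} X W), over a dense
    landmark graph with self-loops. The adjacency here is simply fully
    connected; the paper's bipartite construction is not specified."""

    def __init__(self, in_dim, out_dim, num_nodes):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        a = torch.ones(num_nodes, num_nodes) + torch.eye(num_nodes)
        d = a.sum(dim=1).rsqrt().diag()                 # D^{-1/2}
        self.register_buffer("a_norm", d @ a @ d)       # sym. normalized A

    def forward(self, x):
        # x: (B, num_nodes, in_dim) concatenated landmark/appearance features
        return torch.relu(self.lin(self.a_norm @ x))
```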
Head-mounted displays (HMDs) serve as indispensable devices for observing extended reality (XR) environments and virtual content. However, HMDs present an obstacle to external recording techniques as they block the upper face of the user. This limitation significantly affects social XR applications, specifically teleconferencing, where facial features and eye gaze information play a vital role in creating an immersive user experience. In this study, we propose a new network for expression-aware video inpainting for HMD removal (EVI-HRnet) based on generative adversarial networks (GANs). Our model effectively fills in the missing information using facial landmarks and a single occlusion-free reference image of the user. The framework and its components use the reference frame to ensure the preservation of the user's identity across frames. To further improve the realism of the inpainted output, we introduce a novel facial expression recognition (FER) loss function for emotion preservation. Our results demonstrate the remarkable capability of the proposed framework to remove HMDs from facial videos while maintaining the subject's facial expression and identity. Moreover, the outputs exhibit temporal consistency across the inpainted frames. This lightweight framework presents a practical approach for HMD occlusion removal, with the potential to enhance various collaborative XR applications without the need for additional hardware.
https://arxiv.org/abs/2401.14136
Recently, deep learning-based facial landmark detection for in-the-wild faces has achieved significant improvement. However, landmark detection in other domains (e.g., cartoon, caricature) remains challenging due to the scarcity of extensively annotated training data. To tackle this, we design a two-stage training approach that effectively leverages limited datasets and a pre-trained diffusion model to obtain aligned landmark-face pairs in multiple domains. In the first stage, we train a landmark-conditioned face generation model on a large dataset of real faces. In the second stage, we fine-tune the above model on a small dataset of image-landmark pairs with text prompts for controlling the domain. Our new designs enable our method to generate high-quality synthetic paired datasets from multiple domains while preserving the alignment between landmarks and facial features. Finally, we fine-tune a pre-trained facial landmark detection model on the synthetic dataset to achieve multi-domain facial landmark detection. Our qualitative and quantitative results demonstrate that our method outperforms existing methods on multi-domain facial landmark detection.
https://arxiv.org/abs/2401.13191
Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
https://arxiv.org/abs/2401.00406
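The linear-equations step lends itself to a compact sketch: fit a least-squares map from the 8-dimensional geometric descriptor to gaze angles. The descriptor contents and the synthetic data below are placeholders, not the paper's features.

```python
import numpy as np

def fit_gaze_model(descriptors, gaze_angles):
    """Least-squares fit from 8-D geometric descriptors to (yaw, pitch).

    descriptors: (N, 8); gaze_angles: (N, 2) in degrees."""
    X = np.hstack([descriptors, np.ones((len(descriptors), 1))])  # + bias
    W, *_ = np.linalg.lstsq(X, gaze_angles, rcond=None)
    return W                                                      # (9, 2)

def predict_gaze(W, descriptor):
    return np.append(descriptor, 1.0) @ W

# Usage with synthetic placeholder data:
rng = np.random.default_rng(0)
D = rng.normal(size=(500, 8))                       # fake descriptors
G = D @ rng.normal(size=(8, 2)) + rng.normal(scale=0.1, size=(500, 2))
W = fit_gaze_model(D, G)
print(np.abs(predict_gaze(W, D[0]) - G[0]))         # per-axis error
```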
High-fidelity and efficient audio-driven talking head generation has been a key research topic in computer graphics and computer vision. In this work, we study vector-image-based audio-driven talking head generation. Compared with directly animating raster images, as most existing works do, vector images offer excellent scalability across many applications. There are two main challenges for vector-image-based talking head generation: high-quality vector image reconstruction w.r.t. the source portrait image, and vivid animation w.r.t. the audio signal. To address these, we propose a novel scalable vector graphic reconstruction and animation method, dubbed VectorTalker. Specifically, for high-fidelity reconstruction, VectorTalker hierarchically reconstructs the vector image in a coarse-to-fine manner. For vivid audio-driven facial animation, we propose to use facial landmarks as an intermediate motion representation and introduce an efficient landmark-driven vector image deformation module. Our approach can handle various styles of portrait images within a unified framework, including Japanese manga, cartoons, and photorealistic images. We conduct extensive quantitative and qualitative evaluations, and the experimental results demonstrate the superiority of VectorTalker in both vector graphic reconstruction and audio-driven animation.
https://arxiv.org/abs/2312.11568
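The landmark-driven vector image deformation module is learned in the paper; a hand-crafted stand-in that conveys the idea is to move each vector-graphic control point by a Gaussian-weighted blend of landmark displacements, as sketched below (the kernel width `sigma` is an arbitrary choice).

```python
import numpy as np

def deform_control_points(points, lm_src, lm_dst, sigma=0.05):
    """Move vector-graphic control points by a Gaussian-weighted blend
    of landmark displacements.

    points: (P, 2) control points; lm_src, lm_dst: (L, 2) landmarks;
    all in normalized [0, 1] image coordinates."""
    disp = lm_dst - lm_src                                         # (L, 2)
    d2 = ((points[:, None, :] - lm_src[None, :, :]) ** 2).sum(-1)  # (P, L)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-8                       # normalize
    return points + w @ disp
```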
Face swapping has gained significant traction, driven by the abundance of human face synthesis techniques enabled by deep learning. However, previous face swapping methods that used generative adversarial networks (GANs) as backbones have faced challenges such as inconsistent blending, distortions, artifacts, and training instability. To address these limitations, we propose an innovative end-to-end framework for high-fidelity face swapping. First, we introduce a StyleGAN-based facial attribute encoder that extracts essential features from faces and inverts them into a latent style code, encapsulating the facial attributes indispensable for successful face swapping. Second, we introduce an attention-based style blending module to effectively transfer face IDs from source to target. To ensure accurate, high-quality transfer, a series of constraints is imposed, including contrastive face-ID learning, facial landmark alignment, and dual swap consistency. Finally, the blended style code is translated back to the image space via the style decoder, which offers high training stability and generative capability. Extensive experiments on the CelebA-HQ dataset highlight the superior visual quality of images generated by our face-swapping methodology compared to other state-of-the-art methods, and the effectiveness of each proposed module. Source code and weights will be publicly available.
https://arxiv.org/abs/2312.10843
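As a hedged sketch of the attention-based style blending module, the block below lets target style tokens attend to source style tokens (e.g., treating W+ layers as tokens); the dimensions, the residual form, and the normalization are assumptions.

```python
import torch
import torch.nn as nn

class StyleBlendAttention(nn.Module):
    """Target style tokens attend to source style tokens so that
    identity-bearing components flow from source to target."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_style, source_style):
        # (B, n_tokens, dim) latent style codes, e.g. W+ layers as tokens
        blended, _ = self.attn(query=target_style, key=source_style,
                               value=source_style)
        return self.norm(target_style + blended)     # residual blend
```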
Dynamic NeRFs have recently garnered growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges persist in enhancing efficiency and effectiveness. We present R2-Talker, an efficient and effective framework enabling realistic real-time talking head synthesis. Specifically, using multi-resolution hash grids, we introduce a novel approach for encoding facial landmarks as conditional features. This approach losslessly encodes landmark structures as conditional features, decoupling input diversity and conditional spaces by mapping arbitrary landmarks to a unified feature space. We further propose a scheme of progressive multilayer conditioning in the NeRF rendering pipeline for effective conditional feature fusion. As extensive comparisons with state-of-the-art works demonstrate, our new approach has the following advantages: 1) The lossless input encoding enables acquiring more precise features, yielding superior visual quality, and the decoupling of inputs and conditional spaces improves generalizability. 2) Fusing conditional features with the MLP outputs at each MLP layer enhances the conditional impact, resulting in more accurate lip synthesis and better visual quality. 3) It compactly structures the fusion of conditional features, significantly enhancing computational efficiency.
https://arxiv.org/abs/2312.05572
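A simplified sketch of multi-resolution hash-grid encoding of landmark coordinates (in the spirit of Instant NGP) is shown below. For brevity it hashes only the nearest grid vertex per level rather than interpolating cell corners, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class HashGridLandmarkEncoder(nn.Module):
    """Nearest-vertex multi-resolution hash encoding of 3D landmarks.

    A simplified sketch after Instant NGP: each level hashes the nearest
    grid vertex of every landmark into a learned feature table. The full
    method would interpolate the surrounding cell corners instead."""

    PRIMES = (1, 2654435761, 805459861)   # standard spatial-hash primes

    def __init__(self, levels=8, table_size=2 ** 14, feat_dim=2,
                 base_res=16, growth=1.5):
        super().__init__()
        self.res = [int(base_res * growth ** i) for i in range(levels)]
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, feat_dim) for _ in range(levels))
        self.table_size = table_size

    def forward(self, x):
        # x: (B, L, 3) landmark coordinates normalized to [0, 1]
        feats = []
        for res, table in zip(self.res, self.tables):
            idx = (x * res).long().clamp_(0, res - 1)     # grid vertex ids
            h = torch.zeros_like(idx[..., 0])
            for d, p in enumerate(self.PRIMES):
                h = h ^ (idx[..., d] * p)                 # XOR spatial hash
            feats.append(table(h % self.table_size))
        return torch.cat(feats, dim=-1)   # (B, L, levels * feat_dim)
```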
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. Firstly, we build and train an image model for SFER that incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model for DFER, i.e., S2D, by inserting Temporal-Modeling Adapters (TMAs) into the image model. The MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, and the TMAs capture and model the relationships among dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than +10%) to the original image model. Moreover, we present a novel Emotion-Anchor-based Self-Distillation Loss (anchors are reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve state-of-the-art performance.
https://arxiv.org/abs/2312.05447
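The Emotion-Anchor-based self-distillation can be read as pulling each sample's prediction toward the model's own prediction on its class anchor; the sketch below implements that reading with a temperature-scaled KL term, both of which are assumptions.

```python
import torch.nn.functional as F

def anchor_self_distillation_loss(logits, anchor_logits, labels, tau=2.0):
    """Pull each sample's prediction toward the model's prediction on its
    class anchor via temperature-scaled KL divergence.

    logits: (B, K) batch outputs; anchor_logits: (K, K), row k being the
    model's output on the anchor sample of class k; labels: (B,)."""
    # Anchor predictions act as a detached, per-class teacher.
    teacher = F.softmax(anchor_logits[labels].detach() / tau, dim=1)
    student = F.log_softmax(logits / tau, dim=1)
    return F.kl_div(student, teacher, reduction="batchmean") * tau ** 2
```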
We propose 360° Volumetric Portrait (3VP) Avatar, a novel method for reconstructing 360° photo-realistic portrait avatars of human subjects solely from monocular video inputs. State-of-the-art monocular avatar reconstruction methods rely on stable facial performance capture. However, the common usage of 3DMM-based facial tracking has its limits: side views can hardly be captured, and it fails especially for back views, where required inputs such as facial landmarks or human parsing masks are missing. This results in incomplete avatar reconstructions that only cover the frontal hemisphere. In contrast, we propose template-based tracking of the torso, head, and facial expressions, which allows us to cover the appearance of a human subject from all sides. Thus, given a sequence of a subject rotating in front of a single camera, we train a neural volumetric representation based on neural radiance fields. A key challenge in constructing this representation is modeling appearance changes, especially in the mouth region (i.e., lips and teeth). We therefore propose a deformation-field-based blend basis which allows us to interpolate between different appearance states. We evaluate our approach on captured real-world data and compare against state-of-the-art monocular reconstruction methods. In contrast to those, our method is the first monocular technique that reconstructs an entire 360° avatar.
https://arxiv.org/abs/2312.05311
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics. This work restricts itself to global features, avoiding individual ones, i.e., any feature that could be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of video and audio branches with cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the model's sensitivity to facial expressions within the image in a data-driven way. Extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs competitively on the EmotiW challenge, reaching 79.24% and 75.13% accuracy on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly sampled from the video.
https://arxiv.org/abs/2312.05265
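A minimal sketch of the cross-attention between the two branches: each modality's tokens attend to the other's, with residual updates. Dimensions and the exact block placement are illustrative, not the challenge entry's.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Each modality's tokens attend to the other modality's tokens."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Tv, dim) from the fine-tuned ViT branch;
        # audio_tokens: (B, Ta, dim) from the mel-spectrogram CNN encoder.
        v, _ = self.a2v(video_tokens, audio_tokens, audio_tokens)
        a, _ = self.v2a(audio_tokens, video_tokens, video_tokens)
        return video_tokens + v, audio_tokens + a   # residual updates
```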
Facial landmark tracking for thermal images requires tracking important regions of subjects' faces using thermal imagery, which omits lighting and shading but shows the subjects' temperatures. Heat fluctuations in particular places reflect physiological changes like blood flow and perspiration, which can be used to remotely gauge states such as anxiety and excitement. Past work in this domain has been limited to a very narrow set of architectures and techniques. This work goes further by trying a comprehensive suite of models with different components, such as residual connections, channel- and feature-wise attention, as well as ensembles of network components working in parallel. The best model combines convolutional and residual layers followed by a channel-wise self-attention layer, requiring fewer than 100K parameters.
https://arxiv.org/abs/2311.08308
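To make the parameter budget concrete, here is an illustrative backbone in the spirit of the best model: convolutional layers followed by channel-wise self-attention, totaling roughly 18K parameters, comfortably under 100K. Layer sizes and the landmark count are guesses.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Channel-wise self-attention: channels attend to one another using
    their spatially flattened responses as tokens."""

    def __init__(self, ch):
        super().__init__()
        self.qkv = nn.Conv2d(ch, 3 * ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)    # each (B, C, HW)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        return x + (attn @ v).view(b, c, h, w)              # residual

class ThermalLandmarkNet(nn.Module):
    """Illustrative backbone: conv layers, then channel attention,
    then a regression head for n landmark (x, y) pairs."""

    def __init__(self, n_landmarks=5):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.attn = ChannelSelfAttention(32)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                  nn.Linear(32 * 16, 2 * n_landmarks))

    def forward(self, x):            # x: (B, 1, H, W) thermal frame
        return self.head(self.attn(self.stem(x)))

# ~18K parameters, well under the 100K budget mentioned above:
print(sum(p.numel() for p in ThermalLandmarkNet().parameters()))
```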
In recent years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifact reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the policy that maintains and updates a compact, effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features progressively according to facial landmarks, allowing the restoration of the high-frequency details lost during video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even at high compression rates. Code and pre-trained networks are publicly available at this https URL.
https://arxiv.org/abs/2311.04263
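The abstract highlights the keyframe update policy without detailing it. One plausible policy, sketched below, keeps the top-k reference I-frames by a quality score (e.g., sharpness or bitrate) and evicts the lowest-scoring member when a better frame arrives; the scoring and capacity are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class KeyframeBank:
    """Keep a small set of high-quality reference I-frames of the speaker:
    accept new frames until full, then replace the lowest-scoring member
    whenever a higher-scoring frame arrives."""
    capacity: int = 5
    frames: list = field(default_factory=list)     # [(score, frame), ...]

    def offer(self, frame, score):
        if len(self.frames) < self.capacity:
            self.frames.append((score, frame))
            return
        worst = min(range(len(self.frames)), key=lambda i: self.frames[i][0])
        if score > self.frames[worst][0]:
            self.frames[worst] = (score, frame)    # evict lowest quality

    def references(self):
        # Best-first list of reference frames for the restoration network.
        return [f for _, f in sorted(self.frames, key=lambda t: -t[0])]
```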
The development of various sensing technologies is improving measurements of stress and the well-being of individuals. Although progress has been made with single-signal modalities like wearables and facial emotion recognition, integrating multiple modalities provides a more comprehensive understanding of stress, given that stress manifests differently across people. Multi-modal learning aims to capitalize on the strength of each modality rather than relying on a single signal. Given the complexity of processing and integrating high-dimensional data from limited subjects, more research is needed. Numerous research efforts have focused on fusing stress and emotion signals at an early stage, e.g., feature-level fusion using basic machine learning methods and 1D-CNN methods. This paper proposes a multi-modal learning approach for stress detection that integrates facial landmarks and biometric signals. We test this integration with various early-fusion and late-fusion techniques, combining a 1D CNN over biometric signals with a 2D CNN over facial landmarks. We evaluate these architectures with a rigorous leave-one-subject-out test of generalizability, in which all samples from a single subject are held out of training. Our findings show that late fusion achieved 94.39% accuracy, and early fusion surpassed it with 98.38% accuracy. This research contributes valuable insights into enhancing stress detection through a multi-modal approach.
https://arxiv.org/abs/2311.03606
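The leave-one-subject-out protocol is easy to pin down in code. The sketch below uses a linear classifier on synthetic data as a stand-in for the paper's fused CNNs; only the evaluation mechanism reflects the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def leave_one_subject_out(X, y, subjects, make_model=LogisticRegression):
    """Hold out every subject in turn, train on the rest, average accuracy."""
    accs = []
    for s in np.unique(subjects):
        test = subjects == s
        model = make_model().fit(X[~test], y[~test])
        accs.append(accuracy_score(y[test], model.predict(X[test])))
    return float(np.mean(accs))

# Usage with synthetic placeholder data: 10 subjects, 40 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
y = rng.integers(0, 2, size=400)
subjects = np.repeat(np.arange(10), 40)
print(leave_one_subject_out(X, y, subjects))
```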
Recently, heatmap regression methods based on 1D landmark representations have shown strong performance in locating facial landmarks. However, previous methods have not deeply explored the potential of 1D landmark representations for sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations for facial landmark tracking by capturing the dynamic and geometric patterns of landmarks via token communications in both the temporal and spatial dimensions. For temporal modeling, we propose a recurrent token mixing mechanism, an axis-landmark-positional embedding mechanism, and a confidence-enhanced multi-head attention mechanism to adaptively and robustly embed long-term landmark dynamics into their 1D representations. For structure modeling, we design intra-group and inter-group structure modeling mechanisms to encode component-level as well as global-level facial structure patterns, refining the 1D representations of landmarks through token communications in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and TF databases show that 1DFormer successfully models long-range sequential patterns as well as inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance in facial landmark tracking.
https://arxiv.org/abs/2311.00241
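One way to read "confidence-enhanced multi-head attention" is to bias the attention logits with per-frame landmark confidences so that unreliable frames contribute less to the temporal mixing; the sketch below implements that reading, which may well differ from 1DFormer's actual mechanism.

```python
import torch
import torch.nn as nn

class ConfidenceEnhancedAttention(nn.Module):
    """Multi-head attention whose logits are biased by per-frame landmark
    confidences, down-weighting unreliable frames as attention keys."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, confidence):
        # tokens: (B, T, dim) 1D landmark representations over time;
        # confidence: (B, T) in (0, 1], e.g. heatmap peak values.
        bias = torch.log(confidence.clamp_min(1e-6))        # additive logits
        mask = bias[:, None, :].expand(-1, tokens.size(1), -1)
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)
        return out
```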
The field of animal affective computing is rapidly emerging, and analysis of facial expressions is a crucial aspect. One of the most significant challenges researchers in the field currently face is the scarcity of high-quality, comprehensive datasets that allow the development of models for facial expression analysis. One possible approach is the utilisation of facial landmarks, which has been demonstrated for both humans and animals. In this paper, we present a novel dataset of cat facial images annotated with bounding boxes and 48 facial landmarks grounded in cat facial anatomy. We also introduce a convolutional neural network-based landmark detection model that uses a magnifying ensemble method. Our model shows excellent performance on cat faces and is generalizable to human facial landmark detection.
https://arxiv.org/abs/2310.09793
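The magnifying ensemble is not detailed in the abstract. A plausible reading, sketched below under that assumption, is a coarse-to-fine loop that predicts landmarks, zooms into the region around them, re-predicts, and averages the passes; the crop fraction and the model interface are hypothetical.

```python
import torch
import torch.nn.functional as F

def magnifying_inference(model, image, crop_frac=0.6, rounds=2):
    """Coarse-to-fine landmark inference: predict, zoom in, re-predict.

    model: maps (1, 3, H, W) -> (1, L, 2) landmarks as normalized (x, y)
    in [0, 1]; image: (1, 3, H, W). Both the interface and crop_frac are
    hypothetical choices for this sketch."""
    preds = [model(image)]
    for _ in range(rounds - 1):
        center = preds[-1].mean(dim=1)                         # (1, 2) centroid
        lo = (center - crop_frac / 2).clamp(0, 1 - crop_frac)  # crop origin
        # Affine grid that samples the crop window at full resolution.
        theta = torch.zeros(1, 2, 3, device=image.device)
        theta[0, 0, 0] = theta[0, 1, 1] = crop_frac
        theta[0, :, 2] = 2 * lo[0] + crop_frac - 1
        grid = F.affine_grid(theta, list(image.shape), align_corners=False)
        zoomed = F.grid_sample(image, grid, align_corners=False)
        # Map crop-local predictions back to global coordinates.
        preds.append(model(zoomed) * crop_frac + lo[:, None, :])
    return torch.stack(preds).mean(dim=0)
```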