With the advent of social media, fun selfie filters have come into tremendous mainstream use, affecting the functioning of facial biometric systems as well as image recognition systems. These filters range from beautification and Augmented Reality (AR)-based filters to filters that modify facial landmarks. Hence, there is a need to assess the impact of such filters on the performance of existing face recognition systems. The limitation of existing solutions is that they focus mainly on beautification filters. However, AR-based filters and filters that distort facial key points are currently in vogue and can make faces highly unrecognizable even to the naked eye. Moreover, the filters considered in prior work are mostly obsolete, with limited variation. To mitigate these limitations, we aim to perform a holistic impact analysis of the latest filters and propose a user recognition model for the filtered images. We utilize a benchmark dataset for baseline images and apply the latest filters over them to generate a beautified/filtered dataset. Next, we introduce FaceFilterNet, a model for beautified user recognition. Within this framework, we also use our model to estimate various attributes of the person, including age, gender, and ethnicity. In addition, we present a filter-wise impact analysis on face recognition, age estimation, and gender and ethnicity prediction. The proposed method affirms the efficacy of our dataset with an accuracy of 87.25% on face recognition and optimal accuracy for facial attribute analysis.
https://arxiv.org/abs/2404.08277
The domain of computer vision has experienced significant advancements in facial-landmark detection, which has become increasingly essential across various applications such as augmented reality, facial recognition, and emotion analysis. Unlike object detection or semantic segmentation, which focus on identifying objects and outlining boundaries, facial-landmark detection aims to precisely locate and track critical facial features. However, deploying deep learning-based facial-landmark detection models on embedded systems with limited computational resources poses challenges due to the complexity of facial features, especially in dynamic settings. Additionally, ensuring robustness across diverse ethnicities and expressions presents further obstacles. Existing datasets often lack comprehensive representation of facial nuances, particularly within populations like those in Taiwan. This paper introduces a novel approach to address these challenges through the development of a knowledge distillation method. By transferring knowledge from larger models to smaller ones, we aim to create lightweight yet powerful deep learning models tailored specifically for facial-landmark detection tasks. Our goal is to design models capable of accurately locating facial landmarks under varying conditions, including diverse expressions, orientations, and lighting environments. The ultimate objective is to achieve high accuracy and real-time performance suitable for deployment on embedded systems. The method was successfully implemented and achieved a 6th-place finish among 165 participants in the IEEE ICME 2024 PAIR competition.
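The abstract does not spell out the distillation objective, so the following is only a minimal sketch of response-based distillation for landmark regression: the student fits the ground-truth landmarks while also mimicking the teacher's predictions. The loss weighting `alpha` and the student/teacher callables are assumptions, not the paper's actual design.

```python
# Minimal sketch (not the paper's training recipe) of response-based knowledge
# distillation for landmark regression in PyTorch.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, gt_landmarks, alpha=0.5):
    """Combined loss for one batch; `alpha` balances task and distillation terms."""
    teacher.eval()
    with torch.no_grad():
        teacher_pred = teacher(images)           # (B, num_landmarks * 2)
    student_pred = student(images)               # same output shape as the teacher

    task_loss = F.l1_loss(student_pred, gt_landmarks)       # supervised regression term
    distill_loss = F.mse_loss(student_pred, teacher_pred)   # soft-target term from the teacher
    return alpha * task_loss + (1.0 - alpha) * distill_loss
```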
https://arxiv.org/abs/2404.06029
Automated data labeling techniques are crucial for accelerating the development of deep learning models, particularly in complex medical imaging applications. However, ensuring accuracy and efficiency remains challenging. This paper presents iterative refinement strategies for automated data labeling in facial landmark diagnosis to enhance accuracy and efficiency for deep learning models in medical applications, including dermatology, plastic surgery, and ophthalmology. Leveraging feedback mechanisms and advanced algorithms, our approach iteratively refines initial labels, reducing reliance on manual intervention while improving label quality. Through empirical evaluation and case studies, we demonstrate the effectiveness of our proposed strategies in deep learning tasks across medical imaging domains. Our results highlight the importance of iterative refinement in automated data labeling to enhance the capabilities of deep learning systems in medical imaging applications.
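As an illustration of the general idea (not the paper's specific algorithm), a hypothetical refinement loop might retrain on the current labels, re-predict, and accept an updated landmark only when the model's correction is small enough to be trusted, flagging larger shifts for manual review. The helpers `train_model` and `predict_landmarks` are placeholders.

```python
# Hypothetical sketch of an iterative label-refinement loop for automated
# landmark labeling; the training/prediction helpers are assumed, not real APIs.
import numpy as np

def refine_labels(images, initial_labels, rounds=3, tol=2.0):
    labels = np.asarray(initial_labels, dtype=np.float32)   # (N, num_points, 2)
    for _ in range(rounds):
        model = train_model(images, labels)                  # fit on current labels
        preds = predict_landmarks(model, images)              # (N, num_points, 2)
        # Per-point displacement between current labels and fresh predictions.
        shift = np.linalg.norm(preds - labels, axis=-1)        # (N, num_points)
        # Accept refinements only where the correction is small and plausible;
        # larger jumps would instead be queued for manual review.
        accept = shift < tol
        labels[accept] = preds[accept]
    return labels
```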
https://arxiv.org/abs/2404.05348
The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised for two main reasons: ambiguous condition inputs and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision to the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), which offers a more accurate and controllable end-to-end text-to-image generation framework with more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
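A rough sketch of how a consistency-style loss on the denoised latent could be combined with the standard denoising loss is shown below, assuming a DDPM-style model that predicts noise; the exact formulation of DCL in the paper may differ.

```python
# Sketch: recover the clean-latent estimate implied by the noise prediction at
# a sampled timestep t and supervise it directly, alongside the usual loss.
import torch
import torch.nn.functional as F

def diffusion_losses(model, x0, cond, alphas_cumprod):
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise    # forward diffusion to step t
    eps_pred = model(x_t, t, cond)                       # conditional noise prediction

    denoise_loss = F.mse_loss(eps_pred, noise)
    # Clean latent implied by the predicted noise at step t.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    consistency_loss = F.mse_loss(x0_pred, x0)           # consistency with the input signal
    return denoise_loss, consistency_loss
```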
https://arxiv.org/abs/2403.18417
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at this https URL
https://arxiv.org/abs/2403.17694
Engagement in virtual learning is crucial for a variety of factors, including learner satisfaction, performance, and compliance with learning programs, but measuring it is a challenging task. There is therefore considerable interest in utilizing artificial intelligence and affective computing to measure engagement in natural settings as well as on a large scale. This paper introduces a novel, privacy-preserving method for engagement measurement from videos. It uses facial landmarks, which carry no personally identifiable information, extracted from videos via the MediaPipe deep learning solution. The extracted facial landmarks are fed to a Spatial-Temporal Graph Convolutional Network (ST-GCN) to output the engagement level of the learner in the video. To integrate the ordinal nature of the engagement variable into the training process, ST-GCNs undergo training in a novel ordinal learning framework based on transfer learning. Experimental results on two video student engagement measurement datasets show the superiority of the proposed method over previous methods, setting a new state of the art on the EngageNet dataset with a 3.1% improvement in four-class engagement level classification accuracy and on the Online Student Engagement dataset with a 1.5% improvement in binary engagement classification accuracy. The relatively lightweight ST-GCN and its integration with the real-time MediaPipe deep learning solution make the proposed approach capable of being deployed on virtual learning platforms and measuring engagement in real time.
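A minimal sketch of the privacy-preserving front end is shown below, assuming the legacy MediaPipe FaceMesh solution API: only normalized landmark coordinates leave this function, and a spatial-temporal graph model such as an ST-GCN would then consume the resulting (frames, joints, channels) array.

```python
# Sketch of per-frame facial-landmark extraction with MediaPipe FaceMesh.
import cv2
import numpy as np
import mediapipe as mp

def video_to_landmarks(video_path):
    mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_face_landmarks:
            pts = result.multi_face_landmarks[0].landmark
            # Normalized (x, y, z) for each of the 468 mesh points; no
            # personally identifiable pixels are retained.
            frames.append([[p.x, p.y, p.z] for p in pts])
    cap.release()
    mesh.close()
    return np.asarray(frames)   # shape: (num_frames, 468, 3)
```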
https://arxiv.org/abs/2403.17175
We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as the appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that sharpens attention to small-scale nuances such as eyeball positions. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximal disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
https://arxiv.org/abs/2403.15931
In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks, such as face recognition, when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine-transform generalization to ViTs. RPE, however, can only inject into the model the prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate its effectiveness in improving face recognition performance on low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
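The following is a deliberately simplified sketch of the keypoint-conditioned idea, not the paper's exact formulation: each patch is described by its offsets to the detected keypoints, and a small MLP maps the query/key offsets to an additive bias on the attention logits.

```python
# Simplified sketch of a keypoint-conditioned attention bias in the spirit of KP-RPE.
import torch
import torch.nn as nn

class KeypointBias(nn.Module):
    def __init__(self, num_keypoints, num_heads):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_keypoints, 64), nn.ReLU(), nn.Linear(64, num_heads)
        )

    def forward(self, patch_xy, keypoints_xy):
        # patch_xy: (N, 2) patch centers; keypoints_xy: (K, 2) detected keypoints.
        offsets = patch_xy[:, None, :] - keypoints_xy[None, :, :]       # (N, K, 2)
        feat = offsets.flatten(1)                                        # (N, 2K)
        pair = torch.cat(
            [feat[:, None, :].expand(-1, feat.shape[0], -1),
             feat[None, :, :].expand(feat.shape[0], -1, -1)], dim=-1)    # (N, N, 4K)
        bias = self.mlp(pair)                                            # (N, N, heads)
        return bias.permute(2, 0, 1)   # (heads, N, N), added to the attention logits
```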
https://arxiv.org/abs/2403.14852
We propose Semantic Facial Feature Control (SeFFeC), a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which in turn enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduce a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features as well as more localised and disentangled face edits.
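Because the semantic features are defined by groups of landmarks, they can be measured directly from landmark coordinates, which is what removes the need for manual labels. A toy sketch with hypothetical landmark indices:

```python
# Illustrative measurement of semantic features from landmarks; the index names
# are hypothetical and would come from a concrete landmark layout (e.g. 68-point).
import numpy as np

def feature_measurements(landmarks, idx):
    """landmarks: (num_points, 2) array; idx: dict of named landmark indices."""
    nose_length = np.linalg.norm(landmarks[idx["nose_bridge_top"]]
                                 - landmarks[idx["nose_tip"]])
    mouth_width = np.linalg.norm(landmarks[idx["mouth_left"]]
                                 - landmarks[idx["mouth_right"]])
    return {"nose_length": nose_length, "mouth_width": mouth_width}
```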
https://arxiv.org/abs/2403.13972
With the comprehensive research conducted on various face analysis tasks, there is growing interest among researchers in developing a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor adopts a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent newly introduced semantics. This design enhances the unification of the model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from the optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, matching or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at this https URL.
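One plausible reading of Layer-Attention (an interpretation, not the paper's exact module) is a per-task softmax over the encoder's layers, so each task consumes a learned weighted combination of layer features:

```python
# Minimal sketch of per-task attention over encoder layers.
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    def __init__(self, num_layers, num_tasks):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tasks, num_layers))

    def forward(self, layer_features, task_id):
        # layer_features: list of per-layer tensors, each (B, tokens, dim).
        stacked = torch.stack(layer_features, dim=0)            # (L, B, T, D)
        weights = torch.softmax(self.logits[task_id], dim=0)    # (L,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
```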
https://arxiv.org/abs/2403.09500
In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger number of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by a recent finding that the facial saliency area is critical for face recognition, we construct pretraining augmentations from patches localized by extracted facial landmarks rather than from randomly cropped image blocks. This enables our method, LAndmark-based Facial Self-supervised learning (LAFS), to learn representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With the learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state of the art on multiple face recognition benchmarks, especially in more challenging few-shot scenarios.
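A sketch of the landmark-localized cropping idea is given below: augmentation views are sampled around detected landmarks instead of random image blocks, so positive pairs always cover salient facial regions. The crop size and jitter range are illustrative choices, not the paper's settings.

```python
# Sketch of landmark-localized cropping for self-supervised augmentation.
import numpy as np

def landmark_crop(image, landmarks, crop=64, jitter=8, rng=np.random):
    h, w = image.shape[:2]
    x, y = landmarks[rng.randint(len(landmarks))]              # pick one landmark
    x += rng.randint(-jitter, jitter + 1)                       # small positional jitter
    y += rng.randint(-jitter, jitter + 1)
    x0 = int(np.clip(x - crop // 2, 0, w - crop))
    y0 = int(np.clip(y - crop // 2, 0, h - crop))
    return image[y0:y0 + crop, x0:x0 + crop]                    # view centered on a salient region
```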
https://arxiv.org/abs/2403.08161
Multimodal deep learning methods capture synergistic features from multiple modalities and have the potential to improve accuracy for stress detection compared to unimodal methods. However, this accuracy gain typically comes at a high computational cost due to the high-dimensional feature spaces, especially for intermediate fusion. Dimensionality reduction is one way to optimize multimodal learning by simplifying data and making the features more amenable to processing and analysis, thereby reducing computational complexity. This paper introduces an intermediate multimodal fusion network with manifold-learning-based dimensionality reduction. The multimodal network generates independent representations from biometric signals and facial landmarks through a 1D-CNN and a 2D-CNN. Finally, these features are fused and fed to another 1D-CNN layer, followed by a fully connected dense layer. We compared various dimensionality reduction techniques for different variations of unimodal and multimodal networks. We observe that intermediate-level fusion with the Multi-Dimensional Scaling (MDS) manifold method showed the most promising results among the dimensionality reduction methods, with an accuracy of 96.00% in a Leave-One-Subject-Out Cross-Validation (LOSO-CV) paradigm. MDS had the highest computational cost among the manifold learning methods. However, while outperforming other networks, it reduced the computational cost of the proposed networks by 25% compared to six well-known conventional feature selection methods used in the preprocessing step.
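As a sketch of the dimensionality reduction step, the fused biosignal and landmark representations can be projected to a low-dimensional space before the final fusion layers. This uses scikit-learn's MDS as an illustration (classical MDS offers no out-of-sample transform, so it is fit on the pooled intermediate features here); the paper's exact pipeline may differ.

```python
# Sketch: manifold-based dimensionality reduction on fused intermediate features.
import numpy as np
from sklearn.manifold import MDS

def reduce_fused_features(biosignal_feats, landmark_feats, n_components=32):
    fused = np.concatenate([biosignal_feats, landmark_feats], axis=1)   # (N, D1 + D2)
    embedding = MDS(n_components=n_components)
    return embedding.fit_transform(fused)                               # (N, n_components)
```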
https://arxiv.org/abs/2403.08077
In this paper, we consider a novel and practical case for talking face video generation. Specifically, we focus on scenarios involving multi-person interactions, where the talking context, such as an audience or surroundings, is present. In these situations, the video generation should take the context into consideration in order to generate video content naturally aligned with the driving audio and spatially coherent with the context. To achieve this, we provide a two-stage, cross-modal controllable video generation pipeline that takes facial landmarks as an explicit and compact control signal to bridge the driving audio, the talking context, and the generated videos. Inside this pipeline, we devise a 3D video diffusion model, allowing for efficient control of both spatial conditions (landmarks and context video) and the audio condition for temporally coherent generation. The experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity, and frame consistency.
https://arxiv.org/abs/2402.18092
In this work, we tackle the challenge of enhancing the realism and expressiveness of talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism.
https://arxiv.org/abs/2402.17485
Video conferencing has attracted increasing attention recently. High fidelity and low bandwidth are two major objectives of video compression for video conferencing applications. Most pioneering methods rely on classic video compression codecs without high-level feature embedding and thus cannot reach extremely low bandwidths. Recent works instead employ model-based neural compression to acquire ultra-low bitrates using sparse representations of each frame, such as facial landmark information, but these approaches cannot maintain high fidelity due to 2D image-based warping. In this paper, we propose a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing that uses implicit radiance fields to achieve both major objectives. We leverage dynamic neural radiance fields to reconstruct a high-fidelity talking head from expression features, which are transmitted in place of full frames. The overall system employs a deep model to encode expression features at the sender and reconstruct the portrait at the receiver, with volume rendering as the decoder, for ultra-low bandwidth. In particular, owing to the characteristics of the neural-radiance-fields-based model, our compression approach is resolution-agnostic, which means that the low bandwidth achieved by our approach is independent of video resolution, while maintaining fidelity for higher-resolution reconstruction. Experimental results demonstrate that our novel framework can (1) construct ultra-low-bandwidth video conferencing, (2) maintain high-fidelity portraits, and (3) achieve better performance on high-resolution video compression than previous works.
https://arxiv.org/abs/2402.16599
In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose a novel emotion-aware AVSE system that leverages both auditory and visual information. It extracts emotional features from the facial landmarks of the speaker and fuses them with corresponding audio and visual modalities. This enriched data serves as input to a deep UNet-based encoder-decoder network, specifically designed to orchestrate the fusion of multimodal information enhanced with emotion. The network iteratively refines the enhanced speech representation through an encoder-decoder architecture, guided by perceptually-inspired loss functions for joint learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of audio-visual recordings with annotated emotions. Our comprehensive evaluation demonstrates the effectiveness of emotion as a contextual cue for AVSE. By integrating emotional features, the proposed system achieves significant improvements in both objective and subjective assessments of speech quality and intelligibility, especially in challenging noise environments. Compared to baseline AVSE and audio-only speech enhancement systems, our approach exhibits a noticeable increase in PESQ and STOI, indicating higher perceptual quality and intelligibility. Large-scale listening tests corroborate these findings, suggesting improved human understanding of enhanced speech.
https://arxiv.org/abs/2402.16394
Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, continues to be a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese-architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
https://arxiv.org/abs/2402.15044
Facial video inpainting plays a crucial role in a wide range of applications, including but not limited to the removal of obstructions in video conferencing and telemedicine, enhancement of facial expression analysis, privacy protection, integration of graphical overlays, and virtual makeup. This domain presents serious challenges due to the intricate nature of facial features and the inherent human familiarity with faces, heightening the need for accurate and persuasive completions. In addressing challenges specifically related to occlusion removal in this context, our focus is on the progressive task of generating complete images from facial data covered by masks, ensuring both spatial and temporal coherence. Our study introduces a network designed for expression-based video inpainting, employing generative adversarial networks (GANs) to handle static and moving occlusions across all frames. By utilizing facial landmarks and an occlusion-free reference image, our model maintains the user's identity consistently across frames. We further enhance emotional preservation through a customized facial expression recognition (FER) loss function, ensuring detailed inpainted outputs. Our proposed framework adaptively eliminates occlusions from facial videos, whether they appear static or dynamic across frames, while providing realistic and coherent results.
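A hypothetical sketch of such an expression-preserving term is shown below: a frozen FER network scores both the inpainted frame and the occlusion-free reference, and the loss penalizes divergence between the two predicted expression distributions. The `fer_model` here is assumed, not the paper's released network.

```python
# Sketch of an expression-preserving loss built on a frozen FER classifier.
import torch
import torch.nn.functional as F

def fer_loss(fer_model, inpainted, reference):
    fer_model.eval()
    with torch.no_grad():
        target = F.softmax(fer_model(reference), dim=1)    # expression distribution of the reference
    pred = F.log_softmax(fer_model(inpainted), dim=1)      # gradients flow back to the inpainted frame
    return F.kl_div(pred, target, reduction="batchmean")
```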
https://arxiv.org/abs/2402.09100
Criminal and suspicious activity detection has become a popular research topic in recent years. The rapid growth of computer vision technologies has had a crucial impact on solving this issue. However, physical stalking detection is still a less explored area despite the evolution of modern technology. Nowadays, stalking in public places has become a common occurrence, with women being the most affected. Stalking is a visible action that usually precedes a crime, as the stalker follows, loiters around, and stares at the victim before committing offenses such as assault, kidnapping, or rape. Therefore, detecting stalking has become a necessity, as these crimes can be prevented in the first place through stalking detection. In this research, we propose a novel deep learning-based hybrid fusion model to detect potential stalkers from a single video with a minimal number of frames. We extract multiple relevant features, such as facial landmarks, head pose estimation, and relative distance, as numerical values from video frames. These data are fed into a multilayer perceptron (MLP) to perform a classification task between a stalking and a non-stalking scenario. Simultaneously, the video frames are fed into a combination of convolutional and LSTM models to extract spatio-temporal features. We use a fusion of these numerical and spatio-temporal features to build a classifier for detecting stalking incidents. Additionally, we introduce a dataset consisting of stalking and non-stalking videos gathered from various feature films and television series, which is also used to train the model. The experimental results show the efficiency and dynamism of our proposed stalker detection system, achieving 89.58% testing accuracy, a significant improvement over state-of-the-art approaches.
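A simplified sketch of the hybrid fusion idea follows (layer sizes and branch details are illustrative, not the paper's configuration): numerical cues go through an MLP, frames go through a CNN+LSTM branch, and the concatenated features feed a binary stalking/non-stalking head.

```python
# Sketch of a hybrid fusion classifier over numerical and spatio-temporal features.
import torch
import torch.nn as nn

class HybridStalkingClassifier(nn.Module):
    def __init__(self, num_numeric, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_numeric, hidden), nn.ReLU())
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, numeric, frames):
        # numeric: (B, num_numeric); frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        spatial = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # per-frame CNN features
        _, (h, _) = self.lstm(spatial)                             # temporal summary, (1, B, hidden)
        fused = torch.cat([self.mlp(numeric), h[-1]], dim=1)       # numerical + spatio-temporal fusion
        return self.head(fused)                                    # stalking vs non-stalking logits
```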
https://arxiv.org/abs/2402.03417
Human face generation and editing represent an essential task in the era of computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance, using face segmentation to guide image generation. However, it may be challenging for some users to create these conditioning modalities manually. Thus, we introduce M3Face, a unified multi-modal multilingual framework for controllable face generation and editing. This framework enables users to utilize only text input to automatically generate controlling modalities, for instance, semantic segmentation or facial landmarks, and subsequently generate face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and different captions for each image in multiple languages. The code and the dataset will be released upon publication.
https://arxiv.org/abs/2402.02369