Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation based on a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs the intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to generate photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experimental results show that our method achieves better results than recent state-of-the-art approaches both qualitatively and quantitatively.
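As an illustration of the data flow described above, the sketch below wires up three toy PyTorch modules: an audio-to-landmark predictor, a style encoder over the reference frames, and a style-aware generator that renders a frame from the static image, the landmarks, and the style code. All module names, layer sizes, and the mean-pooled style code are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Predicts a per-frame facial landmark sequence from audio features (MFCC-like input assumed)."""
    def __init__(self, audio_dim=80, hidden=256, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, audio):                    # audio: (B, T, audio_dim)
        h, _ = self.rnn(audio)
        return self.head(h)                      # (B, T, n_landmarks*2)


class StyleEncoder(nn.Module):
    """Encodes a set of style reference frames into one style code (mean pooling assumed)."""
    def __init__(self, style_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, style_dim),
        )

    def forward(self, refs):                     # refs: (B, K, 3, H, W) style reference frames
        b, k = refs.shape[:2]
        codes = self.cnn(refs.flatten(0, 1)).view(b, k, -1)
        return codes.mean(dim=1)                 # (B, style_dim)


class StyleAwareGenerator(nn.Module):
    """Renders one frame from the static source image, predicted landmarks, and the style code."""
    def __init__(self, n_landmarks=68, style_dim=128, cond_ch=16):
        super().__init__()
        self.to_cond = nn.Linear(n_landmarks * 2 + style_dim, cond_ch)
        self.render = nn.Sequential(
            nn.Conv2d(3 + cond_ch, 32, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh(),
        )

    def forward(self, src, landmarks, style):    # src: (B, 3, H, W), landmarks: (B, n_landmarks*2)
        cond = self.to_cond(torch.cat([landmarks, style], dim=-1))
        cond = cond[:, :, None, None].expand(-1, -1, *src.shape[-2:])
        return self.render(torch.cat([src, cond], dim=1))
```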
https://arxiv.org/abs/2303.09799
Most facial landmark detection methods predict landmarks by mapping the input facial appearance features to landmark heatmaps and have achieved promising results. However, when the face image suffers from large poses, heavy occlusions and complicated illuminations, they cannot learn discriminative feature representations and effective facial shape constraints, nor can they accurately predict the value of each element in the landmark heatmap, limiting their detection accuracy. To address this problem, we propose a novel Reference Heatmap Transformer (RHT) that introduces reference heatmap information for more precise facial landmark detection. The proposed RHT consists of a Soft Transformation Module (STM) and a Hard Transformation Module (HTM), which cooperate with each other to encourage the accurate transformation of the reference heatmap information and facial shape constraints. Then, a Multi-Scale Feature Fusion Module (MSFFM) is proposed to fuse the transformed heatmap features with the semantic features learned from the original face images, enhancing the feature representations used to produce more accurate target heatmaps. To the best of our knowledge, this is the first study to explore how to enhance facial landmark detection by transforming reference heatmap information. Experimental results on challenging benchmark datasets demonstrate that our proposed method outperforms state-of-the-art methods in the literature.
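A minimal sketch of the fusion step (what the abstract calls MSFFM): transformed reference-heatmap features are concatenated with image features at two scales and regressed into target heatmaps. The layer choices, the two-scale setup, and the assumption that the STM/HTM outputs are already given are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFeatureFusion(nn.Module):
    """Toy stand-in for MSFFM: fuses transformed reference-heatmap features with image
    features at two scales and regresses target landmark heatmaps."""
    def __init__(self, heat_ch=68, img_ch=64):
        super().__init__()
        self.fuse_lo = nn.Conv2d(heat_ch + img_ch, img_ch, 3, padding=1)
        self.fuse_hi = nn.Conv2d(heat_ch + img_ch, img_ch, 3, padding=1)
        self.head = nn.Conv2d(img_ch, heat_ch, 1)

    def forward(self, ref_heat, img_lo, img_hi):
        # ref_heat: (B, 68, H, W) reference heatmap features, assumed already transformed by STM/HTM
        # img_lo / img_hi: image semantic features at low / high resolution
        lo = F.relu(self.fuse_lo(torch.cat([F.interpolate(ref_heat, img_lo.shape[-2:]), img_lo], 1)))
        lo_up = F.interpolate(lo, img_hi.shape[-2:], mode="bilinear", align_corners=False)
        hi = F.relu(self.fuse_hi(torch.cat([F.interpolate(ref_heat, img_hi.shape[-2:]), img_hi], 1)))
        return torch.sigmoid(self.head(hi + lo_up))   # (B, 68, H_hi, W_hi) target heatmaps
```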
https://arxiv.org/abs/2303.07840
One of the key issues in facial expression recognition in the wild (FER-W) is that curating large-scale labeled facial images is challenging due to the inherent complexity and ambiguity of facial images. Therefore, in this paper, we propose a self-supervised simple facial landmark encoding (SimFLE) method that can learn effective encoding of facial landmarks, which are important features for improving the performance of FER-W, without expensive labels. Specifically, we introduce a novel FaceMAE module for this purpose. FaceMAE reconstructs masked facial images with elaborately designed semantic masking. Unlike previous random masking, semantic masking is conducted based on channel information processed in the backbone, so the rich semantics of channels can be explored. Additionally, the semantic masking process is fully trainable, enabling FaceMAE to guide the backbone to learn the spatial details and contextual properties of fine-grained facial landmarks. Experimental results on several FER-W benchmarks show that the proposed SimFLE is superior in facial landmark localization and noticeably improves performance compared to the supervised baseline and other self-supervised methods.
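The sketch below illustrates the channel-driven, trainable masking idea in a heavily simplified form: a learnable projection scores each spatial patch from backbone channel statistics, and the highest-scoring patches are masked for MAE-style reconstruction. The hard top-k selection shown here is a simplification; FaceMAE's actual fully trainable masking mechanism may differ.

```python
import torch
import torch.nn as nn


class SemanticMasking(nn.Module):
    """Illustrative semantic masking: score each patch from backbone channel statistics with
    a learnable projection, then mask the highest-scoring patches."""
    def __init__(self, channels, mask_ratio=0.5):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # trainable channel mixing
        self.mask_ratio = mask_ratio

    def forward(self, feat):                      # feat: (B, C, h, w) from the backbone
        b, _, h, w = feat.shape
        scores = self.score(feat).flatten(1)      # (B, h*w) per-patch semantic score
        n_mask = int(self.mask_ratio * h * w)
        idx = scores.topk(n_mask, dim=1).indices  # patches to mask (highest score)
        mask = torch.zeros_like(scores)
        mask.scatter_(1, idx, 1.0)                # 1 = masked patch
        return mask.view(b, 1, h, w)              # broadcastable patch mask
```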
https://arxiv.org/abs/2303.07648
The development of human-machine interfaces has become a necessity for modern machines to enable greater autonomy and efficiency. Gaze-driven human intervention is an effective and convenient option for creating an interface that alleviates human errors. Facial landmark detection is crucial for designing a robust gaze detection system. Regression-based methods provide good spatial localization of the landmarks corresponding to different parts of the face, but there is still scope for improvement, which we address by incorporating attention. In this paper, we propose a deep coarse-to-fine architecture called LocalEyenet that localizes only the eye regions and can be trained end-to-end. The model architecture, built on a stacked hourglass backbone, learns self-attention over feature maps, which helps preserve global as well as local spatial dependencies in the face image. We incorporate deep layer aggregation in each hourglass to minimize the loss of attention over the depth of the architecture. Our model shows good generalization ability in cross-dataset evaluation and in real-time localization of eyes.
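A generic sketch of self-attention over an hourglass feature map, the ingredient the abstract highlights for preserving global and local spatial dependencies; this is a standard non-local-style block, not LocalEyenet's exact design.

```python
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """Self-attention over a feature map: flatten spatial positions into tokens, attend, reshape back."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat):                       # feat: (B, C, H, W) hourglass feature map
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)           # residual + norm
        return tokens.transpose(1, 2).view(b, c, h, w)
```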
https://arxiv.org/abs/2303.12728
Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation are a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds them to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation compared to the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from the speech of speakers without facial video data, or even without speech data.
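A toy sketch of the landmark decoder side of the pipeline: a temporal model maps the latent sequence shared by text and speech to per-frame landmark coordinates. Dimensions, the GRU choice, and the landmark count are assumptions.

```python
import torch
import torch.nn as nn


class LandmarkDecoder(nn.Module):
    """Decodes per-frame facial landmarks from a sequence of latent representations that are
    shared by text and speech (e.g., TTS encoder outputs)."""
    def __init__(self, latent_dim=256, n_landmarks=68):
        super().__init__()
        self.temporal = nn.GRU(latent_dim, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, n_landmarks * 2)

    def forward(self, latents):          # latents: (B, T, latent_dim) from the shared latent space
        h, _ = self.temporal(latents)
        return self.proj(h)              # (B, T, 68*2) landmark coordinates per frame
```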
https://arxiv.org/abs/2302.14337
This paper explores automated face and facial landmark detection for neonates, which is an important first step in many video-based neonatal health applications, such as vital sign estimation, pain assessment, sleep-wake classification, and jaundice detection. Utilising three publicly available datasets of neonates in the clinical environment, 366 images (258 subjects) and 89 images (66 subjects) were annotated for training and testing, respectively. Transfer learning was applied to two YOLO-based models, with input training images augmented with random horizontal flipping, photometric colour distortion, translation and scaling during each training epoch. Additionally, the re-orientation of input images and the fusion of trained deep learning models were explored. Our proposed model based on YOLOv7Face outperformed existing methods with a mean average precision of 84.8% for face detection and a normalised mean error of 0.072 for facial landmark detection. Overall, this will assist in the development of fully automated neonatal health assessment algorithms.
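For the augmentation described above (random horizontal flipping, photometric colour distortion, translation, and scaling), a possible pipeline using albumentations is sketched below; the parameter values are assumptions, and after a horizontal flip the left/right landmark indices would still need to be swapped separately.

```python
import albumentations as A

# Illustrative training-time augmentation keeping face boxes and landmarks consistent with the image.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                           # note: re-order left/right landmarks after flip
        A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=0, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

# Usage per training sample (img: HxWx3 array, face_boxes: [[x1, y1, x2, y2], ...], landmarks: [(x, y), ...]):
# out = augment(image=img, bboxes=face_boxes, labels=[0] * len(face_boxes), keypoints=landmarks)
# img_aug, boxes_aug, lms_aug = out["image"], out["bboxes"], out["keypoints"]
```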
https://arxiv.org/abs/2302.04341
Lipreading refers to understanding and further translating the speech of a speaker in a video into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers that appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation due to the limited number of speakers in the training bank and the evident visual variations caused by the shape and color of different speakers' lips. Merely depending on the visible changes of the lips therefore tends to cause model overfitting. To address this problem, we propose to use multi-modal features across the visual and landmark modalities, which can describe the lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion. The embeddings of the two streams are produced by self-attention and fed to a cross-attention module to achieve alignment between visuals and landmarks. Finally, the resulting fused features are decoded into output text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
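A condensed sketch of the fusion stage described above: per-stream self-attention embeddings followed by cross-attention in which visual queries attend to landmark keys/values. Dimensions, depth, and the choice of which stream provides the queries are assumptions.

```python
import torch
import torch.nn as nn


class VisualLandmarkFusion(nn.Module):
    """LipFormer-style fusion sketch: self-attention per stream, then cross-attention aligning
    the visual (lip motion) stream with the landmark stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_lmk = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lmk):                 # (B, T, dim) lip-motion and landmark sequences
        vis, _ = self.self_vis(vis, vis, vis)    # self-attention per stream
        lmk, _ = self.self_lmk(lmk, lmk, lmk)
        fused, _ = self.cross(vis, lmk, lmk)     # visual queries attend to landmark features
        return fused                             # fed to a cascaded seq2seq decoder downstream
```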
https://arxiv.org/abs/2302.02141
Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction. POSTER V1 achieves state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features through a two-stream pyramid cross-fusion design. However, the architecture of POSTER V1 is undeniably complex and incurs expensive computational costs. To relieve the computational pressure of POSTER V1, in this paper we propose POSTER V2. It improves POSTER V1 in three directions: cross-fusion, the two-stream design, and multi-scale feature extraction. In cross-fusion, we use a window-based cross-attention mechanism in place of the vanilla cross-attention mechanism. We remove the image-to-landmark branch in the two-stream design. For multi-scale feature extraction, POSTER V2 combines image features with the landmarks' multi-scale features to replace POSTER V1's pyramid design. Extensive experiments on several standard datasets show that POSTER V2 achieves SOTA FER performance at minimal computational cost. For example, POSTER V2 reaches 92.21\% on RAF-DB, 67.49\% on AffectNet (7 cls) and 63.77\% on AffectNet (8 cls), respectively, using only 8.4G floating point operations (FLOPs) and 43.7M parameters (Param). This demonstrates the effectiveness of our improvements. The code and models are available at ~\url{this https URL}.
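A simplified sketch of window-based cross-attention, the replacement for vanilla cross-attention mentioned above: tokens are split into fixed-size windows and landmark tokens attend to image tokens only inside the same window, which reduces the attention cost. The 1D windowing over the token axis is a simplification of whatever windowing POSTER V2 actually uses.

```python
import torch
import torch.nn as nn


class WindowCrossAttention(nn.Module):
    """Cross-attention restricted to fixed-size windows along the token axis."""
    def __init__(self, dim=256, heads=4, window=49):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lmk_tokens, img_tokens):      # both: (B, N, dim), N divisible by window
        b, n, c = lmk_tokens.shape
        w = self.window
        q = lmk_tokens.reshape(b * n // w, w, c)    # landmark tokens grouped into windows
        kv = img_tokens.reshape(b * n // w, w, c)   # matching image-token windows
        out, _ = self.attn(q, kv, kv)               # cross-attention inside each window only
        return out.reshape(b, n, c)
```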
https://arxiv.org/abs/2301.12149
Craniofacial Superimposition involves superimposing an image of a skull onto a number of ante-mortem face images of an individual and analyzing their morphological correspondence. Despite having been used for a century, it is not yet a mature and fully accepted technique due to the absence of solid scientific approaches, significant reliability studies, and international standards. In this paper we present a comprehensive experimental study of the limitations of Craniofacial Superimposition as a forensic identification technique. The study involves different experiments over more than one million comparisons performed by a landmark-based automatic 3D/2D superimposition method. The total sample analyzed consists of 320 subjects and 29 craniofacial landmarks.
https://arxiv.org/abs/2301.09461
This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our approach uses different parameters and signals obtained from behavior and physiological processes that have been related to modeling cognitive load, such as facial gestures (e.g., blink rate, facial action units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules on the task of estimating a student's attention level captured during online e-learning sessions, training a binary classifier (high or low attention) based on Support Vector Machines (SVM) for each module. Second, we examine to what extent multimodal score-level fusion improves the attention level estimation. The experimental framework uses the mEBAL database, a public multimodal database for attention level estimation collected in an e-learning environment, which contains data from 38 users performing several e-learning tasks of variable difficulty (inducing changes in the students' cognitive load).
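A toy, runnable version of the two evaluation steps (per-module binary SVMs followed by score-level fusion), using random features as stand-ins for the real per-module descriptors; the averaging fusion rule is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

# One binary SVM (high/low attention) per module, then score-level fusion by averaging scores.
rng = np.random.default_rng(0)
y_train, y_val = rng.integers(0, 2, 200), rng.integers(0, 2, 50)
module_names = ["blink", "head_pose", "landmarks", "expression"]
modules_train = {m: rng.normal(size=(200, 16)) for m in module_names}
modules_val = {m: rng.normal(size=(50, 16)) for m in module_names}

scores = []
for name, X in modules_train.items():
    clf = SVC(kernel="rbf", probability=True).fit(X, y_train)   # per-module classifier
    scores.append(clf.predict_proba(modules_val[name])[:, 1])   # P(high attention) per sample

fused = np.mean(scores, axis=0)               # multimodal score-level fusion
attention_pred = (fused > 0.5).astype(int)    # final high/low attention decision
print(attention_pred[:10])
```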
https://arxiv.org/abs/2301.09174
In this paper we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording, without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off-the-shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.
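A minimal sketch of the conditioning idea: a placeholder denoiser receives the noisy frame, the diffusion timestep, and audio spectral features, and predicts the added noise. The architecture is illustrative only and not the paper's model.

```python
import torch
import torch.nn as nn


class AudioConditionedDenoiser(nn.Module):
    """Placeholder denoiser: predicts the noise added to a frame, conditioned on timestep and audio."""
    def __init__(self, audio_dim=80, t_dim=32):
        super().__init__()
        self.t_embed = nn.Embedding(1000, t_dim)          # discrete diffusion timesteps 0..999
        self.audio_proj = nn.Linear(audio_dim, t_dim)
        self.net = nn.Sequential(
            nn.Conv2d(3 + 2 * t_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy, t, audio):                   # noisy: (B,3,H,W), t: (B,), audio: (B, audio_dim)
        cond = torch.cat([self.t_embed(t), self.audio_proj(audio)], dim=-1)
        cond = cond[:, :, None, None].expand(-1, -1, *noisy.shape[-2:])
        return self.net(torch.cat([noisy, cond], dim=1))  # predicted noise

# Standard DDPM-style training step (schedule assumed):
#   x_t = sqrt(a_bar[t]) * x0 + sqrt(1 - a_bar[t]) * eps
#   loss = mse(model(x_t, t, audio_features), eps)
```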
https://arxiv.org/abs/2301.04474
Recent years have witnessed significant growth in face alignment. Though dense facial landmarks are highly demanded in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich the landmark density of existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. We then propose a weakly-supervised idea of learning the refinement ability on the original sparse landmarks and adapting this ability to the enriched dense landmarks, and devise several operators that are organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to existing face alignment networks. To evaluate our method, we manually label the dense landmarks on the 300W test set. Our method yields state-of-the-art accuracy not only on the newly constructed dense 300W test set but also on the original sparse 300W and WFLW test sets, without additional cost.
https://arxiv.org/abs/2212.09525
In this paper, we investigate the problem of multi-domain translation: given an element $a$ of domain $A$, we would like to generate a corresponding sample $b$ in another domain $B$, and vice versa. Since acquiring supervision in multiple domains can be a tedious task, we propose to learn this translation between domains when supervision is available as a pair $(a,b)\sim A\times B$, while leveraging possible unpaired data when only $a\sim A$ or only $b\sim B$ is available. We introduce a new unified framework called Latent Space Mapping that exploits the manifold assumption in order to learn a latent space for each domain. Unlike existing approaches, we propose to further regularize each latent space using the available domains by learning each dependency between pairs of domains. We evaluate our approach on three tasks: i) image translation on a synthetic dataset, ii) the real-world task of semantic segmentation for medical images, and iii) the real-world task of facial landmark detection.
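A toy sketch of the paired/unpaired training setup: each domain has its own encoder/decoder and latent space, plus a learned mapping between latent spaces; unpaired data contributes reconstruction losses, paired data contributes latent alignment and translation losses. The linear layers and loss terms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentSpaceMapping(nn.Module):
    """Toy paired/unpaired multi-domain translation: per-domain autoencoders plus a latent mapping."""
    def __init__(self, dim_a=32, dim_b=32, latent=16):
        super().__init__()
        self.enc_a, self.dec_a = nn.Linear(dim_a, latent), nn.Linear(latent, dim_a)
        self.enc_b, self.dec_b = nn.Linear(dim_b, latent), nn.Linear(latent, dim_b)
        self.map_ab = nn.Linear(latent, latent)    # learned dependency from A's latent to B's latent

    def loss_unpaired(self, a=None, b=None):
        loss = 0.0
        if a is not None:
            loss = loss + F.mse_loss(self.dec_a(self.enc_a(a)), a)   # A-only reconstruction
        if b is not None:
            loss = loss + F.mse_loss(self.dec_b(self.enc_b(b)), b)   # B-only reconstruction
        return loss

    def loss_paired(self, a, b):
        za, zb = self.enc_a(a), self.enc_b(b)
        return F.mse_loss(self.map_ab(za), zb) + F.mse_loss(self.dec_b(self.map_ab(za)), b)

    def translate_a_to_b(self, a):
        return self.dec_b(self.map_ab(self.enc_a(a)))
```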
https://arxiv.org/abs/2212.03361
Holistic methods using CNNs and margin-based losses have dominated research on face recognition. In this work, we depart from this setting in two ways: (a) we employ the Vision Transformer as an architecture for training a very strong baseline for face recognition, simply called fViT, which already surpasses most state-of-the-art face recognition methods; (b) we capitalize on the Transformer's inherent property of processing information (visual tokens) extracted from irregular grids to devise a pipeline for face recognition that is reminiscent of part-based face recognition methods. Our pipeline, called part fViT, simply comprises a lightweight network that predicts the coordinates of facial landmarks, followed by the Vision Transformer operating on patches extracted at the predicted landmarks, and it is trained end-to-end with no landmark supervision. By learning to extract discriminative patches, our part-based Transformer further boosts the accuracy of our Vision Transformer baseline, achieving state-of-the-art accuracy on several face recognition benchmarks.
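The differentiable extraction of patches at predicted landmarks can be sketched with grid_sample as below; patch size, extent, and the overall wiring are assumptions, and the lightweight landmark predictor and the ViT over the resulting patch tokens are omitted as standard components.

```python
import torch
import torch.nn.functional as F


def extract_patches(img, landmarks, patch=16, extent=0.25):
    """Differentiably crop one patch per predicted landmark.
    img: (B, 3, H, W); landmarks: (B, L, 2) in [-1, 1] normalized xy coordinates."""
    b, l, _ = landmarks.shape
    lin = torch.linspace(-extent, extent, patch, device=img.device)
    dy, dx = torch.meshgrid(lin, lin, indexing="ij")
    offsets = torch.stack([dx, dy], dim=-1)                      # (patch, patch, 2) xy offsets
    grid = landmarks.view(b, l, 1, 1, 2) + offsets               # (B, L, patch, patch, 2)
    grid = grid.view(b * l, patch, patch, 2)
    patches = F.grid_sample(
        img.repeat_interleave(l, dim=0), grid, align_corners=False
    )                                                            # (B*L, 3, patch, patch)
    return patches.view(b, l, 3, patch, patch)

# A lightweight coordinate-regression network followed by a ViT treating each extracted patch
# as a token would complete the part fViT pipeline sketched above.
```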
https://arxiv.org/abs/2212.00057
This paper presents the first significant work on directly predicting 3D face landmarks on neural radiance fields (NeRFs), without using intermediate representations such as 2D images, depth maps, or point clouds. Our 3D coarse-to-fine Face Landmarks NeRF (FLNeRF) model efficiently samples from the NeRF on the whole face together with individual facial features for accurate landmarks. To mitigate the limited number of facial expressions in the available data, a local and non-linear NeRF warp is applied at facial features at fine scale to simulate a large range of emotions, including exaggerated facial expressions (e.g., cheek blowing, wide mouth opening, eye blinking), for training FLNeRF. With such expression augmentation, our model can predict 3D landmarks not limited to the 20 discrete expressions given in the data. Robust 3D NeRF facial landmarks contribute to many downstream tasks. As an example, we modify MoFaNeRF to enable high-quality face editing and swapping using face landmarks on NeRF, allowing more direct control and a wider range of complex expressions. Experiments show that the improved model using landmarks achieves comparable or better results. Github link: this https URL.
https://arxiv.org/abs/2211.11202
Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACEx, which uses speech and a single image to generate high-resolution, expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACEx also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pair-wise comparisons. The project website is available at this https URL
https://arxiv.org/abs/2211.09809
Around 40 percent of accidents related to driving on highways in India occur because the driver falls asleep behind the steering wheel. Several lines of research aim to detect driver drowsiness, but they suffer from the complexity and cost of their models. In this paper, we propose SleepyWheels, a method that uses a lightweight neural network in conjunction with facial landmark detection to identify driver fatigue in real time. SleepyWheels is successful in a wide range of test scenarios, including the absence of facial characteristics when the eyes or mouth are covered, varying driver skin tones, camera placements, and observation angles. It also works well when deployed in real-time systems. SleepyWheels uses EfficientNetV2 together with a facial landmark detector for drowsiness detection. The model is trained on a specially created driver sleepiness dataset and achieves an accuracy of 97 percent. Because the model is lightweight, it can further be deployed as a mobile application on various platforms.
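One common landmark-based drowsiness cue is the eye aspect ratio (EAR), which drops when the eyes close; the snippet below shows this cue as an illustration only, since the abstract does not state which landmark-derived features SleepyWheels uses.

```python
import numpy as np


def eye_aspect_ratio(eye):
    """eye: (6, 2) array of one eye's landmarks in the common 6-point ordering
    (corners at indices 0 and 3, upper lid at 1-2, lower lid at 4-5)."""
    v1 = np.linalg.norm(eye[1] - eye[5])      # vertical lid distances
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])       # horizontal corner distance
    return (v1 + v2) / (2.0 * h)


def is_drowsy(ear_history, threshold=0.2, min_frames=15):
    """Flag drowsiness if the EAR stays below the threshold for a sustained run of frames."""
    recent = ear_history[-min_frames:]
    return len(recent) == min_frames and all(e < threshold for e in recent)
```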
https://arxiv.org/abs/2211.00718
We apply computer vision pose estimation techniques developed expressly for the data-scarce infant domain to the study of torticollis, a common condition in infants for which early identification and treatment is critical. Specifically, we use a combination of facial landmark and body joint estimation techniques designed for infants to estimate a range of geometric measures pertaining to face and upper body symmetry, drawn from an array of sources in the physical therapy and ophthalmology research literature on torticollis. We gauge performance with a range of metrics and show that the estimates of most of these geometric measures are successful, yielding very strong to strong Spearman's $\rho$ correlation with ground truth values. Furthermore, we show that these estimates, derived from pose estimation neural networks designed for the infant domain, cleanly outperform estimates derived from more widely known networks designed for the adult domain.
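A toy version of the evaluation protocol: an estimated geometric measure is compared against ground truth with Spearman's rho. The synthetic numbers only stand in for the real per-image measurements.

```python
import numpy as np
from scipy.stats import spearmanr

# Compare a geometric symmetry measure derived from predicted landmarks/joints against its
# ground-truth values; rho near 1 corresponds to "very strong" agreement.
rng = np.random.default_rng(0)
gt_measure = rng.uniform(0.8, 1.2, size=100)                  # ground-truth measure per image
est_measure = gt_measure + rng.normal(scale=0.03, size=100)   # measure derived from pose estimation
rho, p_value = spearmanr(est_measure, gt_measure)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.2g})")
```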
https://arxiv.org/abs/2210.15022
Emotion recognition aims to interpret the emotional state of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and emotional state, pioneering methods rely primarily on facial features. However, facial features are often unreliable in natural, unconstrained scenarios, such as crowded scenes, where the face lacks pixel resolution and contains artifacts due to occlusion and blur. To address this, in-the-wild emotion recognition exploits full-body person crops as well as the surrounding scene context. However, in relying on body pose, such methods fail to realize the potential that facial expressions, when available, offer. Thus, the aim of this paper is two-fold. First, we present our method, PERI, which leverages both body pose and facial landmarks. We create part-aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. This allows us to exploit body pose in addition to facial context whenever available. Second, to reason from the PAS images, we introduce context infusion (Cont-In) blocks. These blocks attend to part-specific information and pass it on to the intermediate features of an emotion recognition network. Our approach is conceptually simple and can be applied to any existing emotion recognition method. We report results on the publicly available in-the-wild EMOTIC dataset. Compared to existing methods, PERI achieves superior performance and leads to significant improvements in the mAP of emotion categories, while decreasing Valence, Arousal and Dominance errors. Importantly, we observe that our method improves performance both on images with fully visible faces and on images with occluded or blurred faces.
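A rough sketch of how a part-aware spatial (PAS) image might be formed from available keypoints: disks around body-pose joints and facial landmarks are blurred into a soft mask and applied to the person crop. The radius, blur, and masking rule are assumptions, not PERI's exact construction.

```python
import numpy as np
import cv2


def part_aware_spatial_image(image, keypoints, radius=20):
    """image: HxWx3 uint8 person crop; keypoints: iterable of (x, y) pixel coordinates drawn
    from whichever body joints and facial landmarks were detected."""
    mask = np.zeros(image.shape[:2], dtype=np.float32)
    for x, y in keypoints:
        cv2.circle(mask, (int(x), int(y)), radius, 1.0, thickness=-1)   # disk per keypoint
    mask = cv2.GaussianBlur(mask, (0, 0), sigmaX=radius / 2)            # soften into a soft mask
    mask = mask / (mask.max() + 1e-6)
    return (image.astype(np.float32) * mask[..., None]).astype(np.uint8)
```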
https://arxiv.org/abs/2210.10130
Facial landmark detection plays an important role in similarity analysis of artworks, e.g., for comparing portraits by the same or similar artists. With facial landmarks, portraits of different genres, such as paintings and prints, can be automatically aligned using control-point-based image registration. We propose a deep-learning-based method for facial landmark detection in high-resolution images of paintings and prints. It divides the task between a global network for coarse landmark prediction and multiple region networks for precise landmark refinement in the eye, nose, and mouth regions, which are automatically determined from the predicted global landmark coordinates. We created a synthetically augmented facial landmark art dataset that includes artistic style transfer and geometric landmark shifts. Our method demonstrates accurate detection of the inner facial landmarks on our high-resolution dataset of artworks while remaining comparable to competing methods on a public low-resolution artwork dataset.
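A compact sketch of the global-plus-region idea: a global network predicts coarse landmarks on a downscaled image, crops are taken around selected coarse coordinates in the high-resolution image, and small region networks refine those landmarks inside their crops. Both networks and the crop size are placeholders; the sketch assumes the image is larger than the crop and processes a single image at a time.

```python
import torch
import torch.nn as nn


class CoarseToFineLandmarks(nn.Module):
    """Global coarse prediction followed by per-region refinement in high-resolution crops."""
    def __init__(self, n_global=68, crop=128):
        super().__init__()
        self.crop = crop
        self.global_net = nn.Sequential(nn.AdaptiveAvgPool2d(64), nn.Flatten(),
                                        nn.Linear(3 * 64 * 64, n_global * 2))
        self.region_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * crop * crop, 2))

    def forward(self, image_hr, region_indices):              # image_hr: (1, 3, H, W), H, W >= crop
        h, w = image_hr.shape[-2:]
        coarse = self.global_net(image_hr).view(-1, 2).sigmoid()   # normalized xy in [0, 1]
        refined = coarse.clone()
        for i in region_indices:                               # e.g. indices of eye/nose/mouth landmarks
            cx, cy = int(coarse[i, 0] * w), int(coarse[i, 1] * h)
            x0 = min(max(cx - self.crop // 2, 0), w - self.crop)
            y0 = min(max(cy - self.crop // 2, 0), h - self.crop)
            patch = image_hr[..., y0:y0 + self.crop, x0:x0 + self.crop]
            delta = self.region_net(patch).tanh() * (self.crop / 2)   # pixel offset inside the crop
            refined[i, 0] = (x0 + self.crop / 2 + delta[0, 0]) / w
            refined[i, 1] = (y0 + self.crop / 2 + delta[0, 1]) / h
        return refined                                         # (n_global, 2) refined normalized landmarks
```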
https://arxiv.org/abs/2210.09204