Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with an additional learnable token called the messenger token. Each messenger token aggregates features from all patches except its own. This way, the consensus between a patch and the other patches can be assessed by comparing its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resulting heatmaps into existing FLD methods, our method performs favorably against state-of-the-art approaches on challenging datasets such as WFLW and COFW.
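A minimal sketch of the messenger-token idea, assuming a standard self-attention layout in which the N patch tokens are followed by N messenger tokens; the attention mask, cosine-similarity gating, and threshold below are illustrative assumptions rather than ORFormer's exact design:

```python
# Each of the N patch tokens gets a paired messenger token that attends to every
# patch EXCEPT its own; low agreement between the two embeddings then flags a
# likely occluded patch, whose features are replaced by the messenger's.
import torch
import torch.nn.functional as F

def messenger_attention_mask(num_patches: int) -> torch.Tensor:
    """Boolean mask for [patch tokens | messenger tokens] self-attention.
    True = attention allowed; messenger token i is blocked from patch i."""
    n = num_patches
    mask = torch.ones(2 * n, 2 * n, dtype=torch.bool)
    # Messenger rows occupy indices n..2n-1; block each from its own patch column.
    mask[n + torch.arange(n), torch.arange(n)] = False
    return mask

def occlusion_scores(patch_emb: torch.Tensor, msg_emb: torch.Tensor) -> torch.Tensor:
    """Low patch/messenger agreement -> high occlusion score. Inputs: (B, N, D)."""
    return 1.0 - F.cosine_similarity(patch_emb, msg_emb, dim=-1)  # (B, N)

def recover_features(patch_emb, msg_emb, occ, threshold=0.5):
    """Swap likely-occluded patch features for messenger-aggregated ones."""
    gate = (occ > threshold).unsqueeze(-1).float()
    return gate * msg_emb + (1.0 - gate) * patch_emb
```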
https://arxiv.org/abs/2412.13174
This work presents the IMPROVE dataset, designed to evaluate the effects of mobile phone usage on learners during online education. The dataset not only assesses academic performance and subjective learner feedback but also captures biometric, behavioral, and physiological signals, providing a comprehensive analysis of the impact of mobile phone use on learning. Multimodal data were collected from 120 learners split into three groups with different phone interaction levels. A setup of 16 sensors was used to collect signals that have proven to be effective indicators of learner behavior and cognition, including electroencephalography (EEG), video, and eye-tracking data. The dataset includes metadata from the processed videos, such as face bounding boxes, facial landmarks, and Euler angles for head pose estimation. In addition, learner performance data and self-reported forms are included. Phone usage events were labeled, covering both supervisor-triggered and uncontrolled events. A semi-manual re-labeling system, based on head pose and eye-tracker data, is proposed to improve labeling accuracy. Technical validation confirmed signal quality, with statistical analyses revealing biometric changes during phone use.
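A rough sketch of how the semi-manual re-labeling described above could work, assuming per-frame head-pose and eye-tracker columns in a metadata table; the column names, thresholds, and segment rule are hypothetical, not the dataset's actual pipeline:

```python
# Flag frame spans where the learner likely looked at the phone (head pitched
# down, gaze off screen); a human annotator then confirms or rejects each span.
import pandas as pd

def propose_phone_use_segments(frames: pd.DataFrame,
                               pitch_thresh: float = -20.0,
                               min_len: int = 15) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) spans proposed for manual review."""
    looking_down = frames["head_pitch_deg"] < pitch_thresh     # head tilted down
    gaze_off_screen = ~frames["gaze_on_screen"].astype(bool)   # eye tracker off target
    candidate = (looking_down & gaze_off_screen).to_numpy()

    segments, start = [], None
    for i, flag in enumerate(candidate):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(candidate) - start >= min_len:
        segments.append((start, len(candidate) - 1))
    return segments
```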
https://arxiv.org/abs/2412.14195
Landmark-guided character animation generation is an important research area. Generating character animations with facial features consistent with a reference image remains a significant challenge in conditional video generation, especially for complex motions such as dancing. Existing methods often fail to maintain facial feature consistency due to mismatches between the facial landmarks extracted from source videos and the target facial features in the reference image. To address this problem, we propose a facial landmark transformation method based on the 3D Morphable Model (3DMM). We obtain transformed landmarks that align with the target facial features by reconstructing 3D faces from the source landmarks and adjusting the 3DMM parameters to match the reference image. Our method improves the facial consistency between the generated videos and the reference images, effectively alleviating the facial feature mismatch problem.
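A hedged sketch of the landmark-transformation step, assuming hypothetical `fit_3dmm` and `project` helpers that fit and re-project a 3DMM; the split into shape/expression/pose parameters follows the usual 3DMM convention but is not taken from the paper:

```python
# Fit a 3DMM to the source-frame landmarks, swap in the shape (identity)
# coefficients fitted to the reference image, keep the source expression and
# pose, then re-project to 2D. fit_3dmm/project are placeholder callables.
import numpy as np

def transform_landmarks(src_landmarks_2d: np.ndarray,
                        ref_landmarks_2d: np.ndarray,
                        fit_3dmm, project) -> np.ndarray:
    """Return 2D landmarks that follow the source motion but the reference identity."""
    src_shape, src_expr, src_pose = fit_3dmm(src_landmarks_2d)
    ref_shape, _, _ = fit_3dmm(ref_landmarks_2d)
    # Keep the source expression and pose; take shape (identity) from the reference.
    return project(ref_shape, src_expr, src_pose)
```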
https://arxiv.org/abs/2412.08976
Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
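One plausible form of a landmark attention loss in the spirit described above, assuming row-normalized cross-attention between degraded-image queries and reference-image keys; the shapes, index conventions, and negative-log objective are illustrative assumptions, not the paper's exact loss:

```python
# Encourage the attention from each facial-landmark location in the degraded
# image to concentrate on the matching landmark location in a reference image.
import torch

def landmark_attention_loss(attn: torch.Tensor,
                            query_idx: torch.Tensor,
                            key_idx: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """attn: (B, Nq, Nk) row-normalized attention; *_idx: (B, L) flattened
    spatial indices of the L landmarks in the query/key feature grids."""
    b = torch.arange(attn.size(0)).unsqueeze(-1)          # (B, 1) batch index
    rows = attn[b, query_idx]                             # (B, L, Nk)
    target = rows.gather(-1, key_idx.unsqueeze(-1))       # (B, L, 1) mass on the match
    return -(target.squeeze(-1) + eps).log().mean()
```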
https://arxiv.org/abs/2412.06753
We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.
https://arxiv.org/abs/2412.04000
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution for simultaneous dual control of fine-grained expression and identity. Unlike conventional methods that use coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On the one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate the function of network layers and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
At present, deep neural network methods play a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tends to yield general features and mediocre performance; e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address the above issues, in this paper we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative (i.e., specialized) feature learning. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales, eliminating semantic gaps and ambiguities and enhancing representation ability. Finally, by integrating the DSA model and the DSS model into our proposed DSAT in both dynamic-architecture and dynamic-parameter manners, more specialized features can be learned for more precise face alignment. Interestingly, harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that our proposed DSAT outperforms state-of-the-art models in the literature. The code is available at this https URL.
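A hedged sketch of per-sample channel-pathway activation in the spirit of the DSA model, using a squeeze-and-excitation-style gate with hard top-k selection; this is an illustrative stand-in, not the authors' architecture:

```python
# Score feature channels per sample and activate only the most relevant subset,
# so harder samples can effectively switch on more (higher-scoring) channels.
import torch
import torch.nn as nn

class DynamicChannelGate(nn.Module):
    def __init__(self, channels: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )
        self.k = max(1, int(channels * keep_ratio))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        scores = self.score(x)                              # (B, C) per-sample relevance
        topk = scores.topk(self.k, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
        gate = mask * scores                                # hard select, soft magnitude
        return x * gate.unsqueeze(-1).unsqueeze(-1)
```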
https://arxiv.org/abs/2412.00740
Deepfake facial manipulation has garnered significant public attention due to its impact both on enhancing human experiences and on posing privacy threats. Although numerous passive algorithms have been proposed to thwart malicious Deepfake attacks, they mostly struggle with the generalizability challenge when confronted with hyper-realistic synthetic facial images. To tackle this problem, this paper proposes a proactive Deepfake detection approach by introducing a novel training-free landmark perceptual watermark, LampMark for short. We first analyze the structure-sensitive characteristics of Deepfake manipulations and devise a secure and confidential transformation pipeline from the structural representations, i.e., facial landmarks, to binary landmark perceptual watermarks. Subsequently, we present an end-to-end watermarking framework that imperceptibly and robustly embeds and extracts watermarks for the images to be protected. Relying on promising watermark recovery accuracies, Deepfake detection is accomplished by assessing the consistency between the content-matched landmark perceptual watermark and the robustly recovered watermark of the suspect image. Experimental results demonstrate the superior performance of our approach in watermark recovery and Deepfake detection compared to state-of-the-art methods across in-dataset, cross-dataset, and cross-manipulation scenarios.
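A minimal sketch of the consistency check used for detection, assuming both watermarks are binary vectors; the bit-accuracy threshold is an illustrative assumption:

```python
# Compare the landmark perceptual watermark expected for the claimed content
# with the watermark robustly recovered from the suspect image; flag a Deepfake
# when they disagree too much.
import numpy as np

def bit_accuracy(expected_bits: np.ndarray, recovered_bits: np.ndarray) -> float:
    """Fraction of matching bits between two binary watermarks."""
    return float(np.mean(expected_bits == recovered_bits))

def is_deepfake(expected_bits: np.ndarray,
                recovered_bits: np.ndarray,
                threshold: float = 0.85) -> bool:
    """Low watermark consistency implies the facial structure was manipulated."""
    return bit_accuracy(expected_bits, recovered_bits) < threshold
```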
https://arxiv.org/abs/2411.17209
Facial landmark detection is a fundamental problem in computer vision for many downstream applications. This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC). Based on the observation that the channel dimension of feature maps essentially represents the linear bases of the heatmap space, we propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via a Channel-split ViT. We integrate this channel-split ViT into the standard vision transformer (i.e., spatial-split ViT), forming our Dual Vision Transformer to constitute the prediction blocks. We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision. Extensive experiments are conducted to evaluate the performance of our approach on widely used benchmarks, i.e., WFLW, COFW, and 300W, demonstrating that our model outperforms previous state-of-the-art methods across all three benchmarks.
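A hedged sketch of the dual spatial-split/channel-split idea, where one transformer layer attends over H*W spatial tokens and another over C channel tokens; layer sizes and the sequential composition are illustrative assumptions, not the paper's exact block:

```python
# Spatial branch: each H*W location is a token of dimension C.
# Channel branch: each of the C channels is a token of dimension H*W, letting
# the transformer model relations among the "linear bases" of the heatmap space.
import torch
import torch.nn as nn

class DualViTBlock(nn.Module):
    def __init__(self, channels: int, spatial: int, heads: int = 4):
        super().__init__()
        # Note: channels and spatial (H*W) must both be divisible by `heads`.
        self.spatial_attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.channel_attn = nn.TransformerEncoderLayer(
            d_model=spatial, nhead=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        b, c, h, w = x.shape
        spatial_tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        spatial_tokens = self.spatial_attn(spatial_tokens)
        channel_tokens = spatial_tokens.transpose(1, 2)     # (B, C, H*W)
        channel_tokens = self.channel_attn(channel_tokens)
        return channel_tokens.reshape(b, c, h, w)
```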
https://arxiv.org/abs/2411.07167
Current face anonymization techniques often depend on identity loss calculated by face recognition models, which can be inaccurate and unreliable. Additionally, many methods require supplementary data such as facial landmarks and masks to guide the synthesis process. In contrast, our approach uses diffusion models with only a reconstruction loss, eliminating the need for facial landmarks or masks while still producing images with intricate, fine-grained details. We validated our results on two public benchmarks through both quantitative and qualitative evaluations. Our model achieves state-of-the-art performance in three key areas: identity anonymization, facial attribute preservation, and image quality. Beyond its primary function of anonymization, our model can also perform face swapping tasks by incorporating an additional facial image as input, demonstrating its versatility and potential for diverse applications. Our code and models are available at this https URL .
https://arxiv.org/abs/2411.00762
Automated Facial Expression Recognition (FER) is challenging due to intra-class variations and inter-class similarities. FER can be especially difficult when facial expressions reflect a mixture of various emotions (aka compound expressions). Existing FER datasets, such as AffectNet, provide discrete emotion labels (hard-labels), where a single category of emotion is assigned to an expression. To alleviate inter- and intra-class challenges, as well as provide a better facial expression descriptor, we propose a new approach to create FER datasets through a labeling method in which an image is labeled with more than one emotion (called soft-labels), each with different confidences. Specifically, we introduce the notion of soft-labels for facial expression datasets, a new approach to affective computing for more realistic recognition of facial expressions. To achieve this goal, we propose a novel methodology to accurately calculate soft-labels: a vector representing the extent to which multiple categories of emotion are simultaneously present within a single facial expression. Finding smoother decision boundaries, enabling multi-labeling, and mitigating bias and imbalanced data are some of the advantages of our proposed method. Building upon AffectNet, we introduce AffectNet+, the next-generation facial expression dataset. This dataset contains soft-labels, three categories of data complexity subsets, and additional metadata such as age, gender, ethnicity, head pose, facial landmarks, valence, and arousal. AffectNet+ will be made publicly accessible to researchers.
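A small illustration of the soft-label idea, assuming an eight-way emotion vocabulary and training against the full confidence vector with a soft cross-entropy; the category order and example values are made up, not AffectNet+ annotations:

```python
# A hard label assigns one class; a soft label is a confidence vector over
# several emotions that can express compound expressions.
import torch
import torch.nn.functional as F

EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger", "contempt"]

hard_label = torch.tensor([1])  # "happy" only
soft_label = torch.tensor([[0.05, 0.60, 0.0, 0.30, 0.05, 0.0, 0.0, 0.0]])  # happy + surprise mix

def soft_cross_entropy(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a full target distribution instead of a class index."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.randn(1, len(EMOTIONS))
loss = soft_cross_entropy(logits, soft_label)
```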
https://arxiv.org/abs/2410.22506
The COVID-19 pandemic has underscored the need for low-cost, scalable approaches to measuring contactless vital signs, either during initial triage at a healthcare facility or virtual telemedicine visits. Remote photoplethysmography (rPPG) can accurately estimate heart rate (HR) when applied to close-up videos of healthy volunteers in well-lit laboratory settings. However, results from such highly optimized laboratory studies may not be readily translated to healthcare settings. One significant barrier to the practical application of rPPG in health care is the accurate localization of the region of interest (ROI). Clinical or telemedicine visits may involve sub-optimal lighting, movement artifacts, variable camera angle, and subject distance. This paper presents an rPPG ROI selection method based on 3D facial landmarks and patient head yaw angle. We then demonstrate the robustness of this ROI selection method when coupled to the Plane-Orthogonal-to-Skin (POS) rPPG method when applied to videos of patients presenting to an Emergency Department for respiratory complaints. Our results demonstrate the effectiveness of our proposed approach in improving the accuracy and robustness of rPPG in a challenging clinical environment.
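A hedged sketch of yaw-aware ROI selection, assuming a dense facial-landmark model and a simple yaw threshold to pick the cheek region facing the camera; the landmark index sets, threshold, and sign convention are illustrative assumptions:

```python
# Build a skin-region mask for rPPG from the cheek landmarks on the side of the
# face turned toward the camera, falling back to both cheeks for frontal poses.
import numpy as np
import cv2

LEFT_CHEEK_IDX = [50, 101, 118, 117, 123]      # hypothetical landmark indices
RIGHT_CHEEK_IDX = [280, 330, 347, 346, 352]    # (depend on the landmark model used)

def select_rppg_roi(frame: np.ndarray, landmarks_2d: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Return a binary mask over the cheek region facing the camera."""
    if yaw_deg > 10:          # sign convention assumed: positive yaw = head turned left
        idx = RIGHT_CHEEK_IDX
    elif yaw_deg < -10:
        idx = LEFT_CHEEK_IDX
    else:
        idx = LEFT_CHEEK_IDX + RIGHT_CHEEK_IDX
    hull = cv2.convexHull(landmarks_2d[idx].astype(np.int32))
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 1)
    return mask
```

The mean RGB signal inside this mask per frame would then feed the POS algorithm for heart-rate estimation.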
https://arxiv.org/abs/2410.15851
Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces the Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the fundamental limitations of traditional FLD methods. POPoS employs three key innovations: (1) Pseudo-range multilateration is utilized to correct heatmap errors, enhancing the precision of landmark localization. By integrating multiple anchor points, this approach minimizes the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To improve the pseudo-range accuracy of selected anchor points, a new loss function, named multilateration anchor loss, is proposed. This loss function effectively enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, significantly enhancing computational efficiency and reducing processing time. Comprehensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution scenarios with minimal computational overhead. These features establish POPoS as a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios. The code is available at this https URL
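A worked sketch of pseudo-range multilateration as it might apply here: several anchor points each predict a distance to the landmark, and linear least squares recovers the position, averaging out errors from any single anchor. The anchors and noise level below are toy values:

```python
import numpy as np

def multilaterate(anchors: np.ndarray, distances: np.ndarray) -> np.ndarray:
    """anchors: (K, 2) pixel coordinates; distances: (K,) pseudo-ranges."""
    a0, d0 = anchors[0], distances[0]
    # Subtracting the first circle equation linearizes the system:
    # 2 (a_i - a_0) . x = d_0^2 - d_i^2 + |a_i|^2 - |a_0|^2
    A = 2.0 * (anchors[1:] - a0)
    b = d0**2 - distances[1:]**2 + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Toy check: landmark at (10, 20), four anchors with noisy distance estimates.
anchors = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0], [30.0, 30.0]])
true_pt = np.array([10.0, 20.0])
dists = np.linalg.norm(anchors - true_pt, axis=1) + np.random.normal(0, 0.2, 4)
print(multilaterate(anchors, dists))   # approximately [10, 20]
```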
https://arxiv.org/abs/2410.09583
Three-dimensional (3D) reconstruction has emerged as a prominent area of research, attracting significant attention from academia and industry alike. Among the various applications of 3D reconstruction, facial reconstruction poses some of the most formidable challenges. Additionally, each individual's facial structure is unique, requiring algorithms to be robust enough to handle this variability while maintaining fidelity to the original features. This article presents a comprehensive dataset of 3D meshes featuring a diverse range of facial structures and corresponding facial landmarks. The dataset comprises 188 3D facial meshes, including 73 from female candidates and 114 from male candidates. It encompasses a broad representation of ethnic backgrounds, with contributions from 45 different ethnicities, ensuring rich diversity in facial characteristics. Each facial mesh is accompanied by key points that accurately annotate the relevant features, facilitating precise analysis and manipulation. This dataset is particularly valuable for applications such as facial retargeting, the study of facial structure components, and real-time person representation in video streams. By providing a robust resource for researchers and developers, it aims to advance the field of 3D facial reconstruction and related technologies.
https://arxiv.org/abs/2410.07415
Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.
https://arxiv.org/abs/2410.06149
Audio-driven video portrait synthesis is a crucial and useful technology in virtual human interaction and film-making applications. Recent advancements have focused on improving the image fidelity and lip-synchronization. However, generating accurate emotional expressions is an important aspect of realistic talking-head generation, which has remained underexplored in previous works. We present a novel system in this paper for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Specifically, we utilize a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks. These landmarks are concatenated with emotional embeddings to produce emotional landmarks through our motion-to-emotion module. These emotional landmarks are then used to render realistic emotional talking-head video using a Neural Radiance Fields (NeRF)-based emotion-to-video module. Additionally, we propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs. Extensive experiments demonstrate that our method obtains more accurate emotion generation with higher fidelity.
https://arxiv.org/abs/2410.17262
Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored. Traditional approaches with VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face counting and emotion categorization while preserving the enriched image context. Through extensive experiments and analysis of recent commercial and open-source VLLMs, we evaluate the SoV approach's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.
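A hedged sketch of Set-of-Vision-style visual prompting, assuming face boxes and landmarks are already detected and simply drawn onto the image before it is sent to a VLLM; colors, thicknesses, and the follow-up text prompt are illustrative:

```python
# Overlay numbered face bounding boxes and landmark dots so the model can be
# asked about "face 1", "face 2", ... while still seeing the full scene.
import cv2
import numpy as np

def draw_vision_prompts(image: np.ndarray, faces: list[dict]) -> np.ndarray:
    """faces: [{'box': (x1, y1, x2, y2), 'landmarks': (N, 2) array}, ...]"""
    out = image.copy()
    for i, face in enumerate(faces, start=1):
        x1, y1, x2, y2 = face["box"]
        cv2.rectangle(out, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(out, str(i), (x1, max(0, y1 - 8)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        for (px, py) in np.asarray(face["landmarks"], dtype=int):
            cv2.circle(out, (int(px), int(py)), 2, (0, 0, 255), -1)
    return out

# The marked image is then sent to the VLLM with a text prompt such as
# "For each numbered face, name the most likely emotion."
```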
https://arxiv.org/abs/2410.02244
Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing deep interpretable models to be trained. During training, this AU codebook is used, along with the input image's expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on image-level expression labels for supervision, without additional manual annotations. The strategy is generic and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that the proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on class activation mapping (CAM) methods and show that our approach can also improve CAM interpretability.
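A hedged sketch of the composite objective, assuming the classifier exposes a spatial attention map that is compared with the AU heatmap via cosine correlation; the correlation term and the weighting are assumptions, not the paper's exact loss:

```python
# Classify the expression while encouraging the classifier's spatial attention
# to align with the AU heatmap built from the expression label and landmarks.
import torch
import torch.nn.functional as F

def composite_loss(logits: torch.Tensor,       # (B, num_classes)
                   labels: torch.Tensor,       # (B,)
                   attn_map: torch.Tensor,     # (B, H, W) classifier spatial attention
                   au_heatmap: torch.Tensor,   # (B, H, W) expert AU heatmap
                   lam: float = 1.0) -> torch.Tensor:
    cls = F.cross_entropy(logits, labels)
    a = attn_map.flatten(1)
    h = au_heatmap.flatten(1)
    corr = F.cosine_similarity(a, h, dim=1).mean()   # higher = better aligned
    return cls + lam * (1.0 - corr)
```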
https://arxiv.org/abs/2410.01848
This paper presents VisioPhysioENet, a novel multimodal system that leverages visual cues and physiological signals to detect learner engagement. It employs a two-level approach for visual feature extraction using the Dlib library for facial landmark extraction and the OpenCV library for further estimations. This is complemented by extracting physiological signals using the plane-orthogonal-to-skin method to assess cardiovascular activity. These features are integrated using advanced machine learning classifiers, enhancing the detection of various engagement levels. We rigorously evaluate VisioPhysioENet on the DAiSEE dataset, where it achieves an accuracy of 63.09%, demonstrating a superior ability to discern various levels of engagement compared to existing methodologies. The proposed system's code can be accessed at this https URL.
https://arxiv.org/abs/2409.16126
Localization of craniofacial landmarks in lateral cephalograms is a fundamental task in cephalometric analysis. The automation of this task has thus been the subject of intense research over the past decades. In this paper, we introduce the "Cephalometric Landmark Detection (CL-Detection)" dataset, the largest publicly available and most comprehensive dataset for cephalometric landmark detection. This multi-center, multi-vendor dataset includes 600 lateral X-ray images with 38 landmarks, acquired with different equipment at three medical centers. The overarching objective of this paper is to measure how far state-of-the-art deep learning methods can go in cephalometric landmark detection. Following the 2023 MICCAI CL-Detection Challenge, we report the results of the top ten research groups using deep learning methods. The results show that the best methods closely approximate expert analysis, achieving a mean detection rate of 75.719% and a mean radial error of 1.518 mm. While there is room for improvement, these findings undeniably open the door to highly accurate and fully automatic localization of craniofacial landmarks. We also identify scenarios in which deep learning methods still fail. Both the dataset and detailed results are publicly available online, and the platform will remain open for the community to benchmark future algorithm developments at this https URL.
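A small sketch of the two metrics quoted above, mean radial error (MRE) and the detection rate within a clinical tolerance, assuming landmark predictions in pixels and a known pixel spacing; how the spacing is stored per image is an assumption:

```python
import numpy as np

def mre_and_sdr(pred: np.ndarray, gt: np.ndarray,
                pixel_spacing_mm: float, tolerance_mm: float = 2.0):
    """pred, gt: (num_images, num_landmarks, 2) pixel coordinates.
    Returns mean radial error in mm and the detection rate within tolerance."""
    radial_err_mm = np.linalg.norm(pred - gt, axis=-1) * pixel_spacing_mm
    mre = radial_err_mm.mean()
    sdr = (radial_err_mm <= tolerance_mm).mean()
    return mre, sdr
```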
https://arxiv.org/abs/2409.15834