Multimodal Prompt Alignment for Facial Expression Recognition

2025-06-26 05:28:57
Fuyan Ma, Yiran He, Bin Sun, Shutao Li

Abstract

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) such as CLIP to various downstream tasks. Despite this success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that uses a large language model (LLM) such as ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
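
The two alignment objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the "feature discrepancy" or the prototype alignment is measured, so the cosine distance used here is an assumption, and the function names (prompt_alignment_loss, prototype_alignment_loss) are hypothetical rather than taken from the paper. Features are plain Python lists standing in for CLIP text and image embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def prompt_alignment_loss(soft_feats, hard_feats):
    """Mean cosine distance between soft-prompt and hard-prompt features.

    Assumption: one feature vector per expression class in each list.
    Cosine distance is an illustrative choice, not the paper's stated loss.
    """
    if len(soft_feats) != len(hard_feats):
        raise ValueError("need one hard-prompt feature per soft-prompt feature")
    distances = [1.0 - cosine_similarity(s, h)
                 for s, h in zip(soft_feats, hard_feats)]
    return sum(distances) / len(distances)


def prototype_alignment_loss(visual_feats, labels, prototypes):
    """Mean cosine distance between each visual feature and its class prototype.

    Assumption: prototypes[c] is a fixed feature vector for class c.
    """
    distances = [1.0 - cosine_similarity(f, prototypes[y])
                 for f, y in zip(visual_feats, labels)]
    return sum(distances) / len(distances)


# Toy 2-D example: class 0 is already aligned, class 1 is orthogonal.
soft = [[1.0, 0.0], [0.0, 1.0]]
hard = [[1.0, 0.0], [1.0, 0.0]]
text_loss = prompt_alignment_loss(soft, hard)

# Visual features that point the same way as their prototypes.
visual = [[2.0, 0.0], [0.0, 3.0]]
protos = [[1.0, 0.0], [0.0, 1.0]]
visual_loss = prototype_alignment_loss(visual, [0, 1], protos)
```

In this toy case the text-side loss averages a distance of 0 for the aligned class and 1 for the orthogonal one, while the visual-side loss is 0 because each feature is parallel to its prototype. In a real setup these terms would be computed on encoder outputs and minimized with respect to the learnable prompts only.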

URL

https://arxiv.org/abs/2506.21017

PDF

https://arxiv.org/pdf/2506.21017.pdf

