Abstract
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) such as CLIP to various downstream tasks. Despite this success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) such as ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
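The abstract does not give the exact form of the two alignment objectives, so the following is only a minimal sketch under stated assumptions. It assumes cosine distance as the discrepancy measure and one feature vector per expression class. The function names `prompt_alignment_loss` and `prototype_alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_alignment_loss(soft_feats, hard_feats):
    """Mean cosine distance between per-class soft and hard prompt features.

    Assumption: soft_feats[c] is the text feature of the learnable (soft)
    prompt for class c, and hard_feats[c] is the text feature of the
    LLM-generated (hard) description for the same class.
    """
    distances = [
        1.0 - cosine_similarity(s, h) for s, h in zip(soft_feats, hard_feats)
    ]
    return sum(distances) / len(distances)


def prototype_alignment_loss(visual_feat, prototypes, label):
    """Cosine distance between a prompted visual feature and its class prototype.

    Assumption: prototypes[label] is a fixed class-specific prototype,
    for example one derived from the frozen image encoder.
    """
    return 1.0 - cosine_similarity(visual_feat, prototypes[label])


# Toy 2-D example: class 0 is perfectly aligned, class 1 is orthogonal.
soft = [[1.0, 0.0], [0.0, 1.0]]
hard = [[1.0, 0.0], [1.0, 0.0]]
text_loss = prompt_alignment_loss(soft, hard)

prototypes = [[1.0, 0.0], [0.0, 1.0]]
visual_loss = prototype_alignment_loss([2.0, 0.0], prototypes, label=0)
```

In this toy setup `text_loss` is the average of a zero distance and a unit distance, and `visual_loss` is zero because cosine distance ignores vector magnitude. A real training loop would compute these terms on CLIP text and image features and combine them with the classification loss; the weighting between terms is not specified in the abstract.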
URL
https://arxiv.org/abs/2506.21017