Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for ITW valence-arousal estimation. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame level while using audio as complementary context. Results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
https://arxiv.org/abs/2603.13056
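The Concordance Correlation Coefficient used above to score valence-arousal predictions can be computed directly from its definition; a minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two 1-D series.

    Unlike Pearson correlation, CCC also penalizes differences in
    mean and scale between predictions and ground truth.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```

A constant offset in otherwise perfect predictions lowers CCC below 1, which is why the metric is preferred over plain correlation for continuous affect.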
Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
https://arxiv.org/abs/2603.12848
This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If this model's confidence exceeds a threshold, its prediction is used. Otherwise, we feed the embeddings into a simple multi-layer perceptron trained on the Aff-Wild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
https://arxiv.org/abs/2603.12693
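The sliding-window smoothing step can be sketched as a centered moving average over frame-wise class scores; a minimal NumPy illustration (window size and truncated handling of boundary frames are assumptions, not from the paper):

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Smooth frame-wise class scores (T x C) with a centered moving average.

    Edge frames use a truncated window, so output length equals input length.
    """
    scores = np.asarray(scores, float)
    half = window // 2
    out = np.empty_like(scores)
    for t in range(len(scores)):
        lo, hi = max(0, t - half), min(len(scores), t + half + 1)
        out[t] = scores[lo:hi].mean(axis=0)
    return out

# A single-frame spike gets attenuated, illustrating noise suppression.
x = np.zeros((7, 1))
x[3] = 1.0
smoothed = smooth_scores(x, window=5)
```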
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
https://arxiv.org/abs/2603.12221
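The abstract does not specify the exact form of the lightweight gated fusion module; a common lightweight choice is a per-dimension sigmoid gate computed from the concatenated audio-visual features. A NumPy sketch under that assumption (the weights are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(visual, audio, W, b):
    """Mix two frame-level feature vectors with a learned sigmoid gate.

    The gate is computed from the concatenated features, so each output
    dimension is a convex combination of the visual and audio values.
    """
    z = np.concatenate([visual, audio])
    gate = sigmoid(W @ z + b)            # per-dimension weight in (0, 1)
    return gate * visual + (1.0 - gate) * audio

d = 4
W = rng.normal(size=(d, 2 * d))          # stand-in for learned weights
b = np.zeros(d)
v, a = rng.normal(size=d), rng.normal(size=d)
fused = gated_fusion(v, a, W, b)
```

Because the gate lies in (0, 1), the fused feature is bounded elementwise by the two input streams, which keeps either modality from dominating unboundedly.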
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4-12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
https://arxiv.org/abs/2603.11991
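Embedding-based zero-shot classification reduces to nearest-label matching in embedding space. A toy sketch with hand-made vectors standing in for a real text-embedding model (the three-dimensional "embeddings" are purely illustrative):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(doc_vec, label_vecs):
    """Return the index of the label description closest to the document."""
    sims = [cosine(doc_vec, v) for v in label_vecs]
    return int(np.argmax(sims))

# Toy 3-dim vectors standing in for real text embeddings of the document
# and of two human-readable label descriptions.
labels = {"positive": np.array([1.0, 0.1, 0.0]),
          "negative": np.array([0.0, 0.1, 1.0])}
doc = np.array([0.9, 0.2, 0.1])   # closer to the "positive" description
pred = zero_shot_classify(doc, list(labels.values()))
```

No labeled training examples are involved: the only supervision is the label description text itself, which is what makes the setting genuinely zero-shot.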
Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
https://arxiv.org/abs/2603.11971
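The bi-directional cross-attention fusion can be illustrated with single-head attention applied in both directions, which also shows why differing frame counts per modality are unproblematic (dimensions and data here are random stand-ins, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: queries from one modality attend to the other."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))  # rows sum to 1
    return attn @ kv_feats

vis = rng.normal(size=(8, 16))    # 8 video frames, 16-dim features
aud = rng.normal(size=(12, 16))   # 12 audio frames, 16-dim features
vis_ctx = cross_attention(vis, aud)   # visual queries attend to audio
aud_ctx = cross_attention(aud, vis)   # audio queries attend to visual
```

Each direction returns one context vector per query frame, so both outputs keep their own modality's temporal resolution while absorbing information from the other stream.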
Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while mitigating drift throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, the resulting system, LiveAct, enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
https://arxiv.org/abs/2603.11746
In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. However, the employment of large language models (LLMs) risks reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. The objective of this paper is to evaluate and measure this phenomenon, analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background, focusing on under-35-year-old Italian graduates. The model has been prompted to suggest jobs to 24 simulated candidate profiles, which are balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles and industry, gendered linguistic patterns emerged in the adjectives attributed to female and male candidates, indicating a tendency of the model to associate women with emotional and empathetic traits, while men with strategic and analytical ones. The research raises an ethical question regarding the use of these models in sensitive processes, highlighting the need for transparency and fairness in future digital labour markets.
https://arxiv.org/abs/2603.11736
The expression of affect is integral to spoken communication, yet its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion alongside acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC, and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces.
https://arxiv.org/abs/2603.11715
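The reported 0.845 AUC is the probability that a randomly chosen positive sample (e.g. frustration) scores above a randomly chosen negative one; a direct pairwise computation makes this interpretation concrete (pure Python, names illustrative):

```python
def auc(scores_pos, scores_neg):
    """AUC = probability a positive scores above a negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

This O(n*m) form is fine for small evaluation sets; large-scale implementations use the equivalent rank-sum statistic instead.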
We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduces counterfactual training objectives to disentangle emotional prosody from linguistic content. By formulating a structural causal model of how text (content), emotion, and speaker jointly influence prosody (duration, pitch, energy) and ultimately the speech waveform, we derive two complementary loss terms: an Indirect Path Constraint (IPC) to enforce that emotion affects speech only through prosody, and a Counterfactual Prosody Constraint (CPC) to encourage distinct prosody patterns for different emotions. The resulting model is trained on multi-speaker emotional corpora (LibriTTS, EmoV-DB, VCTK) with a combined objective that includes standard spectrogram reconstruction and variance prediction losses alongside our causal losses. In evaluations on expressive speech synthesis, our method achieves significantly improved prosody manipulation and emotion rendering, with higher mean opinion scores (MOS) and emotion accuracy than baseline FastSpeech2 variants. We also observe better intelligibility (low WER) and speaker consistency when transferring emotions across speakers. Extensive ablations confirm that the causal objectives successfully separate prosody attribution, yielding an interpretable model that allows controlled counterfactual prosody editing (e.g. "same utterance, different emotion") without compromising naturalness. We discuss the implications for identifiability in prosody modeling and outline limitations such as the assumption that emotion effects are fully captured by pitch, duration, and energy. Our work demonstrates how integrating causal learning principles into TTS can improve controllability and expressiveness in generated speech.
https://arxiv.org/abs/2603.11683
Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal dynamics, often overlooking the fact that modality reliability can vary substantially across interaction stages. To address this issue, we propose SAGE, a Stage-Adaptive reliability modeling framework that explicitly estimates and calibrates modality-wise confidence during multimodal integration. SAGE introduces a reliability-aware fusion mechanism that dynamically rebalances audio and visual representations according to their stage-dependent informativeness, preventing unreliable signals from dominating the prediction process. By separating reliability estimation from feature representation, the proposed framework enables more stable emotion estimation under cross-modal noise, occlusion, and varying interaction conditions. Extensive experiments on the Aff-Wild2 benchmark demonstrate that SAGE consistently improves concordance correlation coefficient scores compared with existing multimodal fusion approaches, highlighting the effectiveness of reliability-driven modeling for continuous affect prediction.
https://arxiv.org/abs/2603.11468
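The abstract does not spell out SAGE's fusion mechanism; a minimal version of reliability-aware rebalancing weights each modality's features by a softmax over scalar confidence estimates, so an unreliable stream cannot dominate the fused representation. A NumPy sketch under that assumption:

```python
import numpy as np

def reliability_weighted_fusion(features, confidences):
    """Weight per-modality features by a softmax over scalar reliability scores.

    features: dict modality -> 1-D feature vector (same length)
    confidences: dict modality -> scalar reliability estimate
    Returns the fused vector and the normalized weights.
    """
    names = list(features)
    c = np.array([confidences[m] for m in names], float)
    w = np.exp(c - c.max())
    w /= w.sum()
    fused = sum(wi * features[m] for wi, m in zip(w, names))
    return fused, dict(zip(names, w))

# Toy example: the visual stream is judged more reliable at this stage.
feat = {"audio": np.array([1.0, 0.0]), "visual": np.array([0.0, 1.0])}
fused, weights = reliability_weighted_fusion(feat, {"audio": 0.0, "visual": 2.0})
```

In SAGE the confidences would be stage-dependent and learned; here they are fixed scalars purely for illustration.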
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the privacy risks of transmitting biometric audio. While smaller audio-language models enable efficient on-device SEC, their limited capacity often weakens subtle paralinguistic modeling and fine-grained affective grounding. We propose an edge-cloud collaborative framework based on Uncertainty-Guided Speculative Decoding (UGSD). A lightweight edge model drafts captions locally, and only high-uncertainty token blocks are selectively escalated to a stronger cloud verifier for validation. Experiments on the MER2024 benchmark demonstrate substantial BLEU improvements of up to 62.7%. UGSD further achieves 1.4x lower latency and 8.5x higher token throughput compared to an edge-only model. These results empirically characterize the quality-efficiency-privacy trade-off in deployable SEC systems.
https://arxiv.org/abs/2603.11397
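The escalation rule can be sketched as thresholding the draft model's mean per-token predictive entropy over each block; blocks above the threshold go to the cloud verifier, the rest stay on-device. The threshold value and block structure below are illustrative assumptions, since the abstract does not give UGSD's internals:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def escalate_blocks(block_probs, threshold):
    """Flag token blocks whose mean predictive entropy exceeds the threshold.

    block_probs: list of blocks, each a list of per-token probability
    distributions from the edge (draft) model.
    """
    flags = []
    for block in block_probs:
        mean_h = sum(entropy(p) for p in block) / len(block)
        flags.append(mean_h > threshold)
    return flags

# A confident block stays local; a near-uniform (uncertain) block escalates.
flags = escalate_blocks([[[0.99, 0.01]], [[0.5, 0.5]]], threshold=0.3)
```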
Ableist microaggressions remain pervasive in everyday interactions, yet interventions to help people recognize them are limited. We present an experiment testing how AI-mediated dialogue influences recognition of ableism. 160 participants completed a pre-test, intervention, and post-test across four conditions: AI nudges toward bias (Bias-Directed), inclusion (Neutral-Directed), unguided dialogue (Self-Directed), and a text-only non-dialogue condition (Reading). Participants rated scenarios on standardness of social experience and emotional impact; those in dialogue-based conditions also provided qualitative reflections. Quantitative results showed that dialogue-based conditions produced stronger recognition than Reading, though trajectories diverged: biased nudges improved differentiation of bias from neutrality but increased overall negativity. Inclusive or no nudges remained more balanced, while Reading participants showed weaker gains and even declines. Qualitative findings revealed that biased nudges were often rejected, while inclusive nudges were adopted as scaffolding. We contribute a validated vignette corpus, an AI-mediated intervention platform, and design implications highlighting trade-offs conversational systems face when integrating bias-related nudges.
https://arxiv.org/abs/2603.11274
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific attributes such as emotion or speaker gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with automatic speaker verification (ASV) capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
https://arxiv.org/abs/2603.10827
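The equal error rate (EER) quoted above is the operating point where the false-accept and false-reject rates coincide. A coarse sweep-based approximation (taking, over candidate thresholds, the minimum of the mean of the two error rates; production toolkits interpolate the ROC curve instead):

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate from verification scores.

    scores: higher means "same speaker"; labels: 1 = target trial, 0 = impostor.
    Returns min over candidate thresholds of (FAR + FRR) / 2, a coarse
    approximation to the point where the two error rates cross.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best = 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # impostors accepted
        frr = np.mean(scores[labels == 1] < t)    # targets rejected
        best = min(best, (far + frr) / 2.0)
    return best
```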
Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols (MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification) as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
https://arxiv.org/abs/2603.10784
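The cross-protocol agreement figures (kappa = 0.001 vs 0.986) use Cohen's kappa, i.e. chance-corrected pairwise agreement; for binary metaphor/non-metaphor decisions it reduces to a few lines:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two binary label sequences.

    po is observed agreement; pe is the agreement expected by chance
    given each rater's marginal label frequencies.
    """
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Kappa near 0 (as between Protocols A and D) means the two protocols agree no more often than chance, despite each being internally reproducible.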
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.
https://arxiv.org/abs/2603.11095
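TaRoPE's internals are not given in the abstract; one plausible reading is rotary position embeddings whose rotation angle is derived from wall-clock time rather than token index, so audio and video tokens that coincide in time receive identical rotations despite different frame rates. A single-frequency sketch under that assumption:

```python
import numpy as np

def rope_by_time(x, t, base_freq=1.0):
    """Rotate consecutive feature pairs by an angle proportional to timestamp t.

    Using time in seconds instead of token index means tokens from streams
    with different frame rates align whenever their timestamps coincide.
    """
    x = np.asarray(x, float).reshape(-1, 2)
    theta = base_freq * t
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (x @ rot.T).ravel()

# Audio frame 10 at 50 Hz and video frame 5 at 25 Hz both occur at t = 0.2 s,
# so they receive the same positional rotation despite different indices.
feat = np.array([1.0, 0.0, 0.5, 0.5])
rot_audio = rope_by_time(feat, 10 / 50.0)
rot_video = rope_by_time(feat, 5 / 25.0)
```

As with standard RoPE, the rotation is norm-preserving, so positional encoding does not distort feature magnitudes.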
Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with an emotion agent and a writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.
https://arxiv.org/abs/2603.10349
Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect" in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
https://arxiv.org/abs/2603.09963
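The recruitment/cross-inhibition dynamics with emotional modulation can be sketched as a minimal agent-based simulation. The modulation rules below (valence shifting recruitment, arousal scaling interaction rates) are illustrative assumptions; the abstract does not specify the paper's exact mapping, and all parameter values are invented for the sketch.

```python
import random

def simulate_consensus(n_agents=100, steps=2000, valence=(0.5, -0.5),
                       arousal=(0.8, 0.8), base_recruit=0.05,
                       base_inhibit=0.02, seed=0):
    """Toy bee-equation dynamics: agents are uncommitted (None) or
    committed to option 0/1; committed agents recruit uncommitted ones
    and cross-inhibit opponents, with rates modulated per option by
    valence and arousal (hypothetical modulation form)."""
    rng = random.Random(seed)
    agents = [None] * n_agents
    for k in range(5):           # seed a few initial supporters per option
        agents[k] = 0
        agents[n_agents - 1 - k] = 1
    for _ in range(steps):
        i, j = rng.randrange(n_agents), rng.randrange(n_agents)
        if i == j or agents[j] is None:
            continue
        opt = agents[j]
        recruit = base_recruit * (1 + valence[opt]) * (1 + arousal[opt])
        inhibit = base_inhibit * (1 + arousal[opt])
        if agents[i] is None and rng.random() < recruit:
            agents[i] = opt                      # recruitment
        elif agents[i] not in (None, opt) and rng.random() < inhibit:
            agents[i] = None                     # cross-inhibition
    return [agents.count(0), agents.count(1)]

counts = simulate_consensus()
```

With positive valence for option 0 and negative for option 1, recruitment toward option 0 is faster on average, biasing (though not guaranteeing) the consensus outcome, which mirrors the paper's first scenario.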
Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: this https URL
https://arxiv.org/abs/2603.09874
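The two diagnostics can be illustrated with simple stand-in formulas. These are hypothetical instantiations only: the abstract names MEI and MLI but does not give their exact definitions, so the spread-based equity score and the max/min gradient-norm ratio below are assumptions chosen to match the stated intent.

```python
import statistics

def modality_equity_index(contributions):
    """Illustrative equity score: 1 minus the normalized spread of
    per-modality contribution scores, so 1.0 means perfectly even
    contributions (not the paper's exact MEI formula)."""
    vals = list(contributions.values())
    mean = statistics.fmean(vals)
    if mean == 0:
        return 1.0
    return 1.0 - (max(vals) - min(vals)) / (2 * mean)

def modality_learning_index(grad_norms):
    """Illustrative optimization-imbalance score: ratio of the largest
    to the smallest mean gradient norm across modality-specific
    modules; 1.0 would indicate perfectly balanced learning."""
    means = [statistics.fmean(v) for v in grad_norms.values()]
    return max(means) / min(means)

# Toy numbers: text dominates both contribution and gradient magnitude.
mei = modality_equity_index({"text": 0.42, "audio": 0.30, "vision": 0.28})
mli = modality_learning_index({"text": [1.2, 1.1, 1.3],
                               "audio": [0.4, 0.5, 0.45],
                               "vision": [0.6, 0.7, 0.65]})
```

The point of separating the two scores is the same as in the paper: a model can distribute task contribution fairly evenly (high equity) while still training its modality branches at very different rates (high learning imbalance), or vice versa.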
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
https://arxiv.org/abs/2603.09820
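The atomic-verification paradigm can be sketched end to end. Everything here is a simplification: clause-level splitting stands in for the (presumably LLM-based) decomposition into Atomic Perceptual Units, and `verify` is a placeholder for EmoSURA's audio-grounded verifier that would check each unit against the raw speech signal.

```python
import re

def decompose_caption(caption):
    """Crudely split a long-form caption into candidate atomic
    perceptual units (clause-level statements); a stand-in for the
    framework's real decomposition step."""
    parts = re.split(r"[;.]\s*", caption.strip())
    return [p.strip() for p in parts if p.strip()]

def emosura_style_score(caption, verify):
    """Aggregate atomic verification into a caption score: the fraction
    of units the verifier accepts. Scoring per unit, rather than
    holistically, is what makes the metric insensitive to length."""
    units = decompose_caption(caption)
    if not units:
        return 0.0
    return sum(1 for u in units if verify(u)) / len(units)

caption = ("The speaker sounds cheerful; her pitch rises toward the end. "
           "The pace is slow and deliberate.")
# Toy verifier accepting units that mention pitch or pace (illustration only).
score = emosura_style_score(caption, lambda u: "pitch" in u or "pace" in u)
```

Here two of the three units pass, so the caption scores 2/3 regardless of how verbose each accepted statement is, which is the property that lets this style of metric avoid the length sensitivity that gives N-gram metrics their negative correlations.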