Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
https://arxiv.org/abs/2505.16931
During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
https://arxiv.org/abs/2505.16900
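The re-weighting idea behind PDL can be illustrated with a small sketch. The abstract does not state the exact weighting formula, so the form used here (weight proportional to relative frequency raised to the power `-alpha`, normalised so the mean weight over the vocabulary is 1) is an assumption made for illustration, as are the names `power_law_weights`, `weighted_cross_entropy`, and `alpha`.

```python
import math
from collections import Counter


def power_law_weights(corpus_tokens, alpha=1.0):
    """Assumed weighting: w(t) proportional to rel_freq(t) ** -alpha, mean weight 1."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    raw = {tok: (c / total) ** -alpha for tok, c in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {tok: w / mean for tok, w in raw.items()}


def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Weighted mean of -log p(target); tokens unseen in the corpus get weight 1.0."""
    loss, weight_sum = 0.0, 0.0
    for tok, p in zip(target_tokens, target_probs):
        w = weights.get(tok, 1.0)
        loss += -w * math.log(p)
        weight_sum += w
    return loss / weight_sum


# Toy corpus: "the" is frequent, "cat" is rare.
corpus = ["the"] * 8 + ["cat"] * 2
weights = power_law_weights(corpus, alpha=1.0)

# The model is confident on "the" and unsure on "cat".
targets, probs = ["the", "cat"], [0.9, 0.3]
plain = weighted_cross_entropy(targets, probs, power_law_weights(corpus, alpha=0.0))
decayed = weighted_cross_entropy(targets, probs, weights)
```

With `alpha=0` every weight is 1 and the loss reduces to ordinary cross-entropy; with `alpha=1` the error on the rare token dominates, so `decayed` is larger than `plain` in this toy case.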
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
https://arxiv.org/abs/2505.16819
Small large language models (sLLMs) are lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, the model must reject off-topic or malicious inputs and adhere to its intended functionality so as to prevent potential misuse and uphold reliability. To this end, activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. To address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves a significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
https://arxiv.org/abs/2505.16526
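A minimal sketch of what "entropy-scaled steering" could look like is given below. The abstract only says that the steering intensity depends on input uncertainty; the specific choices here are assumptions for illustration: uncertainty is measured as the normalised entropy of a probability distribution, the coefficient grows linearly with it, and the steering vector is simply added to a hidden activation. The names `steering_strength`, `steer`, and `base` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def steering_strength(probs, base=4.0):
    """Assumed rule: scale a base coefficient by normalised entropy in [0, 1]."""
    max_entropy = math.log(len(probs))
    if max_entropy == 0:
        return 0.0
    return base * entropy(probs) / max_entropy


def steer(hidden, vector, probs, base=4.0):
    """Add the steering vector to a hidden activation, scaled by uncertainty."""
    coeff = steering_strength(probs, base)
    return [h + coeff * v for h, v in zip(hidden, vector)]


hidden = [0.5, -1.0, 2.0]
refusal_direction = [1.0, 0.0, -1.0]  # hypothetical steering vector

confident = steer(hidden, refusal_direction, [1.0, 0.0, 0.0, 0.0])
uncertain = steer(hidden, refusal_direction, [0.25, 0.25, 0.25, 0.25])
```

Under these assumptions a confident input leaves the activation untouched, while a maximally uncertain input receives the full `base` coefficient. Whether the actual method scales in this direction is not stated in the abstract.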
Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. In detail, the LSE-D, SPK-SIM, EMO-SIM, and MCD metrics improve by up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.
https://arxiv.org/abs/2505.16279
This paper presents "Dialogue in Resonance", an interactive music piece for a human pianist and a computer-controlled piano that integrates real-time automatic music transcription into a score-driven framework. Unlike previous approaches that primarily focus on improvisation-based interactions, our work establishes a balanced framework that combines composed structure with dynamic interaction. With automatic transcription as its core mechanism, the computer interprets and responds to the human performer's input in real time, creating a musical dialogue that balances compositional intent with live interaction while incorporating elements of unpredictability. In this paper, we present the development process from composition to premiere performance, including technical implementation, rehearsal process, and performance considerations.
https://arxiv.org/abs/2505.16259
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
https://arxiv.org/abs/2505.15957
We propose a large language model-based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first, text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second, multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
https://arxiv.org/abs/2505.15922
Large language models (LLMs) hold significant potential for mental health support, as they can generate empathetic responses and simulate therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and in incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we propose a novel automated data synthesis pipeline. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions: comprehensiveness, professionalism, authenticity, and safety. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark.
https://arxiv.org/abs/2505.15715
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that does not require speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
https://arxiv.org/abs/2505.15670
Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.
https://arxiv.org/abs/2505.15490
Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
https://arxiv.org/abs/2505.15456
Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
https://arxiv.org/abs/2505.15347
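The multi-turn isolation idea described for FlowKV can be sketched with a toy cache. This is not the authors' implementation: KV pairs are stood in for by `(position, importance_score)` tuples, the compressor `keep_top_half` is a made-up placeholder for any KV compression method, and the class names are hypothetical. The sketch only contrasts compressing new entries once against re-compressing the whole cache every turn.

```python
def keep_top_half(entries):
    """Toy compressor: keep the higher-scoring half, restored to positional order."""
    if not entries:
        return []
    k = max(1, len(entries) // 2)
    top = sorted(entries, key=lambda e: e[1], reverse=True)[:k]
    return sorted(top)  # tuples sort by position first


class IsolatedCache:
    """Compress only the latest turn; earlier compressed entries are frozen."""

    def __init__(self, compress):
        self.compress = compress
        self.frozen = []

    def end_turn(self, new_entries):
        self.frozen.extend(self.compress(new_entries))
        return list(self.frozen)


class RecompressingCache:
    """Baseline: re-compress the whole cache, old context included, every turn."""

    def __init__(self, compress):
        self.compress = compress
        self.cache = []

    def end_turn(self, new_entries):
        self.cache = self.compress(self.cache + new_entries)
        return list(self.cache)


turns = [
    [(0, 0.9), (1, 0.1), (2, 0.8), (3, 0.2)],
    [(4, 0.5), (5, 0.6), (6, 0.7), (7, 0.4)],
    [(8, 0.3), (9, 0.95), (10, 0.2), (11, 0.85)],
]

isolated, baseline = IsolatedCache(keep_top_half), RecompressingCache(keep_top_half)
for turn in turns:
    isolated_final = isolated.end_turn(turn)
    baseline_final = baseline.end_turn(turn)
```

In this toy run the isolated cache still holds both entries it kept from the first turn, while the re-compressing baseline has discarded one of them by the third turn. The isolated cache is also larger, a trade-off this sketch does not attempt to model.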
Emotional Support Conversations (ESC) are crucial for providing empathy, validation, and actionable guidance to individuals in distress. However, existing definitions of the ESC task oversimplify the structure of supportive responses, typically modelling them as single strategy-utterance pairs. Through a detailed corpus analysis of the ESConv dataset, we identify a common yet previously overlooked phenomenon: emotional supporters often employ multiple strategies consecutively within a single turn. We formally redefine the ESC task to account for this, proposing a revised formulation that requires generating the full sequence of strategy-utterance pairs given a dialogue history. To facilitate this refined task, we introduce several modelling approaches, including supervised deep learning models and large language models. Our experiments show that, under this redefined task, state-of-the-art LLMs outperform both supervised models and human supporters. Notably, contrary to some earlier findings, we observe that LLMs frequently ask questions and provide suggestions, demonstrating more holistic support capabilities.
https://arxiv.org/abs/2505.15316
Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with varied architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
https://arxiv.org/abs/2505.15297
Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Its covert nature and the complexity of manipulation strategies make it challenging to detect, even for state-of-the-art large language models (LLMs). This concealment also hinders the manual collection of large-scale, high-quality annotations essential for training effective models. Although recent efforts have sought to improve LLMs' performance on this task, progress remains limited due to the scarcity of real-world annotated datasets. To address these challenges, we propose MentalMAC, a multi-task anti-curriculum distillation method that enhances LLMs' ability to detect mental manipulation in multi-turn dialogue. Our approach includes: (i) EvoSA, an unsupervised data expansion method based on evolutionary operations and speech act theory; (ii) teacher-model-generated multi-task supervision; and (iii) progressive knowledge distillation from complex to simpler tasks. We then constructed the ReaMent dataset with 5,000 real-world dialogue samples, using a MentalMAC-distilled model to assist human annotation. Extensive experiments demonstrate that our method significantly narrows the gap between student and teacher models and outperforms competitive LLMs across key evaluation metrics. All code, datasets, and checkpoints will be released upon paper acceptance. Warning: This paper contains content that may be offensive to readers.
https://arxiv.org/abs/2505.15255
Can small language models with 0.5B to 5B parameters meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We address this question by introducing TIDE, a dataset of 10,000 two-turn dialogues spanning 500 diverse PTSD client personas and grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. We evaluate eight small language models before and after fine-tuning, comparing their outputs to a frontier model (Claude Sonnet 3.5). Our IRB-approved human evaluation and automatic metrics show that fine-tuning generally improves perceived empathy, but gains are highly scenario- and user-dependent, with smaller models facing an empathy ceiling. Demographic analysis shows older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight the limitations of automatic metrics and the need for context- and user-aware system design. Our findings, along with the planned release of TIDE, provide a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.
https://arxiv.org/abs/2505.15065
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
https://arxiv.org/abs/2505.14654
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
https://arxiv.org/abs/2505.14356
This study investigates the interaction between personality traits and emotional expression, exploring how personality information can improve speech emotion recognition (SER). We collected personality annotations for the IEMOCAP dataset, and statistical analysis identified significant correlations between personality traits and emotional expressions. To extract fine-grained personality features, we propose a temporal interaction condition network (TICN), in which personality features are integrated with HuBERT-based acoustic features for SER. Experiments show that incorporating ground-truth personality traits significantly enhances valence recognition, improving the concordance correlation coefficient (CCC) from 0.698 to 0.785 compared to the baseline without personality information. For practical applications in dialogue systems where personality information about the user is unavailable, we develop a front-end module for automatic personality recognition. Using these automatically predicted traits as inputs to our proposed TICN model, we achieve a CCC of 0.776 for valence recognition, representing an 11.17% relative improvement over the baseline. These findings confirm the effectiveness of personality-aware SER and provide a solid foundation for further exploration in personality-aware speech processing applications.
https://arxiv.org/abs/2505.13978
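For readers unfamiliar with the metric, the sketch below computes the concordance correlation coefficient using its standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), and reproduces the relative-improvement arithmetic from the reported CCC values. The function names and the toy inputs are illustrative only; the paper's evaluation code is not shown in the abstract.

```python
def ccc(x, y):
    """Concordance correlation coefficient with population (1/n) moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def relative_improvement(baseline, new):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Perfect agreement gives 1; a constant offset is penalised even though
# the Pearson correlation of the shifted pair is still 1.
perfect = ccc([1, 2, 3], [1, 2, 3])
shifted = ccc([1, 2, 3], [2, 3, 4])

# Reported values: baseline CCC 0.698, predicted-personality TICN 0.776.
gain_percent = round(relative_improvement(0.698, 0.776) * 100, 2)
```

The last line recovers the 11.17% figure quoted in the abstract from the two reported CCC values.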