Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advances in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
https://arxiv.org/abs/2502.08301
With the advancement of large language models (LLMs), the focus in Conversational AI has shifted from merely generating coherent and relevant responses to tackling more complex challenges, such as personalizing dialogue systems. In an effort to enhance user engagement, chatbots are often designed to mimic human behaviour, responding within a defined emotional spectrum and aligning with a set of values. In this paper, we aim to simulate personal traits according to the Big Five model using LLMs. Our research showed that generating personality-related texts is still a challenging task for the models. As a result, we present a dataset of generated texts with predefined Big Five characteristics and provide an analytical framework for testing LLMs on the simulation of personality skills.
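For illustration, here is a minimal sketch of how trait-conditioned generation could be prompted; the prompt wording, the low/high trait scale, and the example topic are assumptions for this sketch, not details taken from the paper.

```python
# Hypothetical prompt builder for Big Five-conditioned text generation (illustrative only).
BIG_FIVE = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]

def build_personality_prompt(trait_levels: dict, topic: str) -> str:
    """trait_levels maps each Big Five trait to 'low' or 'high'."""
    description = ", ".join(f"{level} {trait}" for trait, level in trait_levels.items())
    return (
        f"Write a short first-person text about {topic}. "
        f"The writer's personality is: {description}. "
        "Do not name the traits explicitly; express them through style and content."
    )

# Example: high extraversion, everything else low.
prompt = build_personality_prompt(
    {t: ("high" if t == "extraversion" else "low") for t in BIG_FIVE},
    topic="a weekend trip",
)
print(prompt)
```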
https://arxiv.org/abs/2502.08265
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection, but also has unique innovative aspects: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset is fully verified, and the first multi-task dialogue benchmark in the SAR field is established. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR visual language models.
https://arxiv.org/abs/2502.08168
Achieving a delicate balance between fostering trust in law enforcement and protecting the rights of both officers and civilians continues to emerge as a pressing research and product challenge in the world today. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at this https URL Y-kpCHNO/view?usp=sharing
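A minimal sketch of the drafting idea follows, assuming a simple turn-level heuristic in place of the paper's LLM-based extractor; the field names and keyword cues are illustrative, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReportDraft:
    incident_type: str = "unspecified"
    persons_involved: list = field(default_factory=list)
    officer_actions: list = field(default_factory=list)
    statements: list = field(default_factory=list)

def draft_from_dialogue(turns):
    """turns: list of (role, utterance) pairs; a real system would use an LLM extractor."""
    draft = ReportDraft()
    for role, text in turns:
        if role == "officer" and text.lower().startswith(("i am going to", "i'm going to")):
            draft.officer_actions.append(text)
        elif role != "officer":
            draft.statements.append(f"{role}: {text}")
            if role not in draft.persons_involved:
                draft.persons_involved.append(role)
    return draft

turns = [
    ("officer", "I'm going to ask you a few questions about the incident."),
    ("witness", "I saw the car run the red light around 9 pm."),
]
print(draft_from_dialogue(turns))
```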
https://arxiv.org/abs/2502.07677
Chatbots based on large language models offer cheap conversation practice opportunities for language learners. However, they are hard to control for linguistic forms that correspond to learners' current needs, such as grammar. We control grammar in chatbot conversation practice by grounding a dialogue response generation model in a pedagogical repository of grammar skills. We also explore how this control helps learners to produce specific grammar. We comprehensively evaluate prompting, fine-tuning, and decoding strategies for grammar-controlled dialogue response generation. With strategic decoding, Llama3 outperforms GPT-3.5 when minor response quality losses are tolerated. Our simulation predicts that grammar-controlled responses support grammar acquisition adapted to learner proficiency. Existing language learning chatbots and research on second language acquisition benefit from these affordances. Code available on GitHub.
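As a rough illustration of decoding-time grammar control, the sketch below reranks candidate responses by whether they surface a target grammar form; the regex detectors stand in for the paper's pedagogical grammar-skill repository and are assumptions, not its actual implementation.

```python
import re

# Toy stand-ins for a pedagogical grammar-skill repository: each skill maps to a
# shallow surface detector; a real system would rely on a parser or trained tagger.
GRAMMAR_DETECTORS = {
    "past_perfect": re.compile(r"\bhad\s+\w+(?:ed|en)\b", re.IGNORECASE),
    "second_conditional": re.compile(r"\bif\b.*\bwould\b", re.IGNORECASE | re.DOTALL),
}

def rerank_by_grammar(candidates, target_skill, quality_scores):
    """Prefer candidates that realize the target form; break ties by generation score."""
    detector = GRAMMAR_DETECTORS[target_skill]
    scored = [(bool(detector.search(c)), q, c) for c, q in zip(candidates, quality_scores)]
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)  # grammar hit first, then score
    return [c for _, _, c in scored]

candidates = [
    "If I had more money, I would travel more often.",
    "I like travelling a lot.",
]
print(rerank_by_grammar(candidates, "second_conditional", [0.7, 0.9])[0])
```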
https://arxiv.org/abs/2502.07544
Accurate and efficient diagnosis in online medical consultations remains a challenge for current large language models. These models often rely on single-turn interactions and lack the ability to refine their predictions through follow-up questions. Additionally, their responses frequently contain complex medical terminology, making them less accessible to non-medical users and creating barriers to effective communication. In this paper, we introduce Ask Patients with Patience (APP), the first multi-turn dialogue framework that enables LLMs to iteratively refine diagnoses based on grounded reasoning. By integrating medical guidelines and entropy minimization, APP improves both diagnostic accuracy and efficiency. Furthermore, it features human-centric communication that bridges the gap between user comprehension and medical terminology, significantly enhancing user accessibility and engagement. We evaluated APP using a subset of the ReMeDi dataset, comparing it with single-turn and traditional multi-turn LLM baselines. APP achieved higher similarity scores in diagnosis predictions, demonstrating better alignment with ground-truth diagnoses. Entropy analysis showed that APP reduces diagnostic uncertainty more rapidly across iterations, increasing confidence in its predictions. APP also excels in user accessibility and empathy, further bridging the gap between complex medical language and user understanding. Code will be released at: this https URL.
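A minimal sketch of the entropy-minimization idea, assuming the diagnosis is tracked as a probability distribution updated Bayesian-style after each answer; the candidate diagnoses, likelihoods, and question set below are toy values, not APP's actual components.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def update(prior, likelihood):
    """Bayesian update of the diagnosis distribution given an answer's likelihoods."""
    posterior = {d: prior[d] * likelihood.get(d, 1e-6) for d in prior}
    z = sum(posterior.values())
    return {d: p / z for d, p in posterior.items()}

def expected_entropy(prior, question):
    """question maps each possible answer to (answer probability, per-diagnosis likelihood)."""
    return sum(p_ans * entropy(update(prior, lik)) for p_ans, lik in question.values())

prior = {"flu": 0.4, "covid": 0.4, "allergy": 0.2}
questions = {
    "Do you have a fever?": {
        "yes": (0.6, {"flu": 0.9, "covid": 0.8, "allergy": 0.1}),
        "no": (0.4, {"flu": 0.1, "covid": 0.2, "allergy": 0.9}),
    },
    "Are you sneezing a lot?": {
        "yes": (0.5, {"flu": 0.5, "covid": 0.4, "allergy": 0.9}),
        "no": (0.5, {"flu": 0.5, "covid": 0.6, "allergy": 0.1}),
    },
}
# Ask the question whose answer is expected to shrink diagnostic uncertainty the most.
best = min(questions, key=lambda q: expected_entropy(prior, questions[q]))
print("Ask next:", best, "| current entropy:", round(entropy(prior), 3))
```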
https://arxiv.org/abs/2502.07143
This paper presents HamRaz, a novel Persian-language mental health dataset designed for Person-Centered Therapy (PCT) using Large Language Models (LLMs). Despite the growing application of LLMs in AI-driven psychological counseling, existing datasets predominantly focus on Western and East Asian contexts, overlooking cultural and linguistic nuances essential for effective Persian-language therapy. To address this gap, HamRaz combines script-based dialogues with adaptive LLM role-playing, ensuring coherent and dynamic therapy interactions. We also introduce HamRazEval, a dual evaluation framework that measures conversational quality and therapeutic effectiveness using General Dialogue Metrics and the Barrett-Lennard Relationship Inventory (BLRI). Experimental results show HamRaz outperforms conventional Script Mode and Two-Agent Mode, producing more empathetic, context-aware, and realistic therapy sessions. By releasing HamRaz, we contribute a culturally adapted, LLM-driven resource to advance AI-powered psychotherapy research in diverse communities.
https://arxiv.org/abs/2502.05982
Understanding temporal dynamics is critical for conversational agents, enabling effective content analysis and informed decision-making. However, time-aware datasets, particularly for persona-grounded conversations, are still limited, which narrows their scope and diminishes their complexity. To address this gap, we introduce MTPChat, a multimodal, time-aware persona dialogue dataset that integrates linguistic, visual, and temporal elements within dialogue and persona memory. Leveraging MTPChat, we propose two time-sensitive tasks: Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), both designed to assess a model's ability to understand implicit temporal cues and dynamic interactions. Additionally, we present an innovative framework featuring an adaptive temporal module to effectively integrate multimodal streams and capture temporal dependencies. Experimental results validate the challenges posed by MTPChat and demonstrate the effectiveness of our framework in multimodal time-sensitive scenarios.
https://arxiv.org/abs/2502.05887
Goal-oriented visual dialogue involves multi-round interaction between artificial agents and has attracted considerable attention due to its wide range of applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of the questions affects the accuracy and efficiency of the target search process. However, existing methods lack a clear strategy to guide question generation, resulting in randomness in the search process and non-convergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE), which guides question generation by excluding half of the current candidate objects in each round. This process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward that encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method enables the agents to achieve high task-oriented accuracy with fewer repeated questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE helps agents generate higher-quality questions.
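A minimal sketch of the two rewards described above; the tolerance threshold and the example candidate counts are assumptions for illustration, not TSADE's exact formulation.

```python
def binary_reward(num_before: int, num_after: int, tolerance: float = 0.1) -> float:
    """1.0 if roughly half of the remaining candidate objects were excluded, else 0.0."""
    if num_before == 0:
        return 0.0
    excluded_fraction = (num_before - num_after) / num_before
    return 1.0 if abs(excluded_fraction - 0.5) <= tolerance else 0.0

def candidate_minimization_reward(num_after: int, total: int) -> float:
    """Encourage a small remaining candidate set toward the end of the dialogue."""
    return 1.0 - num_after / total

# Example: 10 candidates; the answer to one question leaves 5 of them plausible.
print(binary_reward(10, 5))                  # 1.0: an even split
print(candidate_minimization_reward(2, 10))  # 0.8: close to convergence
```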
https://arxiv.org/abs/2502.05806
The increasing demand for mental health services has led to the rise of AI-driven mental health chatbots, though challenges related to privacy, data collection, and expertise persist. Motivational Interviewing (MI) is gaining attention as a theoretical basis for boosting expertise in the development of these chatbots. However, existing datasets show limitations for training chatbots, creating substantial demand for publicly available resources in the field of MI and psychotherapy. These challenges are even more pronounced in non-English languages, where they receive less attention. In this paper, we propose a novel framework that simulates MI sessions enriched with the expertise of professional therapists. We train an MI forecaster model that mimics the behavioral choices of professional therapists and employ Large Language Models (LLMs) to generate utterances through prompt engineering. We then present KMI, the first synthetic dataset theoretically grounded in MI, containing 1,000 high-quality Korean Motivational Interviewing dialogues. Through an extensive expert evaluation of the generated dataset and the dialogue model trained on it, we demonstrate the quality, expertise, and practicality of KMI. We also introduce novel metrics derived from MI theory in order to evaluate dialogues from the perspective of MI.
https://arxiv.org/abs/2502.05651
Nuclear fusion is one of the most promising ways for humans to obtain infinite energy. With the rapid development of artificial intelligence, nuclear fusion research has also entered a critical period of its development. Helping more people understand nuclear fusion and join its research is an effective way to accelerate the realization of fusion. This paper proposes the first large model in the field of nuclear fusion, XiHeFusion, which is obtained through supervised fine-tuning based on the open-source large model Qwen2.5-14B. We collected multi-source knowledge about nuclear fusion to support the training of this model, including Common Crawl, eBooks, arXiv, dissertations, etc. After the model had mastered the knowledge of the nuclear fusion field, we further used chain-of-thought reasoning to enhance its logical reasoning ability, enabling XiHeFusion to provide more accurate and logical answers. In addition, we propose a test questionnaire containing 180+ questions to assess the conversational ability of this science-popularization large model. Extensive experimental results show that our nuclear fusion dialogue model, XiHeFusion, performs well in answering science-popularization questions. The pre-trained XiHeFusion model is released at this https URL.
https://arxiv.org/abs/2502.05615
To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval-augmented response generation by constructing memory banks from conversation history at the turn level, at the session level, or through summarization techniques. In this paper, we present two key findings: (1) The granularity of the memory unit matters: turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content. (2) Prompt compression methods, such as LLMLingua-2, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities. Building on these insights, we propose SeCom, a method that constructs a memory bank of topical segments by introducing a conversation Segmentation model, while performing memory retrieval based on Compressed memory units. Experimental results show that SeCom outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.
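A minimal sketch of the segment-then-compress-then-retrieve pipeline; the cue-based segmenter and stopword filter are naive stand-ins for the paper's segmentation model and for a prompt compressor such as LLMLingua-2, and the TF-IDF retriever is likewise an assumption for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def segment(turns, boundary_markers=("anyway", "by the way")):
    """Split a conversation into topical segments at (assumed) topic-shift cues."""
    segments, current = [], []
    for turn in turns:
        if current and any(turn.lower().startswith(m) for m in boundary_markers):
            segments.append(" ".join(current))
            current = []
        current.append(turn)
    if current:
        segments.append(" ".join(current))
    return segments

def compress(text, drop=("the", "a", "an", "really", "just")):
    """Toy denoising step; a real pipeline would call a prompt compressor here."""
    return " ".join(w for w in text.split() if w.lower() not in drop)

def retrieve(memory_bank, query, top_k=1):
    vec = TfidfVectorizer().fit(memory_bank + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(memory_bank))[0]
    return [memory_bank[i] for i in sims.argsort()[::-1][:top_k]]

turns = [
    "I just adopted a dog named Biscuit.",
    "He loves the park near my office.",
    "By the way, my sister is visiting next week.",
]
memory = [compress(s) for s in segment(turns)]
print(retrieve(memory, "What is the user's dog named?"))
```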
https://arxiv.org/abs/2502.05589
Stochastic embedding transitions introduce a probabilistic mechanism for adjusting token representations dynamically during inference, mitigating the constraints imposed through static or deterministic embeddings. A transition framework was proposed in which each token embedding evolved through probabilistic updates, ensuring adaptability while preserving semantic integrity across linguistic contexts. Empirical evaluations demonstrated that models incorporating stochastic transitions exhibited greater lexical diversity, improved generative coherence, and enhanced retention of low-frequency vocabulary, contributing to more varied sentence structures and reduced reliance on high-probability token selections. Statistical analyses of embedding drift across transformer layers indicated that representations evolved more flexibly without losing coherence, supporting the hypothesis that controlled stochasticity facilitated context-sensitive representation learning. Experimental results revealed that probabilistic embeddings introduced minor computational overhead while maintaining generative efficiency, reinforcing their feasibility in large-scale applications. A comparative study with traditional embedding approaches highlighted measurable gains in text completion accuracy, dialogue coherence, and structural complexity, confirming the effectiveness of stochastic transitions in enhancing representation expressiveness. Clustering patterns in the embedding space suggested that probabilistic updates preserved meaningful semantic groupings while enabling context-driven shifts, further validating the stability of the transition mechanism. Performance metrics indicated that stochastic transitions balanced adaptability and control, ensuring that generative outputs remained linguistically coherent without excessive randomness.
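A minimal sketch of one way a stochastic embedding transition could be realized: a learnable-scale Gaussian perturbation applied at embedding lookup. The parameterization and the point where noise enters are assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class StochasticEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int, init_log_sigma: float = -3.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # One learned noise scale per embedding dimension (log-parameterized for positivity).
        self.log_sigma = nn.Parameter(torch.full((dim,), init_log_sigma))

    def forward(self, token_ids: torch.Tensor, stochastic: bool = True) -> torch.Tensor:
        base = self.embed(token_ids)
        if not stochastic:
            return base
        # Probabilistic update: perturb each token representation with scaled Gaussian noise.
        noise = torch.randn_like(base) * self.log_sigma.exp()
        return base + noise

layer = StochasticEmbedding(vocab_size=32000, dim=64)
ids = torch.randint(0, 32000, (2, 8))             # a batch of 2 sequences, 8 tokens each
out_a, out_b = layer(ids), layer(ids)
print(out_a.shape, torch.allclose(out_a, out_b))  # same shape, different samples
```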
https://arxiv.org/abs/2502.05553
Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, its focus was primarily on text-based methods, often overlooking the non-verbal evidence that is crucial in real-life therapy. To address this gap, we extend textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions guide the interpretation of implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
https://arxiv.org/abs/2502.06873
This paper focuses on simulating text dialogues in which impressions between speakers improve during speed dating. This simulation involves selecting an utterance from multiple candidates generated by a text generation model that replicates a specific speaker's utterances, aiming to improve the impression of the speaker. Accurately selecting an utterance that improves the impression is crucial for the simulation. We believe that whether an utterance improves a dialogue partner's impression of the speaker may depend on the personalities of both parties. However, recent methods for utterance selection consider neither the impression per utterance nor the personalities. To address this, we propose a method that predicts whether an utterance improves a partner's impression of the speaker, taking the personalities into account. The evaluation results showed that personalities are useful in predicting impression changes per utterance. Furthermore, we conducted a human evaluation of dialogues simulated using our method. The results showed that it could simulate dialogues that were more favorably received than those selected without considering personalities.
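A minimal sketch of personality-aware utterance selection: a classifier takes both parties' Big Five scores plus a crude utterance feature, and the highest-scoring candidate is chosen. The features, the synthetic training data, and the logistic model are assumptions for illustration, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def features(speaker_big5, partner_big5, utterance):
    # 5 + 5 personality scores plus a crude positivity cue for the utterance.
    positivity = sum(w in utterance.lower() for w in ("love", "great", "fun"))
    return np.array(list(speaker_big5) + list(partner_big5) + [positivity], dtype=float)

# Toy training set: 200 random feature vectors with a noisy "impression improved" label.
X = rng.normal(size=(200, 11))
y = (X[:, 2] + X[:, 7] + X[:, 10] + rng.normal(scale=0.5, size=200) > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y)

speaker = [0.6, 0.5, 0.8, 0.7, 0.3]  # Big Five scores in [0, 1]
partner = [0.4, 0.6, 0.2, 0.8, 0.5]
candidates = ["That sounds great, I love hiking too!", "I see.", "Hiking is fun."]
scores = [clf.predict_proba([features(speaker, partner, c)])[0, 1] for c in candidates]
print(candidates[int(np.argmax(scores))])
```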
https://arxiv.org/abs/2502.04706
Large language model (LLM)-based agents have recently shown impressive progress in a variety of domains, including open-ended conversation and multi-step decision-making. However, applying these agents to social deduction games such as Werewolf, which requires both strategic decision-making and free-form language interaction, remains non-trivial. Traditional methods based on Counterfactual Regret Minimization (CFR) or reinforcement learning (RL) typically depend on a predefined action space, making them unsuitable for language games with unconstrained text action space. Meanwhile, pure LLM-based agents often suffer from intrinsic biases and require prohibitively large datasets for fine-tuning. We propose Latent Space Policy Optimization (LSPO), an iterative framework that addresses these challenges by first mapping free-form text to a discrete latent space, where methods like CFR and RL can learn strategic policy more effectively. We then translate the learned policy back into natural language dialogues, which are used to fine-tune an LLM via Direct Preference Optimization (DPO). By iteratively alternating between these stages, our LSPO agent progressively enhances both strategic reasoning and language communication. Experiment results on the Werewolf game show that our method improves the agent's performance in each iteration and outperforms existing Werewolf agents, underscoring its promise for free-form language decision-making.
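A minimal sketch of the first LSPO step, mapping free-form text into a discrete latent action space; TF-IDF plus k-means stands in here for whatever encoder and discretization the authors actually use, so the details are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "I think Player 3 is lying about being the Seer.",
    "Player 3 sounds suspicious to me as well.",
    "I am just a villager, please do not vote for me.",
    "Trust me, I checked Player 5 and they are a werewolf.",
]

# Stand-in embeddings; a real pipeline would use a sentence encoder.
embeddings = TfidfVectorizer().fit_transform(utterances)
latent = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# Each utterance now has a discrete latent action ID that CFR/RL can operate on.
for action_id, text in zip(latent.labels_, utterances):
    print(action_id, text)
```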
https://arxiv.org/abs/2502.04686
Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, compare their performance with that on native accents, and discover a performance degradation of more than 10%. Additionally, we explore the medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.
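A minimal sketch of how such a degradation could be measured, using the jiwer package to compute word error rate (WER) on toy transcripts; the reference and hypothesis strings are invented examples, not Afrispeech-Dialog data.

```python
import jiwer

reference = "the patient reports chest pain that started two days ago"
hyp_native = "the patient reports chest pain that started two days ago"
hyp_accented = "the patient report chest pain that started today is ago"

wer_native = jiwer.wer(reference, hyp_native)
wer_accented = jiwer.wer(reference, hyp_accented)
print(f"native WER={wer_native:.2%}, accented WER={wer_accented:.2%}, "
      f"absolute degradation={wer_accented - wer_native:.2%}")
```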
https://arxiv.org/abs/2502.03945
The current research on Role-Playing Conversational Agents (RPCAs) with Large Language Models (LLMs) primarily focuses on imitating specific speaking styles and utilizing character backgrounds, neglecting the depiction of deeper personality traits. In this study, we introduce personality-infused role-playing for LLM agents, which encourages agents to accurately portray their designated personality traits during dialogues. We then propose PsyPlay, a dialogue generation framework that facilitates the expression of rich personalities among multiple LLM agents. Specifically, PsyPlay enables agents to assume roles with distinct personality traits and engage in discussions centered around specific topics, consistently exhibiting their designated personality traits throughout the interactions. Validation on generated dialogue data demonstrates that PsyPlay can accurately portray the intended personality traits, achieving an overall success rate of 80.31% on GPT-3.5. Notably, we observe that LLMs aligned with positive values are more successful in portraying positive personality roles compared to negative ones. Moreover, we construct a dialogue corpus for personality-infused role-playing, called PsyPlay-Bench. The corpus, which consists of 4745 instances of correctly portrayed dialogues using PsyPlay, aims to further facilitate research in personalized role-playing and dialogue personality detection.
https://arxiv.org/abs/2502.03821
Alzheimer's Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Low-cost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline and can be used to discriminate AD patients from normal control individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and the variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched, label-preserved data generation. Our study presents four novelties: (1) we harness the summarizing capabilities of LLMs to identify and distill key cognitive-linguistic information from noisy speech transcripts, effectively filtering out irrelevant information; (2) we leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts; (3) we exploit the compositional ability of LLMs to generate AD speech transcripts with diverse linguistic patterns, overcoming the speech data scarcity challenge and enhancing the robustness of AD detection models; and (4) we use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. The results show that DECT achieves superior model performance, with an 11% improvement in AD detection accuracy over the baselines on the DementiaBank datasets.
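A minimal sketch of the first step (distilling cognitive-linguistic information from a noisy transcript) framed as a summarization-style prompt; the prompt wording and the marker list are assumptions, not DECT's actual prompts.

```python
# Illustrative marker list; the study's actual cognitive-linguistic features may differ.
MARKERS = ["word-finding pauses", "repetitions", "vague referents", "topic drift"]

def build_distillation_prompt(transcript: str) -> str:
    marker_list = "; ".join(MARKERS)
    return (
        "You are analyzing a patient-interviewer transcript for signs of cognitive decline.\n"
        f"Transcript:\n{transcript}\n\n"
        f"Summarize only the cognitively relevant language behavior ({marker_list}), "
        "ignoring interviewer small talk and transcription noise."
    )

print(build_distillation_prompt(
    "INT: how was your morning  PAT: oh the um the thing with the uh coffee"
))
```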
https://arxiv.org/abs/2502.04394
This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. Employing a within-subjects experiment design with four randomised conditions, this study utilises scripted medical consultations to simulate dialogue interpreting tasks. It involves four trainee interpreters with a language combination of Chinese and English. It also gathers participants' experience and perceptions of ASR support through cued retrospective reports and semi-structured interviews. Preliminary data suggest that the availability of ASR, specifically the access to full ASR transcripts and to ChatGPT-generated summaries based on ASR, effectively improved interpreting quality. Varying types of ASR output had different impacts on the distribution of interpreting error types. Participants reported similar interactive experiences with the technology, expressing their preference for full ASR transcripts. This pilot study shows encouraging results of applying ASR to dialogue-based healthcare interpreting and offers insights into the optimal ways to present ASR output to enhance interpreter experience and performance. However, it should be emphasised that the main purpose of this study was to validate the methodology and that further research with a larger sample size is necessary to confirm these findings.
https://arxiv.org/abs/2502.03381