Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the advanced reasoning required for complex clinical scenarios, such as differential diagnosis or personalized treatment suggestions. We proposed FineMedLM-o1, which leverages high-quality synthetic medical data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduced Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also proposed a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.
近期,大型语言模型(LLM)在医学应用中的进展显示出其在疾病诊断和治疗规划方面的潜力。然而,大多数现有的医疗LLM在处理需要高级推理的复杂临床场景(如鉴别诊断或个性化治疗建议)时显得力不从心。为此,我们提出了FineMedLM-o1,该模型利用高质量的合成医疗数据和长篇推理数据进行监督微调(SFT)和直接偏好优化(DPO),从而具备高级对话能力和深度推理能力。此外,我们首次在医学领域引入了测试时训练(TTT),以促进领域适应并确保推理可靠、准确。实验结果显示,FineMedLM-o1在关键医疗基准上比此前的模型平均性能提高了23%。此外,引入TTT还带来了额外14%的性能提升,突显了其在增强医学推理能力方面的有效性。为了支持这一过程,我们还提出了一种合成医学对话的新方法。与其他开源数据集相比,我们的数据集在质量和复杂性方面都更胜一筹。该项目和相关数据将在GitHub上发布。
https://arxiv.org/abs/2501.09213
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
轮流发言是对话的基本方面,但当前的人机交互(HRI)系统通常依赖于基于静默的简单模型,这导致了不自然的停顿和打断。本文首次研究了通用轮流发言模型的应用,特别是TurnGPT和语音活动预测(VAP),以改善人机交互中的对话动态。这些模型通过自监督学习目标在人与人之间的对话数据上进行训练,不需要特定领域的微调。我们提出了将这两种模型结合使用的方法,用于预测机器人何时应开始准备回应、何时接过话轮以及如何处理可能的打断。我们在一项被试内研究中将所提出的系统与传统基线系统进行了对比评估:39名成年人在对话情境中与Furhat机器人交流,系统结合大型语言模型自主生成回应。结果表明,参与者显著更偏好我们提出的系统,并且该系统显著减少了回应延迟和打断。
https://arxiv.org/abs/2501.08946
Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.
抑郁症和焦虑症广泛存在,需要及时识别和管理。近年来,大型语言模型(LLMs)的进步为解决这些问题提供了潜在的解决方案,但高昂的成本以及对训练数据的伦理顾虑仍然是挑战。本文介绍了一种用于合成临床访谈的数据生成流水线,产生了1,157个互动对话(PsyInterview),并提出了EmoScan,一种基于LLM的情感障碍筛查系统。 EmoScan能够区分粗粒度(例如焦虑或抑郁障碍)和细粒度(例如重度抑郁症)的疾病,并进行高质量的访谈。评估结果显示,EmoScan在筛选情感障碍方面的性能超过了基础模型和其他大型语言模型如GPT-4(F1分数=0.7467),同时提供了更优的质量解释(BERTScore = 0.9408),并在外部数据集上表现出强大的泛化能力(F1分数为0.67)。此外,EmoScan在访谈技能方面也超越了基准模型,并通过自动化评分和人类评估验证了这一点。 这项工作强调了开发有效的心理健康LLM工具时构建可扩展的数据生成流水线的重要性。
https://arxiv.org/abs/2501.08769
Optimization models have been applied to solve a wide variety of decision-making problems. These models are usually developed by optimization experts but are used by practitioners without optimization expertise in various application domains. As a result, practitioners often struggle to interact with and draw useful conclusions from optimization models independently. To fill this gap, we introduce OptiChat, a natural language dialogue system designed to help practitioners interpret model formulation, diagnose infeasibility, analyze sensitivity, retrieve information, evaluate modifications, and provide counterfactual explanations. By augmenting large language models (LLMs) with function calls and code generation tailored for optimization models, we enable seamless interaction and minimize the risk of hallucinations in OptiChat. We develop a new dataset to evaluate OptiChat's performance in explaining optimization models. Experiments demonstrate that OptiChat effectively bridges the gap between optimization models and practitioners, delivering autonomous, accurate, and instant responses.
优化模型已经被应用于解决各种决策问题。这些模型通常由优化专家开发,但在不同应用领域的实践者却不一定具备相关专业知识。因此,许多没有优化背景的实践者常常难以独立地与这些优化模型互动并从中得出有用的结论。为了填补这一空白,我们引入了OptiChat,这是一个自然语言对话系统,旨在帮助实践者解释模型结构、诊断不可行性、分析敏感度、检索信息、评估修改及提供反事实解释。通过为大型语言模型(LLMs)添加针对优化模型的功能调用和代码生成能力,我们可以实现流畅的互动并尽量减少OptiChat中的幻觉风险。我们还开发了一个新的数据集来评估OptiChat在解释优化模型方面的表现。实验表明,OptiChat有效地弥合了优化模型与实践者之间的差距,能够提供自主、准确且即时的回答。
https://arxiv.org/abs/2501.08406
Dialogue is at the core of human behaviour and being able to identify the topic at hand is crucial to take part in conversation. Yet, there are few accounts of the topical organisation in casual dialogue and of how people recognise the current topic in the literature. Moreover, analysing topics in dialogue requires conversations long enough to contain several topics and types of topic shifts. Such data is complicated to collect and annotate. In this paper we present a dialogue collection experiment which aims to build a corpus suitable for topical analysis. We will carry out the collection with a messaging tool we developed.
对话是人类行为的核心,能够识别当前话题对于参与对话至关重要。然而,在文献中关于非正式对话中的主题组织以及人们如何识别当前话题的描述并不多见。此外,分析对话中的主题需要足够长的对话以包含多个主题和类型的话题转换。此类数据的收集与标注较为复杂。在本文中,我们介绍了一个对话收集实验,旨在建立一个适合进行主题分析的语料库。我们将使用我们开发的一款消息工具来进行收集工作。
https://arxiv.org/abs/2501.07947
The integration of dialogue interfaces in mobile devices has become ubiquitous, providing a wide array of services. As technology progresses, humanoid robots designed with human-like features to interact effectively with people are gaining prominence, and the use of advanced human-robot dialogue interfaces is continually expanding. In this context, emotion recognition plays a crucial role in enhancing human-robot interaction by enabling robots to understand human intentions. This research proposes a facial emotion detection interface integrated into a mobile humanoid robot, capable of displaying real-time emotions from multiple individuals on a user interface. To this end, various deep neural network models for facial expression recognition were developed and evaluated under consistent computer-based conditions, yielding promising results. Afterwards, a trade-off between accuracy and memory footprint was carefully considered to effectively implement this application on a mobile humanoid robot.
在移动设备中集成对话界面已经变得非常普遍,提供了各种各样的服务。随着技术的进步,设计得像人类一样并与人有效互动的类人机器人越来越受到重视,高级的人机对话接口的应用范围也在不断扩大。在这种背景下,情感识别在增强人机交互方面发挥着关键作用,使机器人能够理解人的意图。这项研究提出了一种面部表情检测界面,该界面被整合到移动类人机器人中,并能够在用户界面上实时显示来自多个人的情感。 为了实现这一目标,在一致的计算机环境下开发并评估了多种深度神经网络模型以进行面部表情识别,并取得了令人鼓舞的结果。之后,仔细权衡了准确性和内存占用之间的平衡,以便有效地将此应用程序实施到移动类人机器人中。
https://arxiv.org/abs/2501.07213
Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
多模态大型语言模型(MLLMs)在图像和区域级别的遥感(RS)图像理解任务中取得了显著的成功,例如图像描述、视觉问答和视觉指代。然而,现有的遥感MLLM缺乏像素级对话能力,即根据用户的指令生成特定实例的分割掩码进行响应的能力。为此,在本文中我们提出了GeoPix,这是一种扩展了遥感图像理解至像素级别能力的MLLM。通过为模型装备一个掩码预测器来实现这一点,该预测器能够将视觉特征从视觉编码器转换成基于大型语言模型(LLM)分割令牌嵌入条件下的掩码。 为了支持RS图像中多尺度对象的分割,在掩码预测器中集成了按类别可学习的记忆模块,以捕获并存储整个数据集中每个实例级别的地物上下文。此外,由于缺乏训练像素级别遥感MLLM的大规模数据集,我们构建了GeoPixInstruct数据集,该数据集包含65,463张图像和140,412个实例,并为每个实例提供了文本描述、边界框以及掩码的标注信息。 为了平衡多模态多任务优化中文字生成与掩码预测的不同需求,我们还开发了一种两阶段训练策略。大量的实验验证了GeoPix在像素级分割任务中的有效性和优越性,同时保持了图像和区域级别基准测试中的竞争力。
https://arxiv.org/abs/2501.06828
Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval -- a prompt-based method leveraging LLMs -- across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.
基于大型语言模型(LLM)的对话系统进步迅速,然而可靠评估指标的发展却落后于这一进展,特别是在处理多样化和创新性响应方面。我们提出了一种基准测试方法,用于评估无参考点对话度量在四种对抗攻击类别下的鲁棒性:说话者标签前缀、静态响应、不规范语法响应以及重复的对话上下文。通过分析如DialogRPT、UniEval和PromptEval(一种基于提示的方法,利用LLM)等度量标准在有根据数据集和无根据数据集上的表现,我们探讨了这些度量与人类判断的相关性及其对抗攻击的敏感程度。研究发现这两个维度并不总是同步;即那些传统基准下看似相同的度量,在评估对抗性响应时得分可能会有所不同。这一发现推动了开发精细的评价框架以应对现实世界对话挑战的需求。
https://arxiv.org/abs/2501.06728
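The four attack categories named in the abstract above (arXiv 2501.06728) can be illustrated with a small sketch. It only shows the general idea of turning a normal response into an adversarial one; the function names, the canned reply, the speaker tag, and the word-reversal stand-in for "ungrammatical" are all assumptions made for illustration, not the perturbations used in the benchmark.

```python
# Hypothetical perturbations, one per attack category named in the abstract.
# None of these reproduce the benchmark's actual implementation.

def speaker_tag_prefix(response, tag="Speaker A: "):
    """Prepend a speaker tag to the response (assumed tag format)."""
    return tag + response


def static_response(canned="I don't know."):
    """Return the same canned reply regardless of context (assumed reply)."""
    return canned


def ungrammatical(response):
    """Stand-in for an ungrammatical response: reverse the word order."""
    return " ".join(reversed(response.split()))


def repeat_context(context_turns):
    """Copy the most recent context turn back as the response."""
    return context_turns[-1]


def build_adversarial_set(context_turns, response):
    """Collect one adversarial variant per attack category."""
    return {
        "speaker_tag_prefix": speaker_tag_prefix(response),
        "static_response": static_response(),
        "ungrammatical": ungrammatical(response),
        "repeated_context": repeat_context(context_turns),
    }


variants = build_adversarial_set(["Hi there", "How are you?"], "I am fine")
```

A robust reference-free metric would be expected to score each of these variants lower than the original response; comparing the two scores per category is one simple way to measure susceptibility.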
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in both semantics and style, we first build a stored dialogue semantic-style database (SDSSD) that includes text and audio samples. We then design a multi-attribute retrieval scheme that matches the dialogue semantic and style vectors of the CD against the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we adopt a multi-granularity graph structure to encode the dialogue and introduce a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted comprehensive, in-depth experiments on the DailyTalk dataset, a benchmark dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: this https URL.
对话式语音合成(CSS)旨在以当前对话(CD)历史为参考,合成与对话风格相匹配的富有表现力的语音。与CD不同,存储对话(SD)包含了用户与代理早期交互阶段保存下来的对话片段,其中包含与CD中类似情境相关的风格表达知识。这种知识对于使代理能够合成富有表现力、能产生共情反馈的对话语音具有重要作用。然而,先前的研究忽视了这一点。为了解决这一问题,我们提出了一种新颖的用于表现力CSS的检索增强型对话知识聚合方案,称为RADKA-CSS,它包括三个主要组成部分:1)为了有效检索SD中与CD在语义和风格上相似的对话,我们首先构建了一个存储对话语义-风格数据库(SDSSD),其中包含文本和音频样本;然后设计了一种多属性检索方案,将CD的对话语义和风格向量与SDSSD中存储对话的语义和风格向量进行匹配,从而检索出最相似的对话。2)为了有效利用来自CD和SD的风格知识,我们采用多粒度图结构对对话进行编码,并引入一种多源风格知识聚合机制。3)最后,将聚合后的风格知识输入语音合成器,帮助代理合成与对话风格相匹配的富有表现力的语音。我们在DailyTalk数据集上进行了全面而深入的实验,该数据集是CSS任务的基准数据集。客观评价和主观评价均表明,RADKA-CSS在表现力渲染方面优于基线模型。代码和音频样本可在以下链接找到:this https URL。
https://arxiv.org/abs/2501.06467
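The multi-attribute retrieval step described in the RADKA-CSS abstract above (arXiv 2501.06467) matches semantic and style vectors of the current dialogue against a stored database. A minimal sketch of that idea follows. The weighted sum of two cosine similarities, the `alpha` weight, and the dictionary layout of the database entries are assumptions chosen for illustration; the abstract does not specify how the two attributes are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_sem, query_style, database, top_k=1, alpha=0.5):
    """Rank stored dialogues by a weighted mix of semantic and style similarity.

    `database` is a list of dicts with hypothetical keys "id", "semantic", "style".
    `alpha` weights semantic similarity; (1 - alpha) weights style similarity.
    """
    scored = []
    for entry in database:
        score = (alpha * cosine(query_sem, entry["semantic"])
                 + (1.0 - alpha) * cosine(query_style, entry["style"]))
        scored.append((score, entry["id"]))
    # Highest combined similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry_id for _, entry_id in scored[:top_k]]


stored = [
    {"id": "a", "semantic": [1.0, 0.0], "style": [1.0, 0.0]},
    {"id": "b", "semantic": [0.0, 1.0], "style": [0.0, 1.0]},
]
best = retrieve([1.0, 0.0], [1.0, 0.0], stored)
```

Keeping the two similarities separate until the final weighted sum makes it easy to experiment with how much style should count relative to semantics.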
General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated 0.8 to 1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
通用的自动语音识别(ASR)系统在目标导向对话中并不总是表现良好。现有的ASR校正方法依赖于用户的先验数据或命名实体。我们扩展了这种校正,使之适用于没有先验用户数据且表现出语言灵活性的任务,如词汇和句法变化。我们提出了一种新颖的上下文增强方法,利用大型语言模型,并结合目标导向对话AI及其任务中的上下文信息进行排名策略。我们的方法包括: 1. 根据与上下文的词义和语义相似性对n-best ASR假设进行排序; 2. 根据与ASR假设的音素对应关系对上下文进行排序。 在家居改善和烹饪领域的真实世界用户测试中,我们的方法使校正召回率提高了34%,F1值提高16%,同时保持了精度和假阳性率不变。当我们的校正方法正常工作时,用户的评分平均提高了0.8到1分(满分5分),且没有因假阳性而导致评分下降。
https://arxiv.org/abs/2501.06129
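The first ranking step in the ASR-correction abstract above (arXiv 2501.06129) orders n-best hypotheses by their similarity to dialogue context. The sketch below covers only a lexical variant of that step, using Jaccard overlap of word sets as the similarity measure. That choice, the function names, and the example phrases are assumptions for illustration; the semantic and phonetic components described in the abstract are not modeled here.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_hypotheses(nbest, context_phrases):
    """Sort ASR hypotheses by their best lexical match against any context phrase."""
    def score(hypothesis):
        return max((jaccard(hypothesis, c) for c in context_phrases), default=0.0)
    # sorted() is stable, so ties keep the recognizer's original order.
    return sorted(nbest, key=score, reverse=True)


nbest = ["at the flower", "add the flour"]
context = ["add the flour and mix"]
ranked = rank_hypotheses(nbest, context)
```

In this toy case the second hypothesis shares three words with the task step while the first shares only one, so the context-aware ranking promotes it above the recognizer's top choice.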
Problem-solving therapy (PST) is a structured psychological approach that helps individuals manage stress and resolve personal issues by guiding them through problem identification, solution brainstorming, decision-making, and outcome evaluation. As mental health care increasingly integrates technologies like chatbots and large language models (LLMs), understanding how PST can be effectively automated is important. This study leverages anonymized therapy transcripts to analyze and classify therapeutic interventions using various LLMs and transformer-based models. Our results show that GPT-4o achieved the highest accuracy (0.76) in identifying PST strategies, outperforming other models. Additionally, we introduced a new dimension of communication strategies that enhances the current PST framework, offering deeper insights into therapist-client interactions. This research demonstrates the potential of LLMs to automate complex therapeutic dialogue analysis, providing a scalable, efficient tool for mental health interventions. Our annotation framework can enhance the accessibility, effectiveness, and personalization of PST, supporting therapists in real-time with more precise, targeted interventions.
问题解决疗法(PST)是一种结构化的心理治疗方法,通过引导个人识别问题、头脑风暴解决方案、做出决策和评估结果来帮助他们管理压力并解决个人问题。随着心理健康护理越来越多地融入聊天机器人和大型语言模型(LLM)等技术,了解如何将PST有效地自动化变得十分重要。本研究利用匿名化的治疗对话记录,使用多种LLM和基于Transformer的模型对治疗干预进行分析和分类。我们的结果显示,GPT-4o在识别PST策略方面准确率最高(0.76),优于其他模型。此外,我们还引入了一个新的沟通策略维度,以增强现有的PST框架,并为治疗师与来访者之间的互动提供更深层次的见解。这项研究展示了LLM自动化复杂治疗对话分析的潜力,为心理健康干预提供了可扩展且高效的工具。我们的标注框架可以提高PST的可及性、有效性和个性化程度,实时支持治疗师进行更加精确、更有针对性的干预。
https://arxiv.org/abs/2501.06101
The rise of misinformation and fake news in online political discourse poses significant challenges to democratic processes and public engagement. While debunking efforts aim to counteract misinformation and foster fact-based dialogue, these discussions often involve language toxicity and emotional polarization. We examined over 86 million debunking tweets and more than 4 million Reddit debunking comments to investigate the relationship between language toxicity, pessimism, and social polarization in debunking efforts. Focusing on discussions of the 2016 and 2020 U.S. presidential elections and the QAnon conspiracy theory, our analysis reveals three key findings: (1) peripheral participants (1-degree users) play a disproportionate role in shaping toxic discourse, driven by lower community accountability and emotional expression; (2) platform mechanisms significantly influence polarization, with Twitter amplifying partisan differences and Reddit fostering higher overall toxicity due to its structured, community-driven interactions; and (3) a negative correlation exists between language toxicity and pessimism, with increased interaction reducing toxicity, especially on Reddit. We show that platform architecture affects informational complexity of user interactions, with Twitter promoting concentrated, uniform discourse and Reddit encouraging diverse, complex communication. Our findings highlight the importance of user engagement patterns, platform dynamics, and emotional expressions in shaping polarization in debunking discourse. This study offers insights for policymakers and platform designers to mitigate harmful effects and promote healthier online discussions, with implications for understanding misinformation, hate speech, and political polarization in digital environments.
在线政治讨论中虚假信息和假新闻的兴起对民主进程和公众参与提出了重大挑战。尽管辟谣努力旨在对抗虚假信息并促进基于事实的对话,但这些讨论往往涉及语言毒性以及情感极化。我们分析了超过8600万条辟谣推特和400多万条Reddit辟谣评论,以探讨语言毒性、悲观情绪和社会极化之间的关系在辟谣行动中的影响。 我们的研究聚焦于2016年和2020年的美国总统选举以及QAnon阴谋论的讨论中,揭示了三个关键发现: 1. 周边参与者(一级用户)在塑造有毒对话中发挥着不成比例的作用,这主要是由于较低的社区责任感和情感表达驱动。 2. 平台机制显著影响极化现象,Twitter放大了党派之间的分歧,而Reddit则因为其结构化的、由社区推动的互动模式导致整体毒性更高。 3. 语言毒性和悲观情绪之间存在负相关性,随着相互作用增加,毒性减少,尤其是在Reddit平台上更为明显。我们展示了平台架构如何影响用户交互的信息复杂度,Twitter促进了集中和统一的对话,而Reddit则鼓励多样且复杂的沟通方式。 我们的研究结果强调了用户参与模式、平台动态以及情感表达在塑造辟谣讨论中的极化现象方面的重要性。这项研究为政策制定者和平台设计人员提供了见解,帮助他们减轻有害影响并促进更健康的在线讨论,同时也对理解数字环境下的虚假信息、仇恨言论及政治极化的机制具有重要意义。
https://arxiv.org/abs/2501.06274
Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involves separate, sequential components, as well as the use of text as an intermediate representation. This paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: this https URL
最近在大型语言模型(LLMs)方面取得的进展,显著推动了基于文本的对话系统的进步。这些系统现在能够生成高质量的回答,在广泛的话题和任务上表现出准确性和连贯性。然而,语音对话系统在自然度方面仍然落后。它们倾向于产生机械化的互动,存在诸如响应时间过长、回复过于泛化或谨慎以及缺乏自然节奏和流畅话轮转换等问题。这一不足主要是由于过度依赖传统的级联设计,这种设计涉及独立且顺序运行的组件,并使用文本作为中间表示形式。本文提出了一种实时无文本语音对话生成模型(RTTL-DG),旨在克服这些挑战。我们的系统通过直接处理流式语音对话,实现流畅的话轮转换并以极小的延迟生成回复。此外,该模型还融入了反馈语(backchannel)、填充词、笑声及其他副语言信号,这些元素在级联对话系统中通常缺失,从而使互动更加自然、更接近人类。实现代码和生成样本可在我们的代码库中获取:this https URL
https://arxiv.org/abs/2501.04877
Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.
最近,由大型语言模型驱动的机器人在对话能力方面取得了显著进步,使其互动更加接近人类对话。然而,这些模型给人机交互(HRI)带来了安全与安保方面的隐患,因为它们容易受到操纵,从而绕过内置的安全措施。设想一个部署在家庭环境中的社交机器人,本研究旨在理解日常用户如何尝试利用语言模型违反伦理原则,例如通过提示让机器人扮演人生伴侣的角色。我们进行了一项试点研究,21名大学生与Misty机器人互动,在三个基于具体HRI伦理原则(依恋、自由和同理心)的场景下试图规避其安全机制。我们的结果显示,参与者使用了五种手法,包括侮辱以及用情感化语言博取同情。我们希望这项工作能够为未来的研究提供参考,以便设计强有力的防护措施,确保人机交互合乎伦理且安全。
https://arxiv.org/abs/2501.04633
Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
近期,跨模态学习在图像、文本和语音的理解与生成方面取得了进展,尽管这些成果主要体现在专有模型中。受限的跨模态数据集以及实时情感语音生成所固有的挑战阻碍了开源社区的进步。为解决这些问题,我们提出了一个名为openomni的方法,这是一种两阶段训练方法,结合了跨模态对齐和语音生成技术,旨在开发出最先进的跨模态大语言模型。 在第一阶段的对齐过程中,预先训练好的语音模型进一步接受文本-图像任务的训练,从而能够在视觉与语音之间(近乎)零样本地推广,超越基于三模态数据集训练出来的模型性能。在第二阶段的语音生成过程中,一个轻量级解码器通过语音任务和偏好学习来促进实时情感语音的产生。 实验结果显示,openomni方法在跨模态、视觉-语言以及语音-语言评估中均持续表现出色,能够支持自然且充满情感的对话,并实现高质量的实时情感语音生成。
https://arxiv.org/abs/2501.04561
Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex and truly non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and thereby providing extensive capabilities, including the ability to reason; engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems (embodiment, symbol grounding, causality and memory) must be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically-plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the-art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.
基于大规模预训练基础模型(PFMs)的生成式人工智能系统,例如视觉-语言模型、大型语言模型(LLM)、扩散模型和视觉-语言-行动(VLA)模型,已在各种领域和情境中展示了解决复杂且真正非平凡的人工智能问题的能力。特别是多模态大型语言模型(MLLM),它们从海量且多样化的数据源中学习,能够形成对世界丰富而细腻的表示,并因此具备广泛的能力,包括推理、进行有意义的对话、与人类和其他代理合作共同解决复杂问题,以及理解人类的社会和情感方面。尽管这一成就令人印象深刻,但在大规模数据集上训练的最先进LLM的认知能力仍然浅薄且脆弱。因此,通用型LLM在其通才能力方面受到严重限制。要使LLM达到人类水平的通用智能,需要解决一些基础性问题,包括具身性、符号接地、因果关系和记忆。这些概念与人类认知更加契合,并为LLM提供固有的类人认知属性,支持实现物理上合理、语义上有意义、灵活且更具泛化性的知识和智能。在这项工作中,我们讨论了上述基础问题,并综述了在LLM中实现这些概念的最新方法。具体来说,我们讨论如何利用具身性、符号接地、因果关系和记忆的原则,以有机的方式迈向通用人工智能(AGI)。
https://arxiv.org/abs/2501.03151
In mental health counseling, a variety of earlier studies have focused on dialogue modeling. However, most of these studies give little to no emphasis on the quality of interaction between a patient and a therapist. The therapeutic bond between a patient and a therapist directly correlates with effective mental health counseling. It involves developing the patient's trust in the therapist over the course of counseling. To assess the therapeutic bond in counseling, we introduce trust as a therapist-assistive metric. Our definition of trust involves patients' willingness and openness to express themselves and, consequently, receive better care. We conceptualize it as a dynamic trajectory observable through textual interactions during counseling. To facilitate trust modeling, we present MENTAL-TRUST, a novel counseling dataset comprising manual annotation of 212 counseling sessions with seven expert-verified ordinal trust levels, the first of their kind. We frame the problem as an ordinal classification task for trust quantification and propose a new benchmark, TrustBench, comprising a suite of classical and state-of-the-art language models on MENTAL-TRUST. We evaluate the performance across a suite of metrics and lay out an exhaustive set of findings. Our study aims to unfold how trust evolves in therapeutic interactions.
在心理健康咨询领域,已有多项早期研究关注对话建模。然而,其中大多数研究对患者与治疗师之间互动质量的关注有限,甚至完全没有关注。患者和治疗师之间的治疗联结与心理咨询的有效性直接相关,它涉及在咨询过程中逐步建立患者对治疗师的信任。为了评估咨询中的治疗联结,我们引入信任作为辅助治疗师的指标。我们对信任的定义涉及患者表达自我的意愿和开放程度,以及由此获得更好的照护。我们将信任概念化为一种可以通过咨询过程中的文本互动观察到的动态轨迹。为了促进信任建模,我们提出了MENTAL-TRUST,这是一个新颖的心理咨询数据集,包含对212个咨询会话的人工标注,并首次引入了七个经专家验证的有序信任等级。我们将该问题表述为用于信任量化的有序分类任务,并提出了一项新的基准TrustBench,它包含在MENTAL-TRUST上评测的一系列经典和最先进的语言模型。我们通过一系列指标评估这些模型的性能,并给出了详尽的研究发现。我们的研究旨在揭示信任在治疗性互动中如何演变。
https://arxiv.org/abs/2501.03064
Modern conversational agents, including anime-themed chatbots, are frequently reactive and personality-driven but fail to capture the dynamic nature of human interactions. We propose an event-driven dialogue framework to address these limitations by embedding dynamic events in conversation prompts and fine-tuning models on character-specific data. Evaluations on GPT-4 and comparisons with industry-leading baselines demonstrate that event-driven prompts significantly improve conversational engagement and naturalness while reducing hallucinations. This paper explores the application of this approach in creating lifelike chatbot interactions within the context of Honkai: Star Rail, showcasing the potential for dynamic event-based systems to transform role-playing and interactive dialogue.
现代对话代理,包括动漫主题聊天机器人,往往是反应式的、由人设驱动的,但未能捕捉到人类互动的动态特性。我们提出了一种事件驱动的对话框架来解决这些局限性,该框架在对话提示中嵌入动态事件,并在特定角色的数据上对模型进行微调。在GPT-4上的评估以及与行业领先基线的比较表明,事件驱动的提示显著提高了对话的参与度和自然度,同时减少了幻觉。本文探讨了这种方法在《Honkai: Star Rail》背景下创建逼真聊天机器人互动的应用,展示了基于动态事件的系统在变革角色扮演和交互式对话方面的潜力。
https://arxiv.org/abs/2501.03277
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
最近的多模态大型语言模型(MLLM)主要集中在视觉和文本模式的集成上,较少关注语音在增强交互中的作用。然而,在多模态对话系统中,语音扮演着至关重要的角色,并且同时实现高水准的视觉和语音任务仍然是一个重大挑战,原因在于基本的模态差异。在这篇论文中,我们提出了一种精心设计的多阶段训练方法论,该方法逐步训练大语言模型以理解和处理视觉及语音信息,最终使流畅的视听交互成为可能。我们的方法不仅保持了强大的视觉-文本能力,并且无需单独的ASR(自动语音识别)和TTS(文本到语音转换)模块就能实现高效的语音对话功能,显著加速了多模态端到端响应速度。通过在图像、视频和语音任务的基准测试上将我们的方法与最先进的模型进行比较,我们展示了我们的模型具有强大的视觉和语音能力,实现了接近实时的视听交互。
https://arxiv.org/abs/2501.01957
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at this https URL.
由大型语言模型(LLM)驱动的社交代理能够模拟人类社会行为,但在处理复杂的目标导向型社交对话方面仍表现不足。直接偏好优化(DPO)已被证明在各种代理任务中都能有效地将LLM的行为与人类偏好对齐。现有的面向多轮交互的基于DPO的方法可分为轮次级别和会话级别两类。轮次级别的方法过于细粒度,只关注单个轮次;而会话级别的方法则过于粗粒度,往往引入训练噪声。为了解决这些局限性,我们提出了片段级直接偏好优化(SDPO),该方法聚焦于交互中特定的关键片段,以优化多轮代理行为,同时尽量减少训练噪声。在SOTOPIA基准上的评估表明,经SDPO调优的代理始终优于现有的基于DPO的方法以及GPT-4o等专有LLM,这凸显了SDPO在提升基于LLM的代理的社交智能方面的潜力。我们的代码和数据发布于:this https URL。
https://arxiv.org/abs/2501.01821
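The SDPO abstract above (arXiv 2501.01821) applies preference optimization to key segments rather than single turns or whole sessions. A rough sketch of that idea follows: the standard DPO objective is evaluated on log-probabilities summed over a chosen span of turns. The DPO formula itself is the well-known one; treating a segment as a contiguous slice of per-turn log-probabilities, the helper names, and the example numbers are assumptions for illustration and may differ from the paper's actual formulation.

```python
import math


def segment_logprob(turn_logprobs, start, end):
    """Sum per-turn log-probabilities over the half-open turn range [start, end)."""
    return sum(turn_logprobs[start:end])


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Hypothetical per-turn log-probs for a preferred and a dispreferred session.
policy_good, ref_good = [-1.0, -0.5, -0.5, -2.0], [-1.0, -1.0, -1.0, -2.0]
policy_bad, ref_bad = [-1.0, -1.5, -1.5, -2.0], [-1.0, -1.0, -1.0, -2.0]

# Assumed key segment: turns 1 and 2 only.
loss = dpo_loss(
    segment_logprob(policy_good, 1, 3),
    segment_logprob(policy_bad, 1, 3),
    segment_logprob(ref_good, 1, 3),
    segment_logprob(ref_bad, 1, 3),
)
```

Restricting the sums to the segment means turns outside it contribute nothing to the margin, which is one way to read the abstract's claim about reducing training noise relative to session-level optimization.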