Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.
https://arxiv.org/abs/2404.12318
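One simple way such a source-language reward model can be applied zero-shot to a target language is best-of-n reranking; the sketch below illustrates that idea under stated assumptions (best-of-n is an illustrative stand-in for the paper's alignment procedure, and `toy_reward` is a hypothetical scorer, not a trained model).

```python
# Minimal sketch of zero-shot cross-lingual best-of-n reranking, assuming the
# reward model trained on source-language preference data is exposed as a plain
# scoring function. The toy_reward stand-in below is illustrative only.

def best_of_n(candidates, reward_fn):
    """Pick the candidate the (source-language) reward model scores highest."""
    return max(candidates, key=reward_fn)

# Hypothetical reward model: a toy stand-in that prefers longer summaries.
def toy_reward(text):
    return len(text.split())

german_candidates = [
    "Kurze Zusammenfassung.",
    "Eine etwas ausfuehrlichere Zusammenfassung des Artikels.",
]
chosen = best_of_n(german_candidates, toy_reward)
```

The point of the sketch is that nothing in the selection step depends on the candidates' language, which is what makes the cross-lingual transfer possible.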
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, these responses are generated at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence while preserving simplicity, without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at this https URL.
https://arxiv.org/abs/2404.11999
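The per-token forward KL constraint can be sketched as follows, under simplified assumptions (a tiny two-token vocabulary with toy probabilities); this shows only the token-level penalty TDPO adds, not the full objective.

```python
import math

# Sketch of the token-level idea: instead of one sequence-level KL term, a
# forward KL between the reference and policy distributions is accumulated at
# every token position. All distributions below are illustrative toy values.

def forward_kl(ref_dist, pi_dist):
    """KL(ref || pi) for one token position over a small vocabulary."""
    return sum(r * math.log(r / p) for r, p in zip(ref_dist, pi_dist) if r > 0)

def tokenwise_kl_penalty(ref_dists, pi_dists):
    """Sum of per-token forward KL terms along a response."""
    return sum(forward_kl(r, p) for r, p in zip(ref_dists, pi_dists))

ref = [[0.5, 0.5], [0.9, 0.1]]
pi  = [[0.5, 0.5], [0.6, 0.4]]
penalty = tokenwise_kl_penalty(ref, pi)  # zero at the first token, positive at the second
```

Penalizing each position separately is what lets the method regulate divergence locally rather than only over the whole answer.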
Automated dialogue systems are important applications of artificial intelligence, and traditional systems struggle to understand user emotions and provide empathetic feedback. This study integrates emotional intelligence technology into automated dialogue systems and creates a dialogue generation model with emotional intelligence through deep learning and natural language processing techniques. The model can detect and understand a wide range of emotions and specific pain signals in real time, enabling the system to provide empathetic interaction. By integrating the results of the study "Can artificial intelligence detect pain and express pain empathy?", the model's ability to understand the subtle elements of pain empathy has been enhanced, setting higher standards for emotional intelligence dialogue systems. The project aims to provide theoretical understanding and practical suggestions to integrate advanced emotional intelligence capabilities into dialogue systems, thereby improving user experience and interaction quality.
https://arxiv.org/abs/2404.11447
The advent of deep learning models has made a considerable contribution to the achievement of Emotion Recognition in Conversation (ERC). However, this task still remains an important challenge due to the plurality and subjectivity of human emotions. Previous work on ERC provides predictive models using mostly graph-based conversation representations. In this work, we propose a way to model the conversational context that we incorporate into a metric learning training strategy, with a two-step process. This allows us to perform ERC in a flexible classification scenario and to end up with a lightweight yet efficient model. Using metric learning through a Siamese Network architecture, we achieve a macro F1 score of 57.71 for emotion classification in conversation on the DailyDialog dataset, which outperforms related work. This state-of-the-art result is promising regarding the use of metric learning for emotion recognition, yet perfectible compared to the micro F1 score obtained.
https://arxiv.org/abs/2404.11141
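At inference time, metric-learning classification of this kind reduces to nearest-reference assignment in the learned embedding space; a minimal sketch, assuming toy 2-d embeddings and hypothetical class reference points rather than the paper's trained Siamese network:

```python
import math

# Illustrative sketch of metric-learning classification: an utterance
# embedding gets the label of the nearest class reference embedding.
# The vectors and emotion prototypes below are toy assumptions.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_class(embedding, prototypes):
    """prototypes: dict mapping emotion label -> reference embedding."""
    return min(prototypes, key=lambda label: euclidean(embedding, prototypes[label]))

protos = {"joy": (1.0, 0.0), "anger": (-1.0, 0.0), "neutral": (0.0, 0.0)}
label = nearest_class((0.8, 0.1), protos)
```

Because only distances are compared, the label set can change without retraining the encoder, which is what makes the classification scenario flexible.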
Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by posing diverse, in-depth, and insightful instructions that deepen interactions. Existing methods take instructions from real instruction dialogues as a learning goal and fine-tune a user simulator for posing instructions. However, the user simulator struggles to implicitly model complex dialogue flows and pose high-quality instructions. In this paper, we take inspiration from the cognitive abilities inherent in human learning and propose the explicit modeling of complex dialogue flows through instructional strategy reuse. Specifically, we first induce high-level strategies from various real instruction dialogues. These strategies are then applied deductively to new dialogue scenarios, where they facilitate high-quality instructions. Experimental results show that our method can generate diverse, in-depth, and insightful instructions for a given dialogue history. The constructed multi-turn instructional dialogues outperform competitive baselines on the downstream chat model.
https://arxiv.org/abs/2404.11095
Content moderation faces a challenging task as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles with the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework of weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a novel method for cross-platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.
https://arxiv.org/abs/2404.11036
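The confidence-based reweighting step can be illustrated with a minimal sketch; the thresholding rule and the confidence values below are illustrative assumptions, not the framework's actual weighting scheme.

```python
# Sketch of confidence-based reweighting for weak supervision: samples whose
# pseudo-labels the model is less confident about contribute less (here, not
# at all) to the loss. The floor value of 0.5 is a hypothetical choice.

def reweight(confidences, floor=0.5):
    """Zero out low-confidence samples; keep the rest weighted by confidence."""
    return [c if c >= floor else 0.0 for c in confidences]

weights = reweight([0.9, 0.4, 0.7])
```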
A major barrier towards the practical deployment of large language models (LLMs) is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety. In all three cases, models should ideally abstain from responding, much like humans, whose ability to understand uncertainty makes them refrain from answering questions they don't know. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question-answering. We investigate two kinds of uncertainty: statistical uncertainty metrics and a distinct verbalized measure, termed In-Dialogue Uncertainty (InDU). Using these uncertainty measures combined with models with and without Reinforcement Learning from Human Feedback (RLHF), we show that in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By sacrificing only a few highly uncertain samples, we can improve correctness by 2% to 8%, avoid 50% of hallucinations by correctly identifying unanswerable questions, and increase safety by 70% up to 99%, with almost no additional computational overhead.
https://arxiv.org/abs/2404.10960
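A minimal sketch of abstention on a statistical uncertainty measure, assuming the measure is the entropy of the model's answer distribution; the threshold and the probability values are illustrative, and this is not the paper's InDU measure.

```python
import math

# Sketch of uncertainty-based abstention: answer only when a statistical
# uncertainty measure (here, entropy of the answer distribution) is below a
# threshold; otherwise abstain, as a human would on a question they don't know.

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_or_abstain(probs, answers, threshold=0.5):
    """Return the argmax answer if the model is confident enough, else None."""
    if entropy(probs) > threshold:
        return None  # abstain
    return answers[max(range(len(probs)), key=probs.__getitem__)]

confident = answer_or_abstain([0.95, 0.03, 0.02], ["Paris", "Rome", "Lyon"])
uncertain = answer_or_abstain([0.4, 0.35, 0.25], ["Paris", "Rome", "Lyon"])
```

Raising the threshold trades off coverage against reliability, which is the mechanism behind sacrificing a few highly uncertain samples for large gains.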
Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" verify each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark, LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
https://arxiv.org/abs/2404.10774
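The check-each-fact loop can be sketched as follows; `entails()` below is a toy substring stand-in for the trained checker, and the document and facts are made-up examples, not data from the paper.

```python
# Sketch of claim-level fact-checking in the spirit of the approach above:
# split a claim into atomic facts and verify each one against the evidence.
# entails() is a toy substring check standing in for a trained entailment model.

def entails(evidence, fact):
    return fact.lower() in evidence.lower()

def check_claim(evidence, facts):
    """A claim is grounded only if every one of its atomic facts is supported."""
    return all(entails(evidence, f) for f in facts)

doc = "The 770M-parameter model was trained on synthetic data from GPT-4."
grounded = check_claim(doc, ["770M-parameter model", "synthetic data"])
ungrounded = check_claim(doc, ["770M-parameter model", "human-labeled data"])
```

Collapsing this loop into a single small-model call per (document, claim) pair is where the reported cost reduction comes from.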
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and then apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we comprehensively examine PPO and reveal the key factors for its best performance in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a comprehensive collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
https://arxiv.org/abs/2404.10719
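For reference, the reward-free DPO objective discussed above reduces to a logistic loss on implicit reward margins; a minimal sketch with toy log-probabilities (the numbers are illustrative, not from any experiment):

```python
import math

# Minimal sketch of the per-pair DPO loss: -log sigmoid(beta * margin), where
# the margin compares how much more the policy prefers the chosen response (w)
# over the rejected one (l) than the frozen reference model does.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))

# When the policy prefers the chosen response more strongly than the reference
# does, the loss drops below -log(0.5); toy log-likelihoods below.
loss = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
```

Note there is no learned reward model or critic here, which is exactly the property the paper's DPO-vs-PPO comparison probes.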
Open conversations are one of the most engaging forms of teaching. However, creating those conversations in educational software is a complex endeavor, especially if we want to address the needs of different audiences. While language models hold great promise for educational applications, there are substantial challenges in training them to engage in meaningful and effective conversational teaching, especially when considering the diverse needs of various audiences. No official datasets exist to facilitate the training of language models for this task. This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels (from preschooler to expert), namely dialogues taken from video transcripts. We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate and natural responses to scientific topics for specific target audiences. It is a freely available and valuable resource for training and evaluating conversation models, encompassing organically occurring dialogues. While the raw data is available online, we provide additional metadata for conversational analysis of the dialogues at each level in all available videos.
https://arxiv.org/abs/2404.10475
Health coaching helps patients achieve personalized and lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial for low-socioeconomic-status populations, yet cost-prohibitive, due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of goals, and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform the previous state of the art while eliminating the need for a predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work, and a metric to measure the unconventionality of the patient's response based on data difficulty, facilitating potential coach alerts during deployment.
https://arxiv.org/abs/2404.10268
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. We further propose to use large language models (LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator's performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.
https://arxiv.org/abs/2404.09980
A key requirement in developing Generative Language Models (GLMs) is to have their values aligned with human values. Preference-based alignment is a widely used paradigm for this purpose, in which preferences over generation pairs are first elicited from human annotators or AI systems and then fed into some alignment technique, e.g., Direct Preference Optimization. However, a substantial percentage (20-40%) of the preference pairs used in GLM alignment are noisy, and it remains unclear how the noise affects the alignment performance and how to mitigate its negative impact. In this paper, we propose a framework to inject desirable amounts and types of noise into the preferences, and systematically study the impact of preference noise on alignment performance in two tasks (summarization and dialogue generation). We find that alignment performance can be highly sensitive to the noise rates in the preference data: e.g., a 10-percentage-point (pp) increase in the noise rate can lead to a 30 pp drop in alignment performance (in win rate). To mitigate the impact of noise, confidence-based data filtering shows significant benefit when certain types of noise are present. We hope our work can help the community better understand and mitigate the impact of preference noise in GLM alignment.
https://arxiv.org/abs/2404.09824
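The simplest form of controlled noise injection is random label flipping at a chosen rate; a minimal sketch (the pair contents are made up, and the paper also studies other noise types beyond uniform flips):

```python
import random

# Sketch of preference-noise injection: flip the chosen/rejected label of each
# preference pair with probability noise_rate. Seeded for reproducibility.

def inject_label_noise(pairs, noise_rate, seed=0):
    """pairs: list of (chosen, rejected) tuples; returns a noisy copy."""
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < noise_rate:
            noisy.append((rejected, chosen))  # swapped: a noisy label
        else:
            noisy.append((chosen, rejected))
    return noisy

clean = [("good summary", "bad summary")] * 10
noisy = inject_label_noise(clean, noise_rate=0.3)
```

Sweeping `noise_rate` and re-running alignment is how sensitivity curves like the 10 pp noise to 30 pp win-rate drop can be measured.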
Compositional generalization is the ability of a model to generalize to complex, previously unseen combinations of entities from having seen only the primitives. This type of generalization is particularly relevant to the semantic parsing community for applications such as task-oriented dialogue, text-to-SQL parsing, and information retrieval, as they can harbor infinite complexity. Despite the success of large language models (LLMs) in a wide range of NLP tasks, unlocking perfect compositional generalization remains one of the last unsolved frontiers. The past few years have seen a surge of interest in works that explore the limitations of, methods to improve, and evaluation metrics for the compositional generalization capabilities of LLMs on semantic parsing tasks. In this work, we present a literature survey geared at synthesizing recent advances in analysis, methods, and evaluation schemes to offer a starting point for both practitioners and researchers in this area.
https://arxiv.org/abs/2404.13074
Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limits students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing a Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.
https://arxiv.org/abs/2404.13066
Health coaching helps patients identify and accomplish lifestyle-related goals, effectively improving the control of chronic diseases and mitigating mental health conditions. However, health coaching is cost-prohibitive due to its highly personalized and labor-intensive nature. In this paper, we propose to build a dialogue system that converses with the patients, helps them create and accomplish specific goals, and can address their emotions with empathy. However, building such a system is challenging since real-world health coaching datasets are limited and empathy is subtle. Thus, we propose a modularized health coaching dialogue system with simplified NLU and NLG frameworks combined with mechanism-conditioned empathetic response generation. Through automatic and human evaluation, we show that our system generates more empathetic, fluent, and coherent responses and outperforms the state-of-the-art in NLU tasks while requiring less annotation. We view our approach as a key step towards building automated and more accessible health coaching systems.
https://arxiv.org/abs/2404.08888
The conversational search task aims to enable a user to resolve information needs via natural language dialogue with an agent. In this paper, we aim to develop a conceptual framework of the actions and intents of users and agents explaining how these actions enable the user to explore the search space and resolve their information need. We outline the different actions and intents, before discussing key decision points in the conversation where the agent needs to decide how to steer the conversational search process to a successful and/or satisfactory conclusion. Essentially, this paper provides a conceptualization of the conversational search process between an agent and user, which provides a framework and a starting point for research, development and evaluation of conversational search agents.
https://arxiv.org/abs/2404.08630
Zero-shot dialogue state tracking (DST) transfers knowledge to unseen domains, reducing the cost of annotating new datasets. Previous zero-shot DST models mainly suffer from domain-transfer and partial-prediction problems. To address these challenges, we propose Mixture of Prefix Experts (MoPE) to establish connections between similar slots in different domains, which strengthens the model's transfer performance in unseen domains. Empirical results demonstrate that MoPE-DST achieves a joint goal accuracy of 57.13% on MultiWOZ2.1 and 55.40% on SGD.
https://arxiv.org/abs/2404.08559
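The idea of connecting similar slots across domains can be illustrated as a routing step: a slot from an unseen domain is handled by the expert anchored to its most similar seen slot. The similarity function and expert/slot names below are hypothetical, not the paper's learned routing.

```python
# Illustrative sketch of routing a slot to a prefix expert by similarity to
# seen slots. Similarity here is a toy token-overlap (Jaccard) score; the
# expert anchors and slot names are made-up examples.

def overlap(a, b):
    ta, tb = set(a.split("-")), set(b.split("-"))
    return len(ta & tb) / len(ta | tb)

def route(slot, expert_anchors):
    """expert_anchors: dict mapping expert id -> representative seen slot."""
    return max(expert_anchors, key=lambda e: overlap(slot, expert_anchors[e]))

anchors = {"time_expert": "restaurant-book-time", "area_expert": "hotel-area"}
expert = route("train-leave-time", anchors)
```

Sharing an expert across similar slots is what lets knowledge learned on seen domains transfer to a slot the model has never been trained on.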
Detecting dialogue breakdown in real time is critical for conversational AI systems, because it enables taking corrective action to successfully complete a task. In spoken dialog systems, this breakdown can be caused by a variety of unexpected situations, including high levels of background noise causing STT mistranscriptions, or unexpected user flows. In particular, industry settings like healthcare require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes it both more challenging and more critical to accurately detect dialogue breakdown. We found that accurate detection requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.
https://arxiv.org/abs/2404.08156
Current Conversational AI systems employ different machine learning pipelines, as well as external knowledge sources and business logic, to predict the next action. Maintaining the various components in a dialogue manager's pipeline adds complexity to expansion and updates, increases processing time, and introduces additive noise through the pipeline that can lead to incorrect next-action prediction. This paper investigates integrating graphs into language transformers to improve understanding of the relationships between humans' utterances and previous and next actions, without depending on external sources or components. Experimental analyses on real calls indicate that the proposed Graph Integrated Language Transformer models can achieve higher performance than other production-level conversational AI systems in driving interactive calls with human users in real-world settings.
https://arxiv.org/abs/2404.08155