Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by posing diverse, in-depth, and insightful instructions that deepen interactions. Existing methods treat instructions from real instruction dialogues as a learning target and fine-tune a user simulator for posing instructions. However, the user simulator struggles to implicitly model complex dialogue flows and to pose high-quality instructions. In this paper, we take inspiration from the cognitive abilities inherent in human learning and propose the explicit modeling of complex dialogue flows through instructional strategy reuse. Specifically, we first induce high-level strategies from various real instruction dialogues. These strategies are then applied deductively to new dialogue scenarios, where they facilitate high-quality instructions. Experimental results show that our method can generate diverse, in-depth, and insightful instructions for a given dialogue history, and that the constructed multi-turn instructional dialogues outperform competitive baselines on the downstream chat model.
https://arxiv.org/abs/2404.11095
Content moderation faces a challenging task, as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles given the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework for weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a novel method for cross-platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques toward developing safer online communities.
https://arxiv.org/abs/2404.11036
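The confidence-based reweighting idea can be sketched in a few lines; the threshold, the linear down-weighting rule, and the function names below are illustrative assumptions rather than HATE WATCH's actual formulation:

```python
import math

def confidence_weight(probs, tau=0.8):
    """Weight a weakly labeled example by the model's confidence.

    `probs` is a predicted class distribution; examples whose top
    probability falls below the threshold `tau` are down-weighted so
    that noisy labels contribute less to the training loss.  (The
    linear down-weighting and tau=0.8 are illustrative choices.)
    """
    conf = max(probs)
    return 1.0 if conf >= tau else conf / tau

def weighted_nll(probs, label, weight):
    # Standard negative log-likelihood scaled by the confidence weight.
    return -weight * math.log(probs[label])

# A confident prediction keeps full weight; an uncertain one is damped.
assert confidence_weight([0.95, 0.05]) == 1.0
assert confidence_weight([0.55, 0.45]) < 1.0
```

In a full pipeline this weight would multiply each example's loss term, letting training proceed without explicit target labels while discounting the least reliable weak labels.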
A major barrier to the practical deployment of large language models (LLMs) is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety. In all three cases, models should ideally abstain from responding, much like humans, whose understanding of uncertainty makes them refrain from answering questions they don't know. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of abstaining while uncertain in the context of LLMs within the domain of question answering. We investigate two kinds of uncertainty: statistical uncertainty metrics and a distinct verbalized measure, termed In-Dialogue Uncertainty (InDU). Using these uncertainty measures combined with models with and without Reinforcement Learning from Human Feedback (RLHF), we show that in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By sacrificing only a few highly uncertain samples, we can improve correctness by 2% to 8%, avoid 50% of hallucinations by correctly identifying unanswerable questions, and increase safety by 70% to 99%, with almost no additional computational overhead.
https://arxiv.org/abs/2404.10960
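A minimal sketch of uncertainty-based abstention, assuming a simple mean-token-entropy metric and an arbitrary threshold (the paper's statistical metrics and its verbalized InDU measure are more involved):

```python
import math

def token_entropy(dist):
    # Shannon entropy (in nats) of one token's predictive distribution.
    return -sum(p * math.log(p) for p in dist if p > 0)

def should_abstain(token_dists, threshold=0.5):
    """Abstain when the average per-token entropy exceeds a threshold.

    This is a toy statistical-uncertainty rule: high entropy means the
    model is spreading probability over many tokens.  The threshold of
    0.5 nats is an arbitrary illustrative value.
    """
    mean_h = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    return mean_h > threshold

# Peaked distributions -> low entropy -> answer; flat -> abstain.
confident = [[0.97, 0.01, 0.01, 0.01]] * 4
unsure = [[0.25, 0.25, 0.25, 0.25]] * 4
assert not should_abstain(confident)
assert should_abstain(unsure)
```

The abstract's finding is that applying a rule of this shape, with the uncertainty measure matched to the failure mode, trades a few abstentions for sizable gains in correctness, hallucination avoidance, and safety.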
Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" verify each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in a claim and to recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark, LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
https://arxiv.org/abs/2404.10774
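The decompose-and-verify loop behind this kind of fact-checking can be sketched as follows; the naive sentence splitter and the toy word-overlap verifier below are stand-ins for a trained model such as MiniCheck-FT5:

```python
def check_claim(claim, evidence, entails):
    """Split a claim into sentences and verify each against evidence.

    `entails(evidence, sentence)` stands in for a trained verifier;
    any callable returning a bool works here.  A claim is considered
    grounded only if every sentence is supported.
    """
    sentences = [s.strip() for s in claim.split(".") if s.strip()]
    return all(entails(evidence, s) for s in sentences)

# Toy verifier: a sentence counts as "supported" when every one of its
# words appears in the evidence.  A real verifier models entailment.
def word_overlap_entails(evidence, sentence):
    ev = set(evidence.lower().split())
    return all(w in ev for w in sentence.lower().split())

evidence = "the eiffel tower is in paris and opened in 1889"
assert check_claim("The Eiffel Tower is in Paris.", evidence, word_overlap_entails)
assert not check_claim("The Eiffel Tower is in London.", evidence, word_overlap_entails)
```

The cost argument in the abstract falls out of this loop: each sentence needs a verifier call, so replacing an LLM verifier with a small trained model multiplies the savings across every checked fact.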
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Notable applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and then apply actor-critic algorithms such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we comprehensively examine PPO and reveal the key factors behind its best performance in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experimental results demonstrate that PPO surpasses other alignment methods in all cases and achieves state-of-the-art results in challenging code competitions.
https://arxiv.org/abs/2404.10719
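For reference, the reward-free DPO objective the paper scrutinizes can be written down directly; this is the standard per-pair loss, with `beta` a free hyperparameter:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's log-probabilities of the chosen
    and rejected responses; ref_logp_* come from the frozen reference
    model.  The loss is -log sigmoid(beta * implicit reward margin).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2).
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2)
assert dpo_loss(-12.0, -10.0, -11.0, -11.0) > math.log(2)
```

Unlike PPO, nothing here requires a learned reward model or an actor-critic loop, which is exactly the simplicity-versus-performance trade-off the paper interrogates.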
Open conversations are one of the most engaging forms of teaching. However, creating such conversations in educational software is a complex endeavor, especially if we want to address the needs of different audiences. While language models hold great promise for educational applications, there are substantial challenges in training them to engage in meaningful and effective conversational teaching, especially given the diverse needs of various audiences, and no official datasets exist to facilitate such training. This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels (from preschooler to expert), namely dialogues taken from video transcripts. We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate and natural responses to scientific topics for specific target audiences. It is a freely available, valuable resource for training and evaluating conversation models, encompassing organically occurring dialogues. While the raw data is available online, we provide additional metadata for conversational analysis of dialogues at each level in all available videos.
https://arxiv.org/abs/2404.10475
Health coaching helps patients achieve personalized, lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial, yet cost-prohibitive, for low-socioeconomic-status populations due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of goals, and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform the previous state of the art while eliminating the need for a predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work, and a metric to measure the unconventionality of a patient's response based on data difficulty, facilitating potential coach alerts during deployment.
https://arxiv.org/abs/2404.10268
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. We further propose to use large language models (LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator's performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.
https://arxiv.org/abs/2404.09980
A key requirement in developing Generative Language Models (GLMs) is to have their values aligned with human values. Preference-based alignment is a widely used paradigm for this purpose, in which preferences over generation pairs are first elicited from human annotators or AI systems and then fed into an alignment technique, e.g., Direct Preference Optimization. However, a substantial proportion (20-40%) of the preference pairs used in GLM alignment are noisy, and it remains unclear how the noise affects alignment performance and how to mitigate its negative impact. In this paper, we propose a framework to inject desired amounts and types of noise into the preferences, and systematically study the impact of preference noise on alignment performance in two tasks (summarization and dialogue generation). We find that alignment performance can be highly sensitive to the noise rates in the preference data: e.g., a 10 percentage point (pp) increase in the noise rate can lead to a 30 pp drop in alignment performance (in win rate). To mitigate the impact of noise, confidence-based data filtering shows significant benefit when certain types of noise are present. We hope our work can help the community better understand and mitigate the impact of preference noise in GLM alignment.
https://arxiv.org/abs/2404.09824
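The noise-injection setup can be sketched as a controlled label flip plus confidence-based filtering; the flip-only noise model and the confidence threshold below are simplifying assumptions (the paper studies several noise types):

```python
import random

def inject_noise(pairs, noise_rate, rng):
    """Flip (chosen, rejected) preference pairs with probability
    `noise_rate`, mimicking random annotation noise."""
    return [(rej, cho) if rng.random() < noise_rate else (cho, rej)
            for cho, rej in pairs]

def filter_by_confidence(pairs, confidences, tau=0.7):
    # Confidence-based filtering: keep only pairs whose preference
    # label the annotator (or a model) is sufficiently sure about.
    return [p for p, c in zip(pairs, confidences) if c >= tau]

# Roughly 20% of pairs get flipped at a 0.2 noise rate.
rng = random.Random(0)
noisy = inject_noise([("good", "bad")] * 1000, 0.2, rng)
flipped = sum(1 for chosen, _ in noisy if chosen == "bad")
assert 140 < flipped < 260
```

Running an alignment method on `noisy` versus the clean pairs, at a sweep of noise rates, is the shape of the sensitivity study the abstract describes; `filter_by_confidence` is the shape of its mitigation.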
Compositional generalization is the ability of a model to generalize to complex, previously unseen combinations of entities from having seen only the primitives. This type of generalization is particularly relevant to the semantic parsing community for applications such as task-oriented dialogue, text-to-SQL parsing, and information retrieval, as they can harbor infinite complexity. Despite the success of large language models (LLMs) in a wide range of NLP tasks, unlocking perfect compositional generalization still remains one of the last unsolved frontiers. The past few years have seen a surge of interest in works that explore the limitations of, methods to improve, and evaluation metrics for the compositional generalization capabilities of LLMs on semantic parsing tasks. In this work, we present a literature survey geared at synthesizing recent advances in analysis, methods, and evaluation schemes, to offer a starting point for both practitioners and researchers in this area.
https://arxiv.org/abs/2404.13074
Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limit students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing a Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.
https://arxiv.org/abs/2404.13066
Health coaching helps patients identify and accomplish lifestyle-related goals, effectively improving the control of chronic diseases and mitigating mental health conditions. However, health coaching is cost-prohibitive due to its highly personalized and labor-intensive nature. In this paper, we propose to build a dialogue system that converses with the patients, helps them create and accomplish specific goals, and can address their emotions with empathy. However, building such a system is challenging since real-world health coaching datasets are limited and empathy is subtle. Thus, we propose a modularized health coaching dialogue system with simplified NLU and NLG frameworks combined with mechanism-conditioned empathetic response generation. Through automatic and human evaluation, we show that our system generates more empathetic, fluent, and coherent responses and outperforms the state-of-the-art in NLU tasks while requiring less annotation. We view our approach as a key step towards building automated and more accessible health coaching systems.
https://arxiv.org/abs/2404.08888
The conversational search task aims to enable a user to resolve information needs via natural language dialogue with an agent. In this paper, we aim to develop a conceptual framework of the actions and intents of users and agents explaining how these actions enable the user to explore the search space and resolve their information need. We outline the different actions and intents, before discussing key decision points in the conversation where the agent needs to decide how to steer the conversational search process to a successful and/or satisfactory conclusion. Essentially, this paper provides a conceptualization of the conversational search process between an agent and user, which provides a framework and a starting point for research, development and evaluation of conversational search agents.
https://arxiv.org/abs/2404.08630
Zero-shot dialogue state tracking (DST) transfers knowledge to unseen domains, reducing the cost of annotating new datasets. Previous zero-shot DST models mainly suffer from domain transfer and partial prediction problems. To address these challenges, we propose Mixture of Prefix Experts (MoPE), which establishes connections between similar slots in different domains and thereby strengthens model transfer performance in unseen domains. Empirical results demonstrate that MoPE-DST achieves joint goal accuracy of 57.13% on MultiWOZ2.1 and 55.40% on SGD.
https://arxiv.org/abs/2404.08559
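The routing intuition behind a mixture of prefix experts, sharing one expert among similar slots across domains, might be sketched like this; cosine similarity over slot embeddings and one-vector-per-expert centroids are illustrative choices, not the paper's exact design:

```python
def route_to_expert(slot_vec, expert_vecs):
    """Pick the prefix expert whose centroid is most similar to a slot.

    The idea: slots like "hotel-area" and "restaurant-area" from
    different domains land near the same expert, so knowledge learned
    in a seen domain transfers to an unseen one through the shared
    prefix.  All vectors here are toy embeddings.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    return max(range(len(expert_vecs)), key=lambda i: cos(slot_vec, expert_vecs[i]))

# Two toy experts; similar slot vectors route to the same expert.
experts = [[1.0, 0.0], [0.0, 1.0]]
assert route_to_expert([0.9, 0.1], experts) == 0
assert route_to_expert([0.2, 0.8], experts) == 1
```

In the real model the selected expert contributes a trainable prefix to the transformer's attention, rather than just an index.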
Detecting dialogue breakdown in real time is critical for conversational AI systems, because it enables taking corrective action to successfully complete a task. In spoken dialogue systems, breakdown can be caused by a variety of unexpected situations, including high levels of background noise causing STT mistranscriptions, or unexpected user flows. Industry settings like healthcare, in particular, require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes it both more challenging and more critical to accurately detect dialogue breakdown. We found that accurate detection requires processing audio inputs along with downstream NLP model inferences on the transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model, which significantly outperforms other known best models by achieving an F1 of 69.27.
https://arxiv.org/abs/2404.08156
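A late-fusion toy version of combining the audio and text signals; MultConDB's actual multimodal, contextual architecture is far richer, and the fusion weight and decision threshold here are assumptions:

```python
def fuse_breakdown_scores(audio_score, text_score, w_audio=0.4):
    """Late-fuse an audio-based and a text-based breakdown probability.

    A simple weighted average of two per-turn scores; the 0.4 audio
    weight is an arbitrary illustrative value.
    """
    return w_audio * audio_score + (1.0 - w_audio) * text_score

def is_breakdown(audio_score, text_score, threshold=0.5):
    # Flag a breakdown only when the fused evidence clears a threshold.
    return fuse_breakdown_scores(audio_score, text_score) > threshold

# Noisy audio alone doesn't trigger; agreement across modalities does.
assert not is_breakdown(0.9, 0.1)
assert is_breakdown(0.8, 0.7)
```

The point of fusing rather than relying on either stream alone matches the abstract's observation: background noise corrupts the transcript, so audio and downstream NLP inferences must be weighed together in real time.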
Current conversational AI systems employ different machine learning pipelines, as well as external knowledge sources and business logic, to predict the next action. Maintaining the various components in a dialogue manager's pipeline adds complexity to expansion and updates, increases processing time, and introduces additive noise through the pipeline that can lead to incorrect next-action prediction. This paper investigates integrating graphs into language transformers to improve understanding of the relationships between humans' utterances and previous and next actions, without depending on external sources or components. Experimental analyses on real calls indicate that the proposed Graph Integrated Language Transformer models achieve higher performance than other production-level conversational AI systems in driving interactive calls with human users in real-world settings.
https://arxiv.org/abs/2404.08155
A problem with many current Large Language Model (LLM) driven spoken dialogue systems is response time. Some efforts, such as Groq, address this issue through lightning-fast processing of the LLM, but we know from the cognitive psychology literature that in human-to-human dialogue, responses often occur before the speaker completes their utterance. No amount of delay for LLM processing is acceptable if we wish to maintain human dialogue latencies. In this paper, we discuss methods for understanding an utterance in close to real time and generating a response so that the system can comply with human-level conversational turn delays. This means that the information content of the final part of the speaker's utterance is lost to the LLM. Using the Google NaturalQuestions (NQ) database, our results show that GPT-4 can effectively fill in missing context from a dropped word at the end of a question over 60% of the time. We also provide examples of utterances and the impact of this information loss on the quality of the LLM response in the context of an avatar currently under development. These results indicate that a simple classifier could be used to determine whether a question is semantically complete or requires a filler phrase to allow a response to be generated within human dialogue time constraints.
https://arxiv.org/abs/2404.16053
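The proposed completeness classifier could be as simple as the following heuristic; the function-word list and the ends-in-a-content-word rule are placeholders meant only to illustrate the interface such a classifier would expose (a trained model would replace them):

```python
def is_semantically_complete(question):
    """Crude check that a truncated question can already be answered.

    Heuristic: if the last heard word is a function word, the question
    was likely cut off mid-phrase and the system should emit a filler
    phrase instead of committing to an answer.  Both the word list and
    the rule are illustrative stand-ins for a learned classifier.
    """
    function_words = {"the", "a", "an", "of", "in", "to", "is", "was", "their"}
    words = question.lower().rstrip("?").split()
    return bool(words) and words[-1] not in function_words

assert is_semantically_complete("who wrote hamlet")
assert not is_semantically_complete("who wrote the")
```

Wired into a streaming pipeline, a `False` result would trigger a filler phrase ("let me see...") while the LLM waits for, or guesses, the dropped final word, keeping the turn delay within human norms.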
We explore question generation in the context of knowledge-grounded dialogs, focusing on explainability and evaluation. Inspired by previous work on planning-based summarisation, we present a model which, instead of directly generating a question, sequentially predicts first a fact and then a question. We evaluate our approach on 37k test dialogs adapted from the KGConv dataset and show that, although more demanding in terms of inference, our approach performs on par with a standard model that solely generates a question, while allowing for a detailed referenceless evaluation of model behaviour in terms of relevance, factuality, and pronominalisation.
https://arxiv.org/abs/2404.07836
Existing datasets for audio understanding primarily focus on single-turn interactions (i.e. audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dialogues, Audio Dialogues also has question-answer pairs to understand and compare multiple input audios together. Audio Dialogues leverages a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues using a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues. Our code for generating the dataset will be made publicly available. Detailed prompts and generated dialogues can be found on the demo website this https URL.
https://arxiv.org/abs/2404.07616
Addressing the imminent shortfall of 10 million health workers by 2030, predominantly in Low- and Middle-Income Countries (LMICs), this paper introduces an innovative approach that harnesses the power of Large Language Models (LLMs) integrated with machine translation models. This solution is engineered to meet the unique needs of Community Health Workers (CHWs), overcoming language barriers, cultural sensitivities, and the limited availability of medical dialog datasets. I have crafted a model that not only boasts superior translation capabilities but also undergoes rigorous fine-tuning on open-source datasets to ensure medical accuracy and is equipped with comprehensive safety features to counteract the risks of misinformation. Featuring a modular design, this approach is specifically structured for swift adaptation across various linguistic and cultural contexts, utilizing open-source components to significantly reduce healthcare operational costs. This strategic innovation markedly improves the accessibility and quality of healthcare services by providing CHWs with contextually appropriate medical knowledge and diagnostic tools. This paper highlights the transformative impact of this context-aware LLM, underscoring its crucial role in addressing the global healthcare workforce deficit and propelling forward healthcare outcomes in LMICs.
https://arxiv.org/abs/2404.08705