Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models, which can dynamically adapt to user input and provide real-time interactive feedback. However, these methods typically require substantial computational resources to acquire this ability. To reduce the overhead, this paper presents a new duplex decoding approach that equips LLMs with duplex ability while requiring minimal additional training. Specifically, our method decodes queries and responses in parallel within a conversation, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that the proposed method significantly enhances the naturalness and human-likeness of user-AI interactions at minimal training cost.
https://arxiv.org/abs/2409.11727
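As a rough illustration of the parallel query/response decoding idea above, here is a minimal sketch that interleaves incoming user tokens and generated response tokens in one tagged context. The channel tags, the token queue (standing in for streamed user input), and the stub `next_response_token` function are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of channel-multiplexed ("duplex") decoding: user tokens and
# response tokens share one autoregressive context but carry a channel tag, so new
# user input can be folded in while the reply is still being generated.
from collections import deque

USER, MODEL = 0, 1  # channel tags (illustrative)

def next_response_token(context):
    """Stand-in for an LLM forward pass; returns the next response token."""
    return f"tok{len(context)}" if len(context) < 12 else "<eos>"

def duplex_decode(incoming_user_tokens: deque, max_steps: int = 32):
    context = []  # list of (channel, token) pairs
    for _ in range(max_steps):
        if incoming_user_tokens:                   # user spoke: fold the token in immediately
            context.append((USER, incoming_user_tokens.popleft()))
            continue                               # let the new input condition the next step
        tok = next_response_token(context)         # otherwise keep generating the reply
        context.append((MODEL, tok))
        if tok == "<eos>":
            break
    return context

print(duplex_decode(deque(["hello", "can", "you", "stop?"])))
```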
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on the synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn document-grounded benchmark test sets.
https://arxiv.org/abs/2409.11500
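A minimal sketch of the per-turn loop described above, with stand-in helpers for the taxonomy/CoT query generator, the retriever, the answer generator, and the LLM judge; all names and behaviors are assumptions for illustration only.

```python
# Illustrative pipeline: taxonomy-driven user turn -> refresh grounding docs via a
# retriever -> generate a grounded answer -> keep the turn only if the judge accepts it.
def sample_query(taxonomy_node, history):
    return f"question about {taxonomy_node} given {len(history)} prior turns"

def retrieve(query, k=3):
    return [f"doc_for::{query}::{i}" for i in range(k)]   # fake retriever

def answer(query, docs, history):
    return f"answer to '{query}' grounded in {len(docs)} docs"

def judge_is_correct(query, docs, response):
    return True                                            # LLM-as-a-Judge stand-in

def generate_dialog(taxonomy_path, n_turns=3):
    history = []
    for node in taxonomy_path[:n_turns]:
        q = sample_query(node, history)     # taxonomy/CoT-driven user turn
        grounding = retrieve(q)             # grounding refreshed after every user turn
        a = answer(q, grounding, history)
        if judge_is_correct(q, grounding, a):
            history.append((q, a))
    return history

print(generate_dialog(["billing", "refunds", "escalation"]))
```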
The workshop is affiliated with the 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2024), held August 26-30, 2024 in Pasadena, CA, USA. It is designed as a half-day event, running three and a half hours from 9:00 to 12:30 PST. It accommodates both in-person and virtual attendees (via Zoom), ensuring a flexible participation mode. The agenda includes a diverse range of sessions: two keynote speeches offering insightful perspectives, two dedicated paper presentation sessions, an interactive panel discussion that fosters dialogue among experts and facilitates deeper dives into specific topics, and a 15-minute coffee break. The workshop website: this https URL.
https://arxiv.org/abs/2409.11150
In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as \emph{near} OOD detection. For a thorough assessment, we benchmark our method against prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
https://arxiv.org/abs/2409.11114
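A minimal sketch of prototype-based ID classification with near-OOD rejection, assuming one prototype per ID class name and a cosine-similarity threshold; the random embeddings, class names, and threshold are illustrative stand-ins for LLM representations, not the paper's setup.

```python
# Classify an utterance to the nearest class-name prototype; if even the best
# similarity is below a threshold, reject it as OOD.
import numpy as np

rng = np.random.default_rng(0)
id_classes = ["book_flight", "cancel_flight", "check_balance"]
prototypes = {c: rng.normal(size=128) for c in id_classes}   # one prototype per ID class name

def classify(utterance_vec, tau=0.35):
    sims = {c: float(np.dot(utterance_vec, p) /
                     (np.linalg.norm(utterance_vec) * np.linalg.norm(p)))
            for c, p in prototypes.items()}
    best = max(sims, key=sims.get)
    return ("OOD", sims) if sims[best] < tau else (best, sims)

label, scores = classify(rng.normal(size=128))
print(label)
```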
Prior research has evaluated quantized LLMs using limited metrics such as perplexity, a few basic knowledge tasks, and old datasets. Additionally, recent large-scale models such as Llama 3.1, with up to 405B parameters, have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q\&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
https://arxiv.org/abs/2409.11055
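To make finding (1) concrete, a quick back-of-the-envelope comparison of weight memory (parameters times bits per weight) shows how a larger quantized model can fit the footprint of a smaller FP16 one; the model/bit-width pairs are illustrative and the figures ignore activations and KV cache.

```python
# Weight-only memory footprint: params * bits / 8 bytes.
def weight_gb(n_params_b: float, bits: int) -> float:
    return n_params_b * 1e9 * bits / 8 / 1e9   # GB

for name, params, bits in [("70B @ 4-bit", 70, 4), ("8B @ FP16", 8, 16),
                           ("405B @ 8-bit", 405, 8), ("70B @ FP16", 70, 16)]:
    print(f"{name:>14}: ~{weight_gb(params, bits):.0f} GB of weights")
```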
Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks like long-context summarization and dialogue-based meeting summarization. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM leverages a combination of chain-of-thought reasoning and key-fact alignment to assess the conciseness and completeness of model-generated summaries without requiring a reference. By employing an Elo ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
https://arxiv.org/abs/2409.10883
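The Elo-ranking idea can be sketched as follows: pairwise, reference-free comparisons between systems are folded into running ratings. This uses the standard Elo update; the K-factor, initial ratings, and toy comparison outcomes are conventional illustrative choices, not values from the paper.

```python
# Standard Elo update applied to pairwise summary comparisons.
def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """score_a = 1 if A's summary is judged better, 0 if B's, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

ratings = {"model_A": 1000.0, "model_B": 1000.0}
for winner, loser in [("model_A", "model_B"), ("model_A", "model_B"), ("model_B", "model_A")]:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], 1.0)
print(ratings)
```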
Dialog systems, such as voice assistants, are expected to engage with users in complex, evolving conversations. Unfortunately, traditional automatic speech recognition (ASR) systems deployed in such applications are usually trained to recognize each turn independently and lack the ability to adapt to the conversational context or incorporate user feedback. In this work, we introduce a general framework for ASR in dialog systems that goes beyond learning from single-turn utterances and learns over time how to adapt to both the explicit supervision and the implicit user feedback present in multi-turn conversations. We accomplish this by leveraging advances in student-teacher learning and context-aware dialog processing, and by designing contrastive self-supervision approaches with Ohm, a new online hard-negative mining approach. We show that our framework, compared to traditional training, leads to relative WER reductions of close to 10% in real-world dialog systems, and up to 26% on public synthetic data.
https://arxiv.org/abs/2409.10515
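To illustrate the general idea of online hard-negative mining in a contrastive objective (not the paper's Ohm implementation), the toy sketch below keeps, for each anchor, only the highest-scoring non-matching candidates in the current batch as negatives; the embeddings, temperature, and number of negatives are illustrative.

```python
# In-batch contrastive loss that selects the hardest negatives per anchor.
import numpy as np

def contrastive_loss_with_hard_negatives(anchors, positives, n_hard=2, temp=0.1):
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temp                                    # (B, B) similarity matrix
    losses = []
    for i in range(len(a)):
        order = np.argsort(sims[i])[::-1]
        neg_idx = [j for j in order if j != i][:n_hard]      # hardest in-batch negatives
        logits = np.concatenate(([sims[i, i]], sims[i, neg_idx]))
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))   # InfoNCE-style term
    return float(np.mean(losses))

rng = np.random.default_rng(0)
print(contrastive_loss_with_hard_negatives(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```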
Dialogue summarization aims to provide a concise and coherent summary of conversations between multiple speakers. While recent advancements in language models have enhanced this process, summarizing dialogues accurately and faithfully remains challenging due to the need to understand speaker interactions and capture relevant information. Indeed, abstractive models used for dialog summarization may generate summaries that contain inconsistencies. We suggest using the semantic information employed for Spoken Language Understanding (SLU) in human-machine dialogue systems to obtain summaries of goal-oriented human-human dialogues that are more semantically faithful to the task. This study makes three key contributions. First, we explore how incorporating task-related information can enhance the summarization process, leading to more semantically accurate summaries. Second, we introduce a new evaluation criterion based on task semantics. Finally, we propose a new dataset version with additional annotated data, standardized for research on task-oriented dialogue summarization. The study evaluates these methods on the DECODA corpus, a collection of French spoken dialogues from a call center. Results show that integrating task-related information into the models improves summary accuracy, even under varying word error rates.
https://arxiv.org/abs/2409.10070
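One hedged way to picture a "task semantics" criterion (an assumption, since the abstract does not specify the formula) is an F1-style overlap between the slot-value pairs found in a summary and the SLU annotation of the dialogue; the slot names and extraction step below are stubs.

```python
# Slot-value overlap between a dialogue's task annotation and its summary.
def slot_f1(reference_slots, summary_slots):
    ref, hyp = set(reference_slots), set(summary_slots)
    if not ref or not hyp:
        return 0.0
    tp = len(ref & hyp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = {("topic", "billing"), ("contract_id", "A42"), ("resolution", "refund")}
from_summary = {("topic", "billing"), ("resolution", "refund")}   # stubbed extraction
print(slot_f1(reference, from_summary))   # -> 0.8
```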
Estimating a model's confidence in its outputs is critical for conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, focusing specifically on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential for handling uncertainty and thereby improving model performance. We evaluate four methods for estimating confidence scores, based on softmax, raw token scores, verbalized confidences, and a combination of these, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these methods with a self-probing mechanism proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can yield enhanced AUC performance, indicating better confidence score calibration.
https://arxiv.org/abs/2409.09629
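A minimal sketch of one softmax-based confidence score and its AUC-based calibration check: per-token probabilities of a predicted slot value are collapsed into a sequence-level confidence (a geometric mean here, one common choice, not necessarily the paper's), and AUC is computed against correctness labels.

```python
# Turn token probabilities into a sequence confidence, then measure calibration with AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def sequence_confidence(token_probs):
    return float(np.exp(np.mean(np.log(token_probs))))   # geometric mean of token probs

# toy data: (per-token probabilities of the predicted value, was the slot correct?)
preds = [([0.9, 0.8, 0.95], 1), ([0.6, 0.4, 0.7], 0), ([0.99, 0.97], 1), ([0.5, 0.55], 0)]
scores = [sequence_confidence(p) for p, _ in preds]
labels = [y for _, y in preds]
print("AUC:", roc_auc_score(labels, scores))   # higher AUC = better-calibrated confidences
```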
Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both time-consuming and labor-intensive. This paper introduces an innovative \textbf{V}irtual \textbf{A}I \textbf{T}eacher system designed to autonomously analyze and correct student \textbf{E}rrors (VATE). Leveraging advanced large language models (LLMs), the system uses student drafts as a primary source for error analysis, which enhances understanding of the student's learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine-learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3\% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system's potential to transform educational practices.
https://arxiv.org/abs/2409.09403
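One plausible reading of the "error pool" as a way to cut computational overhead is a cache keyed by previously seen errors, so repeated mistakes reuse a stored analysis instead of triggering a new LLM call; the keying scheme and the `analyze_with_llm` stub are assumptions, not the paper's design.

```python
# Error-pool-as-cache sketch: pay for an LLM analysis only on unseen (problem, answer) pairs.
error_pool = {}

def analyze_with_llm(problem, student_answer):
    return f"diagnosis for '{student_answer}' on '{problem}'"   # stand-in LLM call

def analyze_error(problem, student_answer):
    key = (problem.strip().lower(), student_answer.strip().lower())
    if key not in error_pool:
        error_pool[key] = analyze_with_llm(problem, student_answer)
    return error_pool[key]

print(analyze_error("3/4 + 1/2 = ?", "4/6"))
print(analyze_error("3/4 + 1/2 = ?", "4/6"))  # second lookup is served from the pool
print(len(error_pool))                        # -> 1
```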
Scientific research indicates that for every hour spent in direct patient care, physicians spend nearly two additional hours on administrative tasks, particularly on electronic health records (EHRs) and desk work. This excessive administrative burden not only reduces the time available for patient care but also contributes to physician burnout and inefficiencies in healthcare delivery. To address these challenges, this study introduces MediGen, a fine-tuned large language model (LLM) designed to automate the generation of medical reports from medical dialogues. By leveraging state-of-the-art methodologies for fine-tuning open-source pretrained models, including LLaMA3-8B, MediGen achieves high accuracy in transcribing and summarizing clinical interactions. The fine-tuned LLaMA3-8B model demonstrated promising results, achieving a ROUGE score of 58% and a BERTScore-F1 of 72%, indicating its effectiveness in generating accurate and clinically relevant medical reports. These findings suggest that MediGen has the potential to significantly reduce the administrative workload on physicians, improving both healthcare efficiency and physician well-being.
https://arxiv.org/abs/2409.09324
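For readers unfamiliar with the reported metrics, the sketch below shows the kind of n-gram overlap behind a ROUGE score: unigram precision, recall, and F1 between a generated report and a reference (a simplified ROUGE-1 without stemming or multiple references, not the paper's exact evaluation code).

```python
# Simplified ROUGE-1 F1: unigram overlap between reference and generated text.
from collections import Counter

def rouge1_f(reference: str, generated: str) -> float:
    ref, gen = Counter(reference.lower().split()), Counter(generated.lower().split())
    overlap = sum((ref & gen).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("patient reports mild chest pain since monday",
               "patient has mild chest pain starting monday"))
```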
Small and medium-sized agricultural holders face challenges such as limited access to localized, timely information, which impacts productivity and sustainability. Traditional extension services, which rely on in-person agents, struggle with scalability and timely delivery, especially in remote areas. We introduce Farmer.Chat, a generative AI-powered chatbot designed to address these issues. Leveraging generative AI, Farmer.Chat offers personalized, reliable, and contextually relevant advice, overcoming limitations of previous chatbots in deterministic dialogue flows, language support, and unstructured data processing. Deployed in four countries, Farmer.Chat has engaged over 15,000 farmers and answered over 300,000 queries. This paper highlights how Farmer.Chat's innovative use of generative AI enhances agricultural service scalability and effectiveness. Our evaluation, combining quantitative analysis and qualitative insights, highlights Farmer.Chat's effectiveness in improving farming practices and enhancing trust, response quality, and user engagement.
https://arxiv.org/abs/2409.08916
To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate "discusses COVID". To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve over time, finding categories where one language model is better than another, clustering math problems by subarea, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can easily be steered to focus on specific properties (e.g., subareas), and explains sophisticated concepts that classical methods (e.g., n-gram analysis) struggle to produce.
https://arxiv.org/abs/2409.08466
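As a toy illustration of the "continuous relaxation of predicate parameters" idea for a single cluster (the candidate predicates, stubbed denotations, loss, and step size are all assumptions, not the paper's algorithm): each candidate natural-language predicate has a denotation vector over the samples, a softmax over predicate logits is optimized by gradient descent to match the cluster's membership, and the result is discretized by taking the top predicate.

```python
# Learn which predicate best explains a cluster via a relaxed (softmax) parameterization.
import numpy as np

predicates = ["discusses COVID", "asks for code", "mentions travel"]
# D[n, k] = 1 if predicate k holds for sample n (stub judgments standing in for an LM)
D = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]], float)
y = np.array([1, 1, 0, 1, 0], float)        # membership of the cluster to explain

w = np.zeros(len(predicates))               # predicate logits (continuous relaxation)
for _ in range(500):
    p = np.exp(w) / np.exp(w).sum()         # relaxed distribution over predicates
    resid = D @ p - y
    grad_p = 2 * D.T @ resid / len(y)       # gradient of mean squared error w.r.t. p
    grad_w = (np.diag(p) - np.outer(p, p)) @ grad_p   # chain rule through the softmax
    w -= 0.5 * grad_w

print("learned weights:", dict(zip(predicates, np.round(np.exp(w) / np.exp(w).sum(), 2))))
print("discretized predicate:", predicates[int(np.argmax(w))])
```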
Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations \textit{actually} reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating systematic divergence along multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly across languages. Our results suggest that LLMs generally perform better when the human user writes in a way that is more similar to the LLM's own style.
https://arxiv.org/abs/2409.08330
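As a hedged illustration of what "alignment along textual properties" can mean in practice, the sketch below compares two simple style features (average turn length and lexical diversity) between a human user's turns and a simulated user's turns; the study itself uses a broader and different set of measures.

```python
# Compare simple style features of paired human vs. simulated user turns.
def style_features(turns):
    tokens = " ".join(turns).lower().split()
    return {"avg_turn_len": sum(len(t.split()) for t in turns) / len(turns),
            "type_token_ratio": len(set(tokens)) / len(tokens)}

human_user_turns = ["How do I cancel my order?", "It was placed yesterday, number 123."]
simulated_user_turns = ["I would like to inquire about the cancellation procedure for a recent order.",
                        "The order in question was placed on the previous day."]

for name, turns in [("human", human_user_turns), ("simulated", simulated_user_turns)]:
    print(name, style_features(turns))
```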
This paper explores the efficacy of online versus offline evaluation methods in assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmark dataset of user dialogues with empathetic chatbots to include offline third-party evaluations, we present a systematic comparison between the feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model offer a better approximation of first-party human judgments when given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for integrating direct interaction feedback into conversational AI evaluation to enhance system development and user satisfaction.
https://arxiv.org/abs/2409.07823
Dialogue topic segmentation (DTS) plays a crucial role in various types of dialogue modeling tasks. State-of-the-art unsupervised DTS methods learn topic-aware discourse representations from conversation data through adjacent discourse matching and pseudo segmentation, further mining useful clues from unlabeled conversational relations. However, in multi-round dialogs, utterances often contain co-references or omissions, so directly using them for representation learning may negatively affect the semantic similarity computation in the adjacent discourse matching task. To fully utilize the useful cues in conversational relations, this study proposes a novel unsupervised dialogue topic segmentation method that combines the Utterance Rewriting (UR) technique with an unsupervised learning algorithm, rewriting the dialogues to recover co-referents and omitted words so that the useful cues in unlabeled dialogues can be exploited efficiently. Compared with existing unsupervised models, the proposed UR-DTS model significantly improves the accuracy of topic segmentation. The main finding is that performance on DialSeg711 improves by about 6% in terms of absolute error score and WD, reaching 11.42% in absolute error score and 12.97% in WD. On Doc2Dial, the absolute error score and WD improve by about 3% and 2%, respectively, reaching state-of-the-art results of 35.17% in absolute error score and 38.49% in WD. This shows that the model is very effective at capturing the nuances of conversational topics, and illustrates both the usefulness and the challenges of exploiting unlabeled conversations.
https://arxiv.org/abs/2409.07672
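For reference, the WD (WindowDiff) metric reported above can be sketched as follows: slide a window over the dialogue and count the positions where the reference and predicted segmentations disagree on the number of topic boundaries inside the window (lower is better). The boundary sequences and window size below are toy values.

```python
# WindowDiff: fraction of windows where reference and hypothesis boundary counts differ.
def window_diff(ref_bounds, hyp_bounds, k):
    # ref_bounds / hyp_bounds: 1 if a topic boundary follows utterance i, else 0
    n = len(ref_bounds)
    errors = sum(
        abs(sum(ref_bounds[i:i + k]) - sum(hyp_bounds[i:i + k])) > 0
        for i in range(n - k)
    )
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
print(window_diff(ref, hyp, k=3))
```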
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to first collect dialogues under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55\% more cases.
https://arxiv.org/abs/2409.07641
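A rough sketch of the evaluation flow described above, with stand-in functions for the fixed user-agent LLM, the target LLM, and the GPT-4 judge; the turn counts, prompts, and score are illustrative assumptions.

```python
# Collect a multi-turn script with a fixed user agent, then score the target's final response.
def user_agent(task, history):
    return f"user turn {len(history) // 2 + 1} for task '{task}'"   # fixed LLM simulating a user

def target_llm(history):
    return f"assistant reply to: {history[-1]}"

def judge(task, script, final_response):
    return 8.5    # stand-in for a GPT-4 quality score of the final response

def evaluate_target_on_task(task, n_turns=3):
    history = []
    for _ in range(n_turns):
        history.append(user_agent(task, history))
        history.append(target_llm(history))
    script, final = history[:-1], history[-1]
    return judge(task, script, final)

print(evaluate_target_on_task("act as a Linux terminal"))
```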
In the context of Human-Robot Collaboration (HRC), it is crucial that the two actors are able to communicate with each other in a natural and efficient manner. The absence of a communication interface is often a cause of undesired slowdowns. On one hand, this is because unforeseen events may occur, leading to errors. On the other hand, due to the close contact between humans and robots, the speed must be reduced significantly to comply with the safety standard ISO/TS 15066. In this paper, we propose a novel architecture that enables operators and robots to communicate efficiently, emulating human-to-human dialogue, while addressing safety concerns. This approach aims to establish a communication framework that not only facilitates collaboration but also reduces undesired speed reductions. Through the use of a predictive simulator, we can anticipate safety-related limitations, ensuring smoother workflows, minimizing risks, and optimizing efficiency. The overall architecture has been validated with a UR10e and compared with a state-of-the-art technique. The results show a significant improvement in user experience, with a 23% reduction in execution times and a 50% decrease in robot downtime.
https://arxiv.org/abs/2409.07158
In the rapidly evolving landscape of Human-Robot Collaboration (HRC), effective communication between humans and robots is crucial for complex task execution. Traditional request-response systems often lack naturalness and may hinder efficiency. This study emphasizes the importance of adopting human-like communication interactions to enable fluent vocal communication between human operators and robots in a simulated collaborative human-robot industrial assembly. We propose a novel approach that employs human-like interactions through natural dialogue, enabling human operators to engage in vocal conversations with robots. Through a comparative experiment, we demonstrate the efficacy of our approach in enhancing task performance and collaboration efficiency. The robot's ability to engage in meaningful vocal conversations enables it to seek clarification, provide status updates, and ask for assistance when required, leading to improved coordination and a smoother workflow. The results indicate that the adoption of human-like conversational interactions positively influences the human-robot collaborative dynamic. Human operators find it easier to convey complex instructions and preferences, resulting in a more productive and satisfying collaboration experience.
https://arxiv.org/abs/2409.07145
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
https://arxiv.org/abs/2409.06820
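A minimal sketch of the three-component loop (player, interrogator, judge); each function is a stand-in for an LLM call, and the example character, turn count, and rubric are assumptions rather than the benchmark's actual prompts or criteria.

```python
# Player/interrogator/judge loop for a single role-play episode.
def interrogator(history):
    return f"probing question {len(history) // 2 + 1}"

def player(character, history):
    return f"{character}: in-character answer to '{history[-1]}'"

def judge(character, dialogue):
    return {"character_consistency": 4, "fluency": 5}   # stand-in rubric scores

def run_episode(character="Sherlock Holmes", n_turns=3):
    dialogue = []
    for _ in range(n_turns):
        dialogue.append(interrogator(dialogue))
        dialogue.append(player(character, dialogue))
    return judge(character, dialogue)

print(run_episode())
```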