Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often face strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify how they communicate, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally adapts under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for safeguarding freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent that enforces dialogue supervision, and participant agents that evolve their language strategies while engaging in conversation, simulating how communication styles evolve to evade regulation. The study evaluates the framework's effectiveness across a range of scenarios, from abstract settings to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, LLM agents are found to adopt different strategies in different scenarios.
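The supervisor/participant interaction can be caricatured in a few lines; the rule-based supervisor and fixed code-word table below are stand-ins for the LLM-driven agents the framework actually uses, and all names are hypothetical:

```python
# Toy sketch of the simulation loop: a rule-based supervisor stands in
# for the LLM supervisory agent, and a participant applies an "evolved"
# strategy by swapping banned terms for code words.
BANNED = {"protest"}

def supervisor_blocks(message):
    """Block any message containing a banned word."""
    return any(word in BANNED for word in message.split())

def apply_strategy(strategy, message):
    """Rewrite the message with the participant's code-word table; real
    participant agents would ask an LLM to rewrite, conditioned on past blocks."""
    return " ".join(strategy.get(w, w) for w in message.split())

strategy = {"protest": "picnic"}          # hypothetical evolved code word
msg = "meet at the protest at noon"
coded = apply_strategy(strategy, msg)
print(supervisor_blocks(msg), supervisor_blocks(coded))  # → True False
```

In the actual framework both the supervision decision and the rewriting strategy come from LLM prompting rather than fixed word lists.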
https://arxiv.org/abs/2405.02858
This paper introduces Stochastic RAG--a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence made in most prior work. Stochastic RAG casts retrieval in RAG as a process of stochastic sampling without replacement. Through this formulation, we employ straight-through Gumbel-top-k, which provides a differentiable approximation of sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets spanning a wide range of tasks, from open-domain question answering to fact verification, slot filling for relation extraction, and dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six out of seven datasets.
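The Gumbel-top-k trick at the core of this formulation can be sketched in a few lines (forward sampling only; the straight-through gradient estimator used for training is omitted, and the scores are invented):

```python
import math
import random

def gumbel_top_k(scores, k, rng=random):
    """Sample k distinct indices without replacement: perturb each
    (log-)score with i.i.d. Gumbel noise and keep the top k."""
    perturbed = [s - math.log(-math.log(rng.random())) for s in scores]
    return sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)[:k]

random.seed(0)
doc_scores = [2.0, 0.5, 1.5, -1.0, 0.0]   # retrieval scores for 5 documents
sample = gumbel_top_k(doc_scores, k=3)
print(sample)
```

Adding independent Gumbel noise to the scores and taking the k largest yields a sample without replacement from the softmax distribution over the scores, which is what makes the relaxation differentiable when paired with a straight-through estimator.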
https://arxiv.org/abs/2405.02816
Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSAs). The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided as assessments of relevance, along with additional assessments of generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.
https://arxiv.org/abs/2405.02637
This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks in 1) generating grounded responses with recommendation-oriented knowledge and 2) proactively leading the conversations through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance, which contribute significantly to recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework that decomposes the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external Knowledge Bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving language quality in informativeness by 17% and proactivity by 27%, and achieving a tenfold enhancement in recommendation accuracy.
https://arxiv.org/abs/2405.01868
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics of LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: \url{this https URL}.
https://arxiv.org/abs/2405.01769
The design of dialogue flows is a critical but time-consuming task when developing task-oriented dialogue (TOD) systems. We propose an approach for the unsupervised discovery of flows from dialogue history, making the process applicable to any domain for which such a history is available. Briefly, utterances are represented in a vector space and clustered according to their semantic similarity. The clusters, which can be seen as dialogue states, are then used as the vertices of a transition graph for representing the flows visually. We present concrete examples of flows discovered from MultiWOZ, a public TOD dataset. We further elaborate on their significance and relevance for the underlying conversations and introduce an automatic validation metric for their assessment. Experimental results demonstrate the potential of the proposed approach for extracting meaningful flows from task-oriented conversations.
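Once utterances are clustered, building the transition graph over dialogue states reduces to counting consecutive cluster pairs. A minimal sketch, assuming the clustering step has already assigned a label to each turn (the labels and dialogues below are invented, not from MultiWOZ):

```python
from collections import Counter

# Hypothetical cluster labels per dialogue turn, standing in for the
# clusters produced by embedding + semantic clustering (not shown).
dialogues = [
    ["greet", "ask_hotel", "give_slots", "confirm", "bye"],
    ["greet", "ask_hotel", "confirm", "bye"],
    ["greet", "ask_taxi", "give_slots", "bye"],
]

# Count consecutive state pairs across all dialogues.
transitions = Counter()
for d in dialogues:
    for src, dst in zip(d, d[1:]):
        transitions[(src, dst)] += 1

# Normalize into transition probabilities: the edges of the flow graph.
out_totals = Counter()
for (src, _), n in transitions.items():
    out_totals[src] += n
graph = {edge: n / out_totals[edge[0]] for edge, n in transitions.items()}
print(graph[("greet", "ask_hotel")])  # 2 of 3 dialogues → 0.666...
```

Edges with very low probability can then be pruned before visualizing the graph as a flow diagram.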
https://arxiv.org/abs/2405.01403
Reduced articulatory precision is common in speech, but for dialog its acoustic properties and pragmatic functions have been little studied. We here try to remedy this gap. This technical report contains content that was omitted from the journal article (Ward et al. 2024, submitted). Specifically, we report 1) lessons learned about annotating for perceived reduction, 2) the finding that, unlike in read speech, the correlates of reduction in dialog include high pitch, wide pitch range, and intensity, and 3) a baseline model for predicting reduction in dialog, using simple acoustic/prosodic features, that achieves correlations with human perceptions of 0.24 for English and 0.17 for Spanish. We also provide examples of additional possible pragmatic functions of reduction in English, along with various discussion, observations, and speculations.
https://arxiv.org/abs/2405.01376
Understanding user enjoyment is crucial in human-robot interaction (HRI), as it can impact interaction quality and influence user acceptance and long-term engagement with robots, particularly in the context of conversations with social robots. However, current assessment methods rely solely on self-reported questionnaires, failing to capture interaction dynamics. This work introduces the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), a novel scale for assessing user enjoyment from an external perspective during conversations with a robot. Developed through rigorous evaluations and discussions of three annotators with relevant expertise, the scale provides a structured framework for assessing enjoyment in each conversation exchange (turn) alongside overall interaction levels. It aims to complement self-reported enjoyment from users and holds the potential for autonomously identifying user enjoyment in real-time HRI. The scale was validated on 25 older adults' open-domain dialogue with a companion robot that was powered by a large language model for conversations, corresponding to 174 minutes of data, showing moderate to good alignment. Additionally, the study offers insights into understanding the nuances and challenges of assessing user enjoyment in robot interactions, and provides guidelines on applying the scale to other domains.
https://arxiv.org/abs/2405.01354
Active participation in a conversation is key to building common ground, since understanding is jointly tailored by producers and recipients. Overhearers are deprived of the privilege of performing grounding acts and can only conjecture about intended meanings. Still, data generation and annotation, modelling, training and evaluation of NLP dialogue models place reliance on the overhearing paradigm. How much of the underlying grounding processes are thereby forfeited? As we show, there is evidence pointing to the impossibility of properly modelling human meta-communicative acts with data-driven learning models. In this paper, we discuss this issue and provide a preliminary analysis on the variability of human decisions for requesting clarification. Most importantly, we wish to bring this topic back to the community's table, encouraging discussion on the consequences of having models designed to only "listen in".
https://arxiv.org/abs/2405.01139
Existing methods for creating source-grounded information-seeking dialog datasets are often costly and hard to implement due to their sole reliance on human annotators. We propose combining large language model (LLM) prompting with human expertise for more efficient and reliable data generation. Instead of the labor-intensive Wizard-of-Oz (WOZ) method, in which two annotators generate a dialog from scratch while role-playing agent and user, we use LLM generation to simulate the two roles. Annotators then verify the output and augment it with attribution data. We demonstrate our method by constructing MISeD -- Meeting Information Seeking Dialogs dataset -- the first information-seeking dialog dataset focused on meeting transcripts. Models fine-tuned with MISeD demonstrate superior performance on our test set, as well as on a novel fully-manual WOZ test set and an existing query-based summarization benchmark, suggesting the utility of our approach.
https://arxiv.org/abs/2405.01121
Designing preference elicitation (PE) methodologies that can quickly ascertain a user's top item preferences in a cold-start setting is a key challenge for building effective and personalized conversational recommendation (ConvRec) systems. While large language models (LLMs) constitute a novel technology that enables fully natural language (NL) PE dialogues, we hypothesize that monolithic LLM NL-PE approaches lack the multi-turn, decision-theoretic reasoning required to effectively balance the NL exploration and exploitation of user preferences towards an arbitrary item set. In contrast, traditional Bayesian optimization PE methods define theoretically optimal PE strategies, but fail to use NL item descriptions or generate NL queries, unrealistically assuming users can express preferences with direct item ratings and comparisons. To overcome the limitations of both approaches, we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to generate NL queries which actively elicit natural language feedback, reducing uncertainty over item utilities in order to identify the best recommendation. We demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses Natural Language Inference (NLI) between user preference utterances and NL item descriptions to maintain preference beliefs, and BO strategies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation. We numerically evaluate our methods in controlled experiments, finding that PEBOL achieves up to 131% improvement in MAP@10 after 10 turns of cold-start NL-PE dialogue compared to monolithic GPT-3.5, despite relying on a much smaller 400M parameter NLI model for preference inference.
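The Thompson Sampling component can be illustrated with a Beta-Bernoulli toy model. PEBOL's actual belief updates come from NLI entailment scores over NL item descriptions, so the yes/no simulated user, the item names, and the turn budget below are all stand-in assumptions:

```python
import random

random.seed(1)
items = ["thriller", "rom-com", "documentary", "anime"]
true_likes = {"thriller", "anime"}            # hypothetical ground truth
# Beta(a, b) belief over each item's utility.
beliefs = {it: [1.0, 1.0] for it in items}

for turn in range(20):
    # Thompson Sampling: draw one utility per item, query the argmax.
    sampled = {it: random.betavariate(a, b) for it, (a, b) in beliefs.items()}
    query = max(sampled, key=sampled.get)
    liked = query in true_likes               # simulated yes/no feedback
    beliefs[query][0 if liked else 1] += 1    # posterior update

best = max(items, key=lambda it: beliefs[it][0] / sum(beliefs[it]))
print(best)
```

Sampling from the posterior (rather than querying the current best mean) is what balances exploration against exploitation across turns.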
https://arxiv.org/abs/2405.00981
In this study, we introduce Generative Manufacturing Systems (GMS) as a novel approach to effectively manage and coordinate autonomous manufacturing assets, thereby enhancing their responsiveness and flexibility to address a wide array of production objectives and human preferences. Deviating from traditional explicit modeling, GMS employs generative AI, including diffusion models and ChatGPT, for implicit learning from envisioned futures, marking a shift from model-optimum to training-sampling decision-making. Through the integration of generative AI, GMS enables complex decision-making through interactive dialogue with humans, allowing manufacturing assets to generate multiple high-quality global decisions that can be iteratively refined based on human feedback. Empirical findings showcase GMS's substantial improvement in system resilience and responsiveness to uncertainties, with decision times reduced from seconds to milliseconds. The study underscores the inherent creativity and diversity in the generated solutions, facilitating human-centric decision-making through seamless and continuous human-machine interactions.
https://arxiv.org/abs/2405.00958
Large language models (LLMs) have proved to be very powerful on different NLP tasks. However, there are still many ways to attack such models at very low cost, so how to defend them becomes an important problem. In our work, we treat adversarial attack results as a new (unseen) domain of the model, and we frame the defense problem as improving the robustness of the model on this new domain. We focus on the task of conversation entailment, where multi-turn natural language dialogues are the premise and a transformer model is fine-tuned to predict whether a given hypothesis about the given dialogue is true or false. The adversary attacks the hypothesis to fool the model into making wrong predictions. We apply synonym swapping as the attack method. To improve the robustness of the model, we implement several fine-tuning strategies and propose an embedding perturbation loss. Finally, we show the importance of our work by discussing real-world adversarial attacks in NLP.
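A synonym-swapping attack of the kind described can be sketched as follows; the synonym table and example hypothesis are invented, and real attacks typically draw candidates from embedding neighborhoods or a thesaurus rather than a hand-written dictionary:

```python
import random

SYNONYMS = {  # tiny hypothetical synonym table
    "happy": ["glad", "pleased"],
    "big": ["large", "huge"],
    "said": ["stated", "remarked"],
}

def synonym_swap(hypothesis, swap_prob=0.5, rng=random):
    """Return an adversarial variant of the hypothesis where known
    words are probabilistically replaced by label-preserving synonyms."""
    out = []
    for tok in hypothesis.split():
        alts = SYNONYMS.get(tok.lower())
        out.append(rng.choice(alts) if alts and rng.random() < swap_prob else tok)
    return " ".join(out)

random.seed(0)
hypothesis = "the manager said he was happy with the big launch"
attacked = synonym_swap(hypothesis, swap_prob=1.0)
print(attacked)
```

Because the swaps preserve the hypothesis label, any prediction flip on the attacked text counts as a successful attack, and such attacked examples form the "new domain" the defense targets.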
https://arxiv.org/abs/2405.00289
Large Language Models (LLMs) have catalyzed significant advancements in Natural Language Processing (NLP), yet they encounter challenges such as hallucination and the need for domain-specific knowledge. To mitigate these, recent methodologies have integrated information retrieved from external resources with LLMs, substantially enhancing their performance across NLP tasks. This survey paper addresses the absence of a comprehensive overview on Retrieval-Augmented Language Models (RALMs), both Retrieval-Augmented Generation (RAG) and Retrieval-Augmented Understanding (RAU), providing an in-depth examination of their paradigm, evolution, taxonomy, and applications. The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations, and how their interactions lead to diverse model structures and applications. RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications. The survey includes several evaluation methods of RALMs, emphasizing the importance of robustness, accuracy, and relevance in their assessment. It also acknowledges the limitations of RALMs, particularly in retrieval quality and computational efficiency, offering directions for future research. In conclusion, this survey aims to offer a structured insight into RALMs, their potential, and the avenues for their future development in NLP. The paper is supplemented with a Github Repository containing the surveyed works and resources for further study: this https URL.
https://arxiv.org/abs/2404.19543
Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom $\textit{My Own Swordsman}$. It includes 200 carefully handcrafted questions, all annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on the multiple-choice questions, followed by CausalLM at 78.5%. Other models, including GPT-3.5 and several open-source models, demonstrate lower accuracy on the multiple-choice questions, ranging from 20% to 60%. Human raters were asked to rate the explanations of the implicatures generated by the LLMs on their reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4's, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find that LLMs' performance does not vary significantly across Gricean maxims, suggesting that LLMs do not process implicatures derived from different maxims differently. Our data and code are available at this https URL.
https://arxiv.org/abs/2404.19509
Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts, and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across the domains of executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone such training.
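The retrieval step over a shared example memory can be sketched with a simple token-overlap similarity. HELPER-style systems use learned embeddings, so the Jaccard scorer, the memory entries, and the planner DSL below are illustrative assumptions:

```python
def jaccard(a, b):
    """Token-overlap similarity between two instructions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

MEMORY = [  # (instruction, program) pairs across domains -- all invented
    ("put the mug in the sink", "goto(mug); pick(mug); goto(sink); place(mug)"),
    ("ask where the keys are", "say('Where are the keys?')"),
    ("tidy the bedroom", "for o in misplaced(): restore(o)"),
]

def retrieve(instruction, k=2):
    """Return the k most similar memory entries, to be pasted into the
    LLM prompt as in-context language-program examples."""
    return sorted(MEMORY, key=lambda ex: jaccard(ex[0], instruction), reverse=True)[:k]

examples = retrieve("put the keys in the drawer")
print(examples[0][0])  # → put the mug in the sink
```

Pooling examples from all domains into one memory, as above, is what lets a single agent cover dialogue, instruction following, question asking, and tidying.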
https://arxiv.org/abs/2404.19065
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: this https URL.
https://arxiv.org/abs/2404.18930
Recent research in dialogue systems and corpora has focused on two main categories: task-oriented (TOD) and open-domain (chit-chat) dialogues. TOD systems help users accomplish specific tasks, while open-domain systems aim to create engaging conversations. However, in real-world scenarios, user intents are often revealed during interactions. A recent study introduced SalesBot, which simulates dialogues transitioning from chit-chat to task-oriented scenarios to train sales agents. Unfortunately, the initial data lacked smooth transitions and coherent long-turn dialogues, resulting in poor naturalness in sales-customer interactions. To address these issues, this paper presents SalesBot 2.0, an improved dataset. It leverages commonsense knowledge from large language models (LLMs) through strategic prompting. Additionally, we introduce a novel model called SalesAgent, trained on salesperson's interactions, using chain-of-thought (CoT) reasoning. This model excels in transitioning topics, understanding user intents, and selecting appropriate strategies. Experiments using diverse user simulations validate the effectiveness of our method in controlling dialogue strategies in LLMs. Furthermore, SalesBot 2.0 enhances coherence and reduces aggression, facilitating better model learning for sales-customer interactions.
https://arxiv.org/abs/2404.18564
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
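The uncertainty-guided search strategy is not specified in detail here; one plausible toy reading is self-consistency sampling that escalates to web search when sampled answers disagree. The threshold, the trigger rule, and the stand-in answers below are all assumptions, not the paper's actual method:

```python
from collections import Counter

def needs_search(sampled_answers, agreement_threshold=0.7):
    """Escalate to (web) search when self-consistency samples disagree --
    a toy stand-in for an uncertainty-guided search trigger."""
    votes = Counter(sampled_answers)
    top_fraction = votes.most_common(1)[0][1] / len(sampled_answers)
    return top_fraction < agreement_threshold

# Hypothetical answer samples drawn from the model for one question:
print(needs_search(["A", "A", "A", "A", "B"]))  # 0.8 agreement → False
print(needs_search(["A", "B", "C", "A", "B"]))  # 0.4 agreement → True
```

The appeal of such a gate is that the expensive retrieval step is spent only on questions where the model is demonstrably unsure.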
https://arxiv.org/abs/2404.18416
With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address a single task, such as multi-turn dialogue, coding, mathematical problem-solving, or tool usage. However, AI-driven products that leverage language models usually necessitate a fusion of these abilities to function effectively in real-world scenarios. Moreover, the considerable computational resources required for proper alignment of LLMs underscore the need for a more robust, efficient, and encompassing approach to multi-task alignment, ensuring improved generative performance. In response to these challenges, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction concatenation combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool-use tasks.
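The instruction-concatenation idea can be sketched as follows; the turn markup, the system prompts, and the sampling scheme are illustrative guesses rather than the paper's actual training recipe:

```python
import random

SYSTEM_PROMPTS = [  # diverse system prompts (invented examples)
    "You are a helpful assistant.",
    "You are an expert programmer.",
    "You are a careful math tutor.",
]

def mix_instructions(samples, k=2, rng=random):
    """Concatenate k single-task instruction/response pairs under one
    randomly chosen system prompt, yielding one multi-task sample."""
    turns = [
        f"<user>{instr}</user>\n<assistant>{resp}</assistant>"
        for instr, resp in rng.sample(samples, k)
    ]
    return rng.choice(SYSTEM_PROMPTS) + "\n" + "\n".join(turns)

random.seed(0)
pool = [("Sort [3,1,2].", "[1, 2, 3]"), ("What is 2+2?", "4"), ("Reverse 'ab'.", "'ba'")]
mixed = mix_instructions(pool)
print(mixed)
```

Training on such mixed samples exposes the model, within a single context, to instructions drawn from different tasks under varied system prompts, which is the intuition behind MoI's alignment efficiency.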
https://arxiv.org/abs/2404.18410