While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
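To make the retrieval-via-attention idea concrete, here is a minimal sketch (not the paper's implementation, which trains landmark tokens jointly with a grouped softmax): each block is scored via its landmark key, the top-k blocks are retrieved, and ordinary attention is then run over only those blocks. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def retrieve_blocks_via_landmarks(query, landmark_keys, block_keys, block_values, k=2):
    """Score each block by its landmark key, keep the top-k blocks, then run
    ordinary attention over only the retrieved tokens (toy, single-head)."""
    d = query.shape[0]
    block_scores = landmark_keys @ query                    # one relevance score per block
    top_blocks = np.argsort(block_scores)[-k:]              # indices of retrieved blocks
    keys = block_keys[top_blocks].reshape(-1, d)
    values = block_values[top_blocks].reshape(-1, d)
    logits = keys @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())                 # softmax over retrieved tokens
    weights /= weights.sum()
    return weights @ values, top_blocks

# Usage: 8 blocks of 16 tokens each, 64-dimensional head
rng = np.random.default_rng(0)
out, picked = retrieve_blocks_via_landmarks(
    rng.normal(size=64), rng.normal(size=(8, 64)),
    rng.normal(size=(8, 16, 64)), rng.normal(size=(8, 16, 64)))
```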
https://arxiv.org/abs/2305.16300
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at this https URL.
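A rough sketch of the loop the abstract describes — automatic curriculum, iterative prompting with environment feedback, and an executable skill library — is shown below; `query_llm`, `execute`, and `propose_task` are hypothetical placeholders rather than Voyager's actual API.

```python
def voyager_style_loop(query_llm, execute, propose_task, skill_library, rounds=4):
    """Hypothetical sketch: propose a task (automatic curriculum), ask the LLM for
    executable code, run it, and feed environment feedback and errors back into
    the prompt until the program works, then store it in the skill library."""
    for _ in range(rounds):
        task = propose_task(skill_library)                 # curriculum maximizing exploration
        feedback = ""
        for _attempt in range(3):                          # iterative prompting mechanism
            prompt = (f"Task: {task}\nKnown skills: {list(skill_library)}\n"
                      f"{feedback}\nWrite code to accomplish the task.")
            code = query_llm(prompt)                       # black-box GPT-4 query
            ok, env_feedback, errors = execute(code)       # run in the environment
            if ok:
                skill_library[task] = code                 # store the verified skill
                break
            feedback = f"Environment feedback: {env_feedback}\nErrors: {errors}"
    return skill_library
```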
https://arxiv.org/abs/2305.16291
Dialogue data in real scenarios tends to be sparsely available, leaving data-starved end-to-end dialogue systems inadequately trained. We find that data utilization efficiency in low-resource scenarios can be enhanced by mining the alignment between uncertain utterances and deterministic dialogue states. We therefore implement dual learning in task-oriented dialogues to exploit the correlation between these heterogeneous data. In addition, the one-to-one duality is converted into a multijugate duality to reduce the influence of spurious correlations in dual training and improve generalization. Without introducing additional parameters, our method can be implemented in arbitrary networks. Extensive empirical analyses demonstrate that the proposed method improves the effectiveness of end-to-end task-oriented dialogue systems on multiple benchmarks and obtains state-of-the-art results in low-resource scenarios.
https://arxiv.org/abs/2305.16106
Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge undergoes two-stage training. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent on our newly proposed multimodal instruction tuning dataset, named MULTIS, which covers 16 multimodal tasks across text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All codes, data, and models of ChatBridge will be open-sourced.
https://arxiv.org/abs/2305.16103
Integrating Computer-Assisted Diagnosis (CAD) with Large Language Models (LLMs) shows promise for clinical applications, particularly in digital family doctor and clinic assistant roles. However, existing works have limitations in terms of reliability, effectiveness, and narrow applicability to specific image domains, which restricts their overall processing capabilities. Moreover, the mismatch in writing style between LLMs and radiologists undermines their practical utility. To address these challenges, we present ChatCAD+, an interactive CAD system that is universal, reliable, and capable of handling medical images from diverse domains. ChatCAD+ utilizes up-to-date information obtained from reputable medical websites to offer precise medical advice. Additionally, it incorporates a template retrieval system that emulates real-world diagnostic reporting, thereby improving its seamless integration into existing clinical workflows. The source code is available at \href{this https URL}{GitHub}. The online demo will be available soon.
https://arxiv.org/abs/2305.15964
Longitudinal Dialogues (LD) are the most challenging type of conversation for human-machine dialogue systems. LDs include the recollections of events, personal thoughts, and emotions specific to each individual in a sparse sequence of dialogue sessions. Dialogue systems designed for LDs should uniquely interact with the users over multiple sessions and long periods of time (e.g. weeks), and engage them in personal dialogues to elaborate on their feelings, thoughts, and real-life events. In this paper, we study the task of response generation in LDs. We evaluate whether general-purpose Pre-trained Language Models (PLM) are appropriate for this purpose. We fine-tune two PLMs, GePpeTto (GPT-2) and iT5, using a dataset of LDs. We experiment with different representations of the personal knowledge extracted from LDs for grounded response generation, including the graph representation of the mentioned events and participants. We evaluate the performance of the models via automatic metrics and the contribution of the knowledge via the Integrated Gradients technique. We categorize the natural language generation errors via human evaluations of contextualization, appropriateness and engagement of the user.
https://arxiv.org/abs/2305.15908
Efficient utilisation of both intra- and extra-textual context remains one of the critical gaps between machine and human translation. Existing research has primarily focused on providing individual, well-defined types of context in translation, such as the surrounding text or discrete external variables like the speaker's gender. This work introduces MTCue, a novel neural machine translation (NMT) framework that interprets all context (including discrete variables) as text. MTCue learns an abstract representation of context, enabling transferability across different data settings and leveraging similar attributes in low-resource scenarios. With a focus on a dialogue domain with access to document and metadata context, we extensively evaluate MTCue in four language pairs in both translation directions. Our framework demonstrates significant improvements in translation quality over a parameter-matched non-contextual baseline, as measured by BLEU (+0.88) and Comet (+1.58). Moreover, MTCue significantly outperforms a "tagging" baseline at translating English text. Analysis reveals that the context encoder of MTCue learns a representation space that organises context based on specific attributes, such as formality, enabling effective zero-shot control. Pre-training on context embeddings also improves MTCue's few-shot performance compared to the "tagging" baseline. Finally, an ablation study conducted on model components and contextual variables further supports the robustness of MTCue for context-based NMT.
https://arxiv.org/abs/2305.15904
We investigate the phenomenon of an LLM's untruthful response using a large set of 220 handcrafted linguistic features. We focus on GPT-3 models and find that the linguistic profiles of responses are similar across model sizes. That is, how varying-sized LLMs respond to given prompts stays similar on the linguistic properties level. We expand upon this finding by training support vector machines that rely only upon the stylistic components of model responses to classify the truthfulness of statements. Though the dataset size limits our current findings, we present promising evidence that truthfulness detection is possible without evaluating the content itself.
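As a minimal sketch of this setup (assuming the 220 stylistic features have already been extracted into a matrix; the synthetic data below is only a placeholder), a support vector classifier over response-level feature vectors might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row of 220 handcrafted linguistic features per model response (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 220))
y = rng.integers(0, 2, size=500)          # 1 = truthful statement, 0 = untruthful

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```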
https://arxiv.org/abs/2305.15875
Large language models (large LMs) are susceptible to producing text with hallucinated content. Self-contradiction, where the LM generates two contradictory sentences within the same context, is an important form of hallucination. In this work, we present a comprehensive analysis on self-contradiction for state-of-the-art, instruction-tuned LMs, including evaluation, detection, and mitigation. To effectively trigger self-contradictions, we design a framework that constrains LMs to generate appropriate sentence pairs. Our evaluation on these sentence pairs reveals that self-contradictions occur frequently across different LMs for both famous and lesser-known topics. Next, we prompt the LMs to detect self-contradictions. Our results indicate that ChatGPT and GPT-4 are able to accurately identify self-contradictions, while Vicuna-13B struggles to do so. For example, with our best prompting method, ChatGPT achieves 91.0% precision and 80.5% recall on the sentence pairs generated by itself. To automatically mitigate self-contradictions, we develop an iterative algorithm that prompts the LMs to remove the detected self-contradictions from the generated text. Our algorithm successfully revises the text such that self-contradictions are significantly reduced, while maintaining its fluency and informativeness. Importantly, our entire pipeline of triggering, detecting, and mitigating self-contradictions is applicable to black-box LMs and does not require any external grounded knowledge.
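A minimal sketch of the detect-and-revise idea, with `ask_llm` as a hypothetical black-box chat call and prompts that only approximate the paper's prompting method:

```python
def mitigate_self_contradictions(ask_llm, text, max_rounds=3):
    """Hypothetical sketch of the iterative mitigation loop: ask the LM whether the
    text contradicts itself, and if so prompt it to rewrite the text without the
    contradiction, repeating until none is detected or the round budget runs out."""
    for _ in range(max_rounds):
        verdict = ask_llm(
            "Do any two sentences in the following text contradict each other? "
            "Answer YES or NO, then quote the pair.\n\n" + text)
        if verdict.strip().upper().startswith("NO"):
            break
        text = ask_llm(
            "Rewrite the text so the contradiction is removed while keeping it "
            "fluent and informative:\n\n" + text +
            "\n\nDetected contradiction: " + verdict)
    return text
```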
https://arxiv.org/abs/2305.15852
Large language models (LLMs) providing generative AI have become popular to support software engineers in creating, summarizing, optimizing, and documenting source code. It is still unknown how LLMs can support control engineers using typical control programming languages in programming tasks. Researchers have explored GitHub Copilot or DeepMind AlphaCode for source code generation but did not yet tackle control logic programming. The contribution of this paper is an exploratory study, for which we created 100 LLM prompts in 10 representative categories to analyze control logic generation for PLCs and DCS from natural language. We tested the prompts by generating answers with ChatGPT using the GPT-4 LLM. It generated syntactically correct IEC 61131-3 Structured Text code in many cases and demonstrated useful reasoning skills that could boost control engineer productivity. Our prompt collection is the basis for a more formal LLM benchmark to test and compare such models for control logic generation.
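For illustration, a hypothetical prompt of the kind such a collection might contain, wrapped in a helper that delegates to any chat-LLM client (the `chat` callable is a placeholder, not a specific API):

```python
# Hypothetical example of one natural-language prompt asking for
# IEC 61131-3 Structured Text; not taken from the paper's prompt collection.
PROMPT = (
    "Write IEC 61131-3 Structured Text for a PLC program that starts a motor "
    "when the start button is pressed, stops it when the stop button is pressed, "
    "and latches the running state."
)

def generate_control_logic(chat, prompt=PROMPT):
    """`chat` is a placeholder for any chat-LLM client (e.g. a GPT-4 wrapper)
    that takes a prompt string and returns the model's answer as a string."""
    return chat(
        "You are a control engineer. Answer with compilable Structured Text only.\n"
        + prompt
    )
```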
https://arxiv.org/abs/2305.15809
The paper speculates about how ChatGPT-like systems can support the field of automated service composition and identifies new research areas to explore in order to take advantage of such tools in the field of service-oriented composition.
https://arxiv.org/abs/2305.15788
Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs as a source of heuristic guidance for other agents (AI planners) in their planning tasks. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the heuristic mode show more promise. In the heuristic mode, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.
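A compact sketch of the verifier back-prompting loop mentioned above, with `llm` and `verify_plan` as hypothetical placeholders (e.g., `verify_plan` standing in for an external plan validator):

```python
def back_prompted_planning(llm, verify_plan, task_description, max_rounds=5):
    """Hypothetical sketch: an external verifier checks each candidate plan and
    its error message is appended to the next prompt, back-prompting the LLM
    until the plan validates or the round budget is exhausted."""
    prompt = f"Produce a step-by-step plan for the following task:\n{task_description}"
    for _ in range(max_rounds):
        plan = llm(prompt)
        valid, error = verify_plan(plan)          # e.g. a sound planner / plan validator
        if valid:
            return plan
        prompt = (f"{prompt}\n\nPrevious plan:\n{plan}\n"
                  f"Verifier feedback: {error}\nRevise the plan.")
    return None
```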
https://arxiv.org/abs/2305.15771
Recent years have seen increasing concerns about the private inference of NLP services and Transformer models. However, existing two-party privacy-preserving methods solely consider NLU scenarios, while the private inference of text generation such as translation, dialogue, and code completion remains unsolved. Besides, when migrated to NLG models, existing privacy-preserving methods perform poorly in terms of inference speed, and suffer from the convergence problem during the training stage. To address these issues, we propose MERGE, a fast private text generation framework for Transformer-based language models. Specifically, MERGE reuses the output hidden state as the word embedding to bypass the embedding computation, and reorganizes the linear operations in the Transformer module to accelerate the forward procedure. Based on these two optimizations, extensive experiments show that MERGE can achieve a 26.5x speedup under the sequence length 512 and reduce communication by 80\%, with up to a 10x speedup over existing state-of-the-art models.
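A toy sketch of the first optimization — feeding the output hidden state back as the next input instead of sampling a token and looking up its embedding — ignoring the secure-computation machinery entirely; the stand-in model below is an assumption for illustration only:

```python
import numpy as np

def embedding_bypass_decode(step_fn, lm_head, h, n_steps):
    """Toy sketch: rather than sampling a token and re-embedding it at every step
    (costly under secure two-party computation), the previous output hidden state
    is fed straight back in as the next input; tokens are decoded only for output."""
    token_ids = []
    for _ in range(n_steps):
        h = step_fn(h)                                   # Transformer forward on hidden state
        token_ids.append(int(np.argmax(lm_head @ h)))    # greedy token for the client
    return token_ids

# Usage with a stand-in "model": a fixed nonlinear map as the forward step
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) / 8.0
lm_head = rng.normal(size=(1000, 64))
ids = embedding_bypass_decode(lambda h: np.tanh(W @ h), lm_head, rng.normal(size=64), 5)
```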
https://arxiv.org/abs/2305.15769
Recent years have seen increasing concerns about the unsafe response generation of large-scale dialogue systems, where agents will learn offensive or biased behaviors from the real-world corpus. Some methods have been proposed to address the above issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. Besides, neglecting to provide safe responses (e.g., simply replacing them with templates) causes dialogues to lose information. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potential safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and distributed in the tail. Extensive experiments in chitchat and task-oriented dialogues show that our TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
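A minimal sketch of the clustering-plus-sharpened-sampling idea, under the assumption that responses are already embedded as vectors and that unsafe responses concentrate in small tail clusters; function and parameter names are illustrative, not TEMP's actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def temp_style_pseudo_labels(response_embeddings, n_clusters=5, n_labels=3, sharpen=2.0):
    """Hypothetical sketch: cluster candidate responses and sample several
    pseudo-safe labels, sharpening the sampling distribution toward large (head)
    clusters since unsafe responses are assumed to be few and in the tail."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(response_embeddings)
    sizes = np.bincount(km.labels_, minlength=n_clusters).astype(float)
    probs = sizes ** sharpen                      # sharpened toward head clusters
    probs /= probs.sum()
    rng = np.random.default_rng(0)
    chosen_clusters = rng.choice(n_clusters, size=n_labels, replace=False, p=probs)
    # return one representative response index per sampled cluster
    return [int(np.where(km.labels_ == c)[0][0]) for c in chosen_clusters]
```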
https://arxiv.org/abs/2305.15757
An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B--13B), data sources, and imitation data amounts (0.3M--150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models -- they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT's style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.
https://arxiv.org/abs/2305.15717
As large language model (LLM) technology, represented by generative pretrained transformers (GPTs), continues to develop, classic scenarios in many fields are re-emerging with new opportunities. This paper takes ChatGPT as the modeling object, incorporates LLM technology into the typical book resource understanding and recommendation scenario for the first time, and puts it into practice. By building a ChatGPT-like book recommendation system (BookGPT) framework based on ChatGPT, this paper attempts to apply ChatGPT to recommendation modeling for three typical tasks, book rating recommendation, user rating recommendation, and book summary recommendation, and explores the feasibility of LLM technology in book recommendation scenarios. At the same time, based on different evaluation schemes for book recommendation tasks and the existing classic recommendation models, this paper discusses the advantages and disadvantages of BookGPT in book recommendation scenarios and analyzes the opportunities and improvement directions for subsequent LLMs in these scenarios.
https://arxiv.org/abs/2305.15673
Tasks involving text generation based on multiple input texts, such as multi-document summarization, long-form question answering and contemporary dialogue applications, challenge models for their ability to properly consolidate partly-overlapping multi-text information. However, these tasks entangle the consolidation phase with the often subjective and ill-defined content selection requirement, impeding proper assessment of models' consolidation capabilities. In this paper, we suggest revisiting the sentence union generation task as an effective well-defined testbed for assessing text consolidation capabilities, decoupling the consolidation challenge from subjective content selection. To support research on this task, we present refined annotation methodology and tools for crowdsourcing sentence union, create the largest union dataset to date and provide an analysis of its rich coverage of various consolidation aspects. We then propose a comprehensive evaluation protocol for union generation, including both human and automatic evaluation. Finally, as baselines, we evaluate state-of-the-art language models on the task, along with a detailed analysis of their capacity to address multi-text consolidation challenges and their limitations.
https://arxiv.org/abs/2305.15605
Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock's knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with ($\epsilon=0.147, \delta=10^{-6}$)-differential privacy vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial APIs.
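The core aggregation step can be sketched as a noisy vote over the labels proposed by differently-prompted LLMs; this omits the transfer into a single public prompt and uses an illustrative Laplace-noise calibration rather than the paper's exact privacy accounting:

```python
import numpy as np

def noisy_vote(teacher_labels, n_classes, epsilon=1.0, rng=None):
    """Sketch of a differentially private vote: each differently-prompted LLM
    ("teacher") casts one label, Laplace noise is added to the per-class counts,
    and the noisy argmax is released. Scale 2/epsilon reflects that changing one
    teacher's vote moves two histogram counts by 1 each (L1 sensitivity 2)."""
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_labels, minlength=n_classes).astype(float)
    counts += rng.laplace(scale=2.0 / epsilon, size=n_classes)
    return int(np.argmax(counts))

# Usage: 10 teachers voting over 2 classes (e.g. sst2-style sentiment labels)
print(noisy_vote(np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]), n_classes=2, epsilon=1.0))
```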
https://arxiv.org/abs/2305.15594
Translating natural language sentences to first-order logic (NL-FOL translation) is a longstanding challenge in the NLP and formal logic literature. This paper introduces LogicLLaMA, a LLaMA-7B model fine-tuned for NL-FOL translation using LoRA on a single GPU. LogicLLaMA is capable of directly translating natural language into FOL rules, which outperforms GPT-3.5. LogicLLaMA is also equipped to correct FOL rules predicted by GPT-3.5, and can achieve similar performance as GPT-4 with a fraction of the cost. This correction ability was achieved by a novel supervised fine-tuning (SFT) + reinforcement learning with human feedback (RLHF) framework, which initially trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using a FOL verifier as the reward model. To train LogicLLaMA, we present MALLS (large language $\textbf{M}$odel gener$\textbf{A}$ted N$\textbf{L}$-FO$\textbf{L}$ pair$\textbf{S}$), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created by implementing a pipeline that prompts GPT-4 for pairs, and dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Codes, weights, and data are available at $\href{this https URL}{\small \text{this https URL}}$.
https://arxiv.org/abs/2305.15541
Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of games as a test bed for LLMs.
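A minimal sketch of the DAG traversal, with `llm` as a hypothetical black-box call and the question graph supplied as plain dictionaries (not SPRING's actual data structures); it requires Python 3.9+ for `graphlib`:

```python
from graphlib import TopologicalSorter

def spring_style_traversal(llm, questions, edges, final_node, context):
    """Hypothetical sketch: `questions` maps node -> question text and `edges`
    maps node -> set of prerequisite nodes. Nodes are answered in topological
    order, with each parent's answer injected into the child's prompt; the
    answer at `final_node` is what gets mapped to an environment action."""
    order = TopologicalSorter({n: edges.get(n, set()) for n in questions}).static_order()
    answers = {}
    for node in order:
        parents = "\n".join(f"{p}: {answers[p]}" for p in edges.get(node, ()))
        answers[node] = llm(f"{context}\n{parents}\nQuestion: {questions[node]}")
    return answers[final_node]   # final answer -> environment action
```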
https://arxiv.org/abs/2305.15486