Large Language Models (LLMs) still struggle with complex reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents to foster diverse thoughts and discussion for improved consensus. ReConcile enhances the reasoning capabilities of LLMs by holding multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. This discussion prompt enables each agent to revise their responses in light of insights from other agents. Once a consensus is reached and the discussion ends, ReConcile determines the final answer by leveraging the confidence of each agent in a weighted voting scheme. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Our experimental results on various benchmarks demonstrate that ReConcile significantly enhances the reasoning performance of the agents (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and also outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 itself as one of the agents in ReConcile and demonstrate that its initial performance also improves by absolute 10.0% through discussion and feedback from other agents. Finally, we also analyze the accuracy after every round and observe that ReConcile achieves better and faster consensus between agents, compared to a multi-agent debate baseline. Our code is available at: this https URL
Large Language Models (LLMs) still struggle with complex reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round-table conference among diverse LLM agents to foster diverse thinking and discussion for better consensus. ReConcile enhances the reasoning capabilities of LLMs through multiple rounds of discussion, learning to convince other agents to improve their answers, and a confidence-weighted voting mechanism. In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) the grouped answers and explanations generated by each agent in the previous round, (b) their uncertainties, and (c) demonstrations of answer-rectifying human explanations used to convince other agents. This discussion prompt enables each agent to revise its answer in light of insights from the other agents. Once consensus is reached and the discussion ends, ReConcile determines the final answer through a confidence-weighted voting scheme over the agents. We implement ReConcile with ChatGPT, Bard, and Claude2 as the three agents. Experimental results on various benchmarks show that ReConcile substantially improves the agents' reasoning performance (both individually and as a team), surpassing prior single-agent and multi-agent baselines by 7.7% and also outperforming GPT-4 on some of these datasets. We also experiment with GPT-4 as one of the agents in ReConcile and show that its initial performance improves by an absolute 10.0% through discussion and feedback from the other agents. Finally, we analyze the accuracy after each round and observe that ReConcile reaches better and faster consensus between agents than a multi-agent debate baseline. Our code is available at: this https URL
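Below is a minimal sketch of the confidence-weighted voting step described in the abstract; the agent interface, the self-reported confidence scale, and any calibration are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

def confidence_weighted_vote(responses):
    """Aggregate (answer, confidence) pairs from multiple agents.

    `responses` is a list of (answer, confidence) tuples, one per agent, where
    confidence is a self-reported value in [0, 1]; the exact rescaling used by
    ReConcile is not reproduced here."""
    scores = defaultdict(float)
    for answer, confidence in responses:
        scores[answer] += confidence  # each agent votes with its confidence as the weight
    return max(scores, key=scores.get)

# Hypothetical final-round outputs from three agents (e.g., ChatGPT, Bard, Claude2).
round_outputs = [("B", 0.9), ("A", 0.6), ("B", 0.7)]
print(confidence_weighted_vote(round_outputs))  # -> "B"
```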
https://arxiv.org/abs/2309.13007
Assurance cases can be used to argue for the safety of products in safety engineering. In safety-critical areas, the construction of assurance cases is indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases by incorporating formal methods, making automatic reasoning about assurance cases possible. We present the Trustworthiness Derivation Tree Analyzer (Trusta), a desktop application designed to automatically construct and verify TDTs. The tool has a built-in Prolog interpreter in its backend and is supported by the constraint solvers Z3 and MONA; it can therefore solve constraints over logical formulas involving arithmetic, sets, Horn clauses, etc. Trusta also utilizes large language models to make the creation and evaluation of assurance cases more convenient. It allows for interactive human examination and modification. We evaluated top language models like ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests showed a 50%-80% similarity between machine-generated and human-created cases. In addition, Trusta can extract formal constraints from natural-language text, facilitating an easier interpretation and validation process. This extraction is subject to human review and correction, blending the best of automated efficiency with human insight. To our knowledge, this marks the first integration of large language models into automatically creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. Through several industrial case studies, Trusta has proven able to quickly find subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
Assurance cases can be used to argue for the safety of products in safety engineering, and in safety-critical areas their construction is indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases by incorporating formal methods, making automatic reasoning about assurance cases possible. We present the Trustworthiness Derivation Tree Analyzer (Trusta), a desktop application designed to automatically construct and verify TDTs. The tool has a built-in Prolog interpreter in its backend and is supported by the constraint solvers Z3 and MONA, so it can solve constraints over logical formulas involving arithmetic, sets, Horn clauses, and so on. Trusta also uses large language models to make the creation and evaluation of assurance cases more convenient, and it allows interactive human examination and modification. We evaluated leading language models such as ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases; our tests showed a 50%-80% similarity between machine-generated and human-created cases. In addition, Trusta can extract formal constraints from natural-language text, facilitating an easier interpretation and validation process. This extraction is subject to human review and correction, combining automated efficiency with human insight. To our knowledge, this is the first integration of large language models into automatically creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. Through several industrial case studies, Trusta has proven able to quickly find subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
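To illustrate the kind of constraint solving a Z3 backend makes possible, here is a small, self-contained example using the z3-solver Python bindings; the safety constraint itself is invented and does not come from the Trusta case studies.

```python
from z3 import Ints, Solver, sat

# Hypothetical safety constraint: response_time plus a margin must stay under a deadline.
response_time, margin, deadline = Ints("response_time margin deadline")

s = Solver()
s.add(deadline == 100, margin >= 10, response_time >= 0)
s.add(response_time + margin <= deadline)  # the claim whose satisfiability we check

if s.check() == sat:
    print("constraint satisfiable, e.g.:", s.model())
else:
    print("constraint unsatisfiable")
```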
https://arxiv.org/abs/2309.12941
Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
Task-oriented dialogue (TOD) systems help users carry out various tasks through multi-turn dialogues, but Large Language Models (LLMs) often struggle to understand these complex contexts. In this study, we propose a novel 'Self-Explanation' prompting strategy to enhance LLMs' comprehension in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before executing the task, thereby improving performance across various dialogue-centric tasks. Experimental results on six benchmark datasets show that our method consistently outperforms other zero-shot prompts and matches or exceeds the effectiveness of few-shot prompts, indicating its potential as a powerful tool for enhancing LLMs' comprehension in complex dialogue tasks.
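A rough sketch of how such a task-agnostic 'Self-Explanation' prompt could be assembled; the instruction wording is a guess at the spirit of the method, not the paper's exact prompt.

```python
def build_self_explanation_prompt(dialogue_turns, task_instruction):
    """Ask the model to explain each utterance before executing the downstream task.

    `dialogue_turns` is a list of (speaker, utterance) pairs; the instruction text
    below is illustrative only."""
    history = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in dialogue_turns)
    return (
        "Dialogue:\n" + history + "\n\n"
        "First, explain in one sentence what each utterance above is trying to convey.\n"
        "Then, using those explanations, " + task_instruction
    )

prompt = build_self_explanation_prompt(
    [("User", "I need a table for four tonight."), ("System", "Which area do you prefer?")],
    "predict the user's current goal (dialogue state).",
)
print(prompt)
```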
https://arxiv.org/abs/2309.12940
As software projects progress, the quality of code assumes paramount importance, as it affects the reliability, maintainability, and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort revising their code to improve its quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprising a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings, and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates that pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes that may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM is able to reduce false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort.
As software projects progress, code quality becomes increasingly important because it affects the reliability, maintainability, and security of software. Static analysis tools are therefore widely used in developer workflows to detect code quality issues. However, developers must spend extra effort revising their code to address the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool called CORE (short for COde REvisions), built from a pair of LLMs acting as a proposer and a ranker. Static analysis tool providers recommend ways to mitigate tool warnings, and developers follow these recommendations to revise their code. CORE's proposer LLM takes the same recommendations and applies them to generate candidate code revisions; candidates that pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes that go undetected by the static analysis. The ranker LLM evaluates the proposer's changes using a rubric that closely follows the acceptance criteria a developer would enforce, and CORE uses the ranker's scores to order the candidate revisions before presenting them to the developer. CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer, and the ranker LLM reduced false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort.
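The proposer/ranker loop can be sketched as follows; `propose`, `passes_static_checks`, and `rank` are caller-supplied placeholders standing in for the proposer LLM, the static-analysis re-check, and the ranker LLM, and none of this mirrors CORE's actual interfaces.

```python
def revise_file(source_code, warning, recommendation,
                propose, passes_static_checks, rank, n_candidates=5):
    """CORE-style proposer/ranker pipeline (sketch, not the released tool).

    propose(source, warning, rec)          -> candidate revision (proposer LLM)
    passes_static_checks(candidate, warn)  -> bool (static analyzer re-run)
    rank(source, candidate, warning)       -> score (ranker LLM rubric)
    """
    # Generate several candidate revisions from the same tool recommendation.
    candidates = [propose(source_code, warning, recommendation) for _ in range(n_candidates)]
    # Keep only revisions that silence the warning without new static findings.
    surviving = [c for c in candidates if passes_static_checks(c, warning)]
    # Rank survivors with a rubric approximating a reviewer's acceptance criteria.
    ranked = sorted(surviving, key=lambda c: rank(source_code, c, warning), reverse=True)
    return ranked  # best-ranked revision first, shown to the developer
```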
https://arxiv.org/abs/2309.12938
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as deepfake texts. There are currently over 11K text generation models in the Hugging Face model repository. As such, users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method to determine whether a given text is a deepfake text is desired, i.e., a Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as Authorship Attribution (AA), in a multi-class setting: not only determining whether a given text is a deepfake text, but also pinpointing which LLM is the author. We propose TopRoBERTa to improve existing AA solutions by capturing more linguistic patterns in deepfake texts, including a Topological Data Analysis (TDA) layer in the RoBERTa model. We show the benefits of having a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets by extracting TDA features from the reshaped pooled_output of RoBERTa as input. We use RoBERTa to capture contextual representations (i.e., semantic and syntactic linguistic features), while using TDA to capture the shape and structure of the data (i.e., linguistic structures). Finally, TopRoBERTa outperforms the vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7% increase in Macro F1 score.
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are hard to distinguish from human-written texts; we refer to such LLM-generated texts as deepfake texts. There are currently over 11K text generation models in the Hugging Face model repository, so users with malicious intent can easily use these open-source LLMs to generate harmful texts and misinformation at scale. To mitigate this problem, a computational method is desired to determine whether a given text is a deepfake text, i.e., a Turing Test (TT). In particular, in this work we study the more general version of the problem, known as Authorship Attribution (AA), in a multi-class setting: not only determining whether a given text is a deepfake text, but also pinpointing which LLM is the author. We propose TopRoBERTa to improve existing AA solutions by capturing more linguistic patterns in deepfake texts, adding a Topological Data Analysis (TDA) layer to the RoBERTa model. We show the benefits of a TDA layer when dealing with noisy, imbalanced, and heterogeneous datasets by extracting TDA features from the reshaped pooled_output of RoBERTa as input. We use RoBERTa to capture contextual representations (semantic and syntactic linguistic features), while using TDA to capture the shape and structure of the data (linguistic structures). Finally, TopRoBERTa outperforms the vanilla RoBERTa on 2 of 3 datasets, achieving up to a 7% increase in Macro F1 score.
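The idea of reshaping RoBERTa's pooled_output into a point cloud and extracting topological features might look roughly like the sketch below; the reshape dimensions, the use of the ripser library, and the summary statistics are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
from ripser import ripser  # pip install ripser

def tda_features(pooled_output, rows=24):
    """Reshape a (hidden_size,) pooled vector into a point cloud and summarize its
    persistence diagrams; dimensions and statistics here are illustrative only."""
    cloud = np.asarray(pooled_output, dtype=float).reshape(rows, -1)  # e.g. 768 -> 24 x 32
    diagrams = ripser(cloud, maxdim=1)["dgms"]                        # H0 and H1 diagrams
    feats = []
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]                          # drop the infinite H0 bar
        lifetimes = finite[:, 1] - finite[:, 0] if len(finite) else np.zeros(1)
        feats += [lifetimes.sum(), lifetimes.max(), float(len(finite))]
    return np.array(feats)  # would be concatenated with RoBERTa features downstream

# Example with a random stand-in for RoBERTa's pooled_output.
print(tda_features(np.random.randn(768)))
```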
https://arxiv.org/abs/2309.12934
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
Affect recognition, covering emotions, moods, and feelings, plays a key role in human communication. In conversational artificial intelligence (AI), the ability to recognise and respond to human affective cues is critical for creating engaging and empathetic interactions. This paper examines the capacity of large language models (LLMs) to recognise human affect in conversations, focusing on both open-domain chit-chat dialogues and task-oriented dialogues. Using three diverse datasets, IEMOCAP, EmoWOZ, and DAIC-WOZ, which span dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL), as well as their model capacities through task-specific fine-tuning. In addition, this paper considers the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
https://arxiv.org/abs/2309.12881
High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradients and hinder the optimization process. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.
High-quality text embeddings are pivotal for improving semantic textual similarity (STS) tasks, which are crucial components of Large Language Model (LLM) applications. However, a common challenge facing existing text embedding models is the vanishing gradient problem, mainly due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This new approach effectively mitigates the adverse effects of the cosine function's saturation zone, which can impede gradients and hinder the optimization process. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. We also examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks, including short-text STS, long-text STS, and domain-specific STS. The results show that AnglE outperforms state-of-the-art STS models that ignore the cosine saturation zone, demonstrating AnglE's ability to generate high-quality text embeddings and the usefulness of angle optimization in STS.
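One plausible reading of 'angle optimization in a complex space' is sketched below: each embedding is split into real and imaginary halves, and the angle of the complex ratio between two texts serves as a dissimilarity. AnglE's actual objective differs in detail; treat this purely as an assumption-laden illustration.

```python
import torch

def angle_difference(u, v):
    """Treat the two halves of each embedding as real/imaginary parts and return
    the mean absolute angle of u / v in the complex plane (illustrative only)."""
    a, b = u.chunk(2, dim=-1)   # u = a + b*i
    c, d = v.chunk(2, dim=-1)   # v = c + d*i
    # (a + bi) / (c + di) = [(ac + bd) + (bc - ad)i] / (c^2 + d^2)
    real = a * c + b * d
    imag = b * c - a * d
    return torch.atan2(imag, real).abs().mean(dim=-1)  # small angle => similar texts

u, v = torch.randn(4, 768), torch.randn(4, 768)
print(angle_difference(u, v))  # one angle-based dissimilarity per pair
```

Because atan2 does not saturate the way cosine does near its extremes, an objective built on such angle differences keeps non-vanishing gradients where a cosine objective would flatten out, which is the intuition the abstract appeals to.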
https://arxiv.org/abs/2309.12871
Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora. However, generic NMT systems have demonstrated poor performance on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed, which often lead to better translation quality than generic NMT systems. While there has been some continuous progress in NMT for English and other European languages, domain adaptation in Arabic has received little attention in the literature. The current study, therefore, aims to explore the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain: financial news articles. To this end, we carefully developed a parallel corpus for Arabic-English (AR-EN) translation in the financial domain for benchmarking different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that the fine-tuning is successful using just a few well-aligned in-domain AR-EN segments. The quality of ChatGPT translation was superior to that of the other models based on automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT towards financial-domain transfer learning. To contribute to research in domain translation, we made our datasets and fine-tuned models available at this https URL.
Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora, but generic NMT systems perform poorly on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed that often yield better translation quality than generic NMT systems. While there has been steady progress in NMT for English and other European languages, domain adaptation for Arabic has received little attention in the literature. This study therefore explores the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain: financial news articles. To this end, we carefully developed a parallel Arabic-English (AR-EN) corpus in the financial domain for benchmarking different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results show that fine-tuning is successful using only a few well-aligned in-domain AR-EN segments, and that the quality of ChatGPT translations is superior to that of the other models according to automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT for financial-domain transfer learning. To contribute to research on domain translation, we make our datasets and fine-tuned models available at this https URL.
https://arxiv.org/abs/2309.12863
As a common approach to learning English, reading comprehension primarily entails reading articles and answering related questions. However, the complexity of designing effective exercises results in students encountering standardized questions, making it challenging to align with individualized learners' reading comprehension ability. By leveraging the advanced capabilities offered by large language models, exemplified by ChatGPT, this paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that can predict learners' reading comprehension abilities using their historical data as the foundation for generating questions at an appropriate level of difficulty. Second, a series of new ChatGPT prompt patterns is proposed to address two key aspects of reading comprehension objectives: question generation, and automated evaluation. These patterns further improve the quality of generated questions. Finally, by integrating personalized ability and reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results demonstrate that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
As a common approach to learning English, reading comprehension mainly involves reading articles and answering related questions. However, the complexity of designing effective exercises means that students encounter standardized questions, making it hard to match individual learners' reading comprehension ability. By leveraging the advanced capabilities of large language models, exemplified by ChatGPT, this paper presents a personalized support system for reading comprehension, called ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that predicts learners' reading comprehension abilities from their historical data, as the basis for generating questions at an appropriate level of difficulty. Second, we propose a series of new ChatGPT prompt patterns to address two key aspects of reading comprehension objectives, question generation and automated evaluation; these patterns further improve the quality of the generated questions. Finally, by integrating personalized ability prediction and the reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results show that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
https://arxiv.org/abs/2309.12808
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ an LLM to generate reasoning paths and plans, and utilize IR to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, the Information Retriever (IR) is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misguided by irrelevant knowledge returned by IR. These inaccuracies, accumulated through the iterative interaction between IR and the LLM, severely degrade effectiveness in the end. To overcome the above barriers, in this paper, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), including an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning operates by masking the previous reasoning path and generated queries from the LLM, encouraging the LLM to generate its chain of thought from scratch in each iteration. This approach enables the LLM to break the shackles built by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% gain in answer accuracy).
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) is a widely discussed category that requires seamless integration between LLMs and the retrieval of external knowledge. Existing methods use an LLM to generate reasoning paths and plans, and use information retrieval (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On the one hand, the information retriever is hindered by the low quality of the queries generated by the LLM; on the other hand, the LLM is easily misled by irrelevant knowledge returned by IR. These inaccuracies accumulate through the iterative interaction between IR and the LLM and ultimately ruin end-to-end effectiveness. To overcome these barriers, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), consisting of an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning works by masking the previous reasoning path and generated queries from the LLM, encouraging it to generate a chain of thought from scratch in each iteration. This allows the LLM to break free of the shackles built by earlier misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our method is evaluated on three highly recognized public multi-hop question answering datasets and outperforms the state of the art on most metrics (achieving a 10%-12% gain in answer accuracy).
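A schematic of the masking-and-assessment loop as the abstract describes it; all callables are hypothetical placeholders, and the control flow is a reading of the described behavior rather than the released pipeline.

```python
def furthest_reasoning(question, generate_plan_candidates, assess_plan,
                       retrieve, answer_or_none, max_iters=5):
    """FuRePA-style loop (sketch): in each iteration the LLM plans from scratch
    (previous chains of thought and queries are masked out), a trained Plan
    Assessor picks one candidate plan, and retrieval is driven by that plan."""
    evidence = []
    for _ in range(max_iters):
        # The question and retrieved evidence stay visible, but prior reasoning
        # paths and prior queries are deliberately withheld from the LLM.
        plans = generate_plan_candidates(question, evidence)
        plan = max(plans, key=assess_plan)        # Plan Assessor selects one plan
        evidence.extend(retrieve(plan))           # IR over the chosen plan's queries
        answer = answer_or_none(question, evidence)
        if answer is not None:
            return answer
    return answer_or_none(question, evidence)
```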
https://arxiv.org/abs/2309.12767
Lately, Large Language Models have been widely used in code generation. GPT4 is considered the most potent Large Language Model from OpenAI. In this paper, we examine GPT3.5 and GPT4 as coding assistants. More specifically, we have constructed appropriate tests to check whether the two systems can a) answer typical questions that can arise during code development, b) produce reliable code, and c) contribute to code debugging. The test results are impressive. The performance of GPT4 is outstanding and signals an increase in the productivity of programmers and a reorganization of software development procedures based on these new tools.
Recently, Large Language Models have been widely used for code generation, and GPT4 is considered the most capable Large Language Model from OpenAI. In this paper, we examine GPT3.5 and GPT4 as coding assistants. More specifically, we designed appropriate tests to check whether the two systems can a) answer typical questions that arise during code development, b) produce reliable code, and c) help with code debugging. The test results are impressive: GPT4's performance is outstanding and points to increased programmer productivity and a reorganization of software development procedures around these new tools.
https://arxiv.org/abs/2309.12732
Human knowledge is subject to uncertainties, imprecision, incompleteness and inconsistencies. Moreover, the meaning of many everyday terms is dependent on the context. That poses a huge challenge for the Semantic Web. This paper introduces work on an intuitive notation and model for defeasible reasoning with imperfect knowledge, and relates it to previous work on argumentation theory. PKN is to N3 as defeasible reasoning is to deductive logic. Further work is needed on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing upon the AIF ontology for inspiration. The paper closes with observations on symbolic approaches in the era of large language models.
Human knowledge is subject to uncertainty, imprecision, incompleteness, and inconsistency; moreover, the meaning of many everyday terms depends on context. This poses a huge challenge for the Semantic Web. This paper introduces work on an intuitive notation and model for defeasible reasoning with imperfect knowledge and relates it to previous work on argumentation theory. PKN is to N3 as defeasible reasoning is to deductive logic. Further work is needed on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing on the AIF ontology for inspiration. The paper closes with observations on symbolic approaches in the era of large language models.
https://arxiv.org/abs/2309.12731
Large language models (LLMs) have had a huge impact on society due to their impressive capabilities and vast knowledge of the world. Various applications and tools have been created that allow users to interact with these models in a black-box scenario. However, one limitation of this scenario is that users cannot modify the internal knowledge of the model, and the only way to add or modify internal knowledge is by explicitly mentioning it to the model during the current interaction. This learning process is called in-context training, and it refers to training that is confined to the user's current session or context. In-context learning has significant applications, but also has limitations that are seldom studied. In this paper, we present a study that shows how the model can suffer from interference between information that continually flows in the context, causing it to forget previously learned knowledge, which can reduce the model's performance. Along with showing the problem, we propose an evaluation benchmark based on the bAbI dataset.
Large language models (LLMs) have had a huge impact on society because of their impressive capabilities and vast knowledge of the world. Various applications and tools have been created that allow users to interact with these models in a black-box scenario. One limitation of this scenario, however, is that users cannot modify the model's internal knowledge; the only way to add or modify internal knowledge is to state it explicitly to the model during the current interaction. This learning process is called in-context training, and it refers to training confined to the user's current session or context. In-context learning has important applications but also limitations that are seldom studied. In this paper, we present a study showing how the model can suffer from interference between pieces of information that continually flow into the context, causing it to forget previously learned knowledge and reducing its performance. Alongside demonstrating the problem, we propose an evaluation benchmark based on the bAbI dataset.
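A toy bAbI-style probe for the interference effect could look like the following: facts about similar entities are streamed into the context, and the model is asked about an early fact after later, similar facts have accumulated. The prompt format and the `query_llm` call are assumptions, not the paper's benchmark.

```python
def build_interference_probe(n_distractors=20):
    """Return (prompt, expected_answer) for a bAbI-like location-tracking probe.

    The first fact is the one being queried; subsequent facts about other people
    act as interfering in-context information."""
    facts = ["Mary moved to the kitchen."]
    rooms = ["garden", "office", "hallway", "bathroom", "bedroom"]
    people = ["John", "Sandra", "Daniel"]
    for i in range(n_distractors):
        facts.append(f"{people[i % len(people)]} moved to the {rooms[i % len(rooms)]}.")
    prompt = "\n".join(facts) + "\nQuestion: Where is Mary?\nAnswer:"
    return prompt, "kitchen"

prompt, gold = build_interference_probe()
# response = query_llm(prompt)  # hypothetical black-box LLM call
# Accuracy over many such probes, as a function of n_distractors, exposes interference.
```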
https://arxiv.org/abs/2309.12727
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
Neural language models have exhibited outstanding performance on a range of downstream tasks, yet there is limited understanding of the extent to which these models internalize syntactic knowledge, so various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. These sentences were manually extracted from linguistics textbooks, handbooks, and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments from journal articles), the latter categorized into 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results show that several models surpass human performance on the in-domain data, while no model exceeds human performance on the out-of-domain data. Error analyses by linguistic phenomenon further reveal that although neural language models handle local syntactic dependencies such as argument structure well, their performance degrades when confronted with long-distance syntactic dependencies such as verbal agreement and NPI licensing.
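One simple way to probe acceptability with a causal language model is to compare sentence log-likelihoods, as sketched below; the checkpoint name is illustrative and this is not necessarily the evaluation protocol used for JCoLA.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative Japanese causal LM checkpoint; any comparable model would do.
name = "rinna/japanese-gpt2-medium"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def sentence_logprob(sentence):
    """Approximate total log-likelihood of a sentence under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)          # loss = mean NLL over shifted tokens
    return -out.loss.item() * (ids.shape[1] - 1)

# A higher (possibly length-normalized) log-probability can be used as a crude
# signal of acceptability, or sentences can be compared pairwise.
print(sentence_logprob("猫が魚を食べた。"))
```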
https://arxiv.org/abs/2309.12676
Answering numerical questions over hybrid content from given tables and text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy called the Hybrid prompt strategy and Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt the model to develop the ability of retrieval thinking when dealing with hybrid data. Our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset in the few-shot setting.
Answering numerical questions over hybrid content from given tables and text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have attracted significant attention in the NLP community, and with their emergence, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy, the Hybrid prompt strategy together with Retrieval of Thought, for TextTableQA. Through In-Context Learning, we prompt the model to develop retrieval thinking when dealing with hybrid data. Our method outperforms the fully supervised SOTA on the MultiHiertt dataset in the few-shot setting.
https://arxiv.org/abs/2309.12669
Contract review is an essential step in construction projects to prevent potential losses. However, the current methods for reviewing construction contracts lack effectiveness and reliability, leading to time-consuming and error-prone processes. While large language models (LLMs) have shown promise in revolutionizing natural language processing (NLP) tasks, they struggle with domain-specific knowledge and with addressing specialized issues. This paper presents a novel approach that leverages LLMs with construction contract knowledge to emulate the process of contract review by human experts. Our tuning-free approach incorporates construction contract domain knowledge to enhance language models for identifying construction contract risks. The use of natural language when building the domain knowledge base facilitates practical implementation. We evaluated our method on real construction contracts and achieved solid performance. Additionally, we investigated how large language models employ logical thinking during the task and provide insights and recommendations for future research.
Contract review is an essential step in construction projects to prevent potential losses. However, current methods for reviewing construction contracts lack effectiveness and reliability, leading to time-consuming and error-prone processes. While large language models (LLMs) have shown promise in revolutionizing natural language processing (NLP) tasks, they struggle with domain-specific knowledge and specialized issues. This paper presents a novel approach that combines LLMs with construction contract knowledge to emulate the contract review process of human experts. Our tuning-free approach incorporates construction contract domain knowledge to enhance language models for identifying construction contract risks, and building the domain knowledge base in natural language facilitates practical implementation. We evaluated our method on real construction contracts and achieved solid performance. In addition, we investigated how large language models employ logical thinking during the task and provide insights and recommendations for future research.
https://arxiv.org/abs/2309.12626
In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) plays a key role but its current assignment process is time-consuming. We introduce DRG-LLaMA, a large language model (LLM) fine-tuned on clinical notes for improved DRG prediction. Using Meta's LLaMA as the base model, we optimized it with Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries. With an input token length of 512, DRG-LLaMA-7B achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0% and a macro-averaged Area Under the Curve (AUC) of 0.986. Impressively, DRG-LLaMA-7B surpassed previously reported leading models on this task, demonstrating a relative improvement in macro-averaged F1 score of 40.3% compared to ClinicalBERT and 35.7% compared to CAML. When DRG-LLaMA is applied to predict base DRGs and complication or comorbidity (CC) / major complication or comorbidity (MCC), the top-1 prediction accuracy reached 67.8% for base DRGs and 67.5% for CC/MCC status. DRG-LLaMA performance exhibits improvements in correlation with larger model parameters and longer input context lengths. Furthermore, usage of LoRA enables training even on smaller GPUs with 48 GB of VRAM, highlighting the viability of adapting LLMs for DRGs prediction.
In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) plays a key role, but its current assignment process is time-consuming. We introduce DRG-LLaMA, a large language model (LLM) fine-tuned on clinical notes for improved DRG prediction. Using Meta's LLaMA as the base model, we optimized it with Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries. With an input token length of 512, DRG-LLaMA-7B achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged AUC of 0.986. Impressively, DRG-LLaMA-7B surpassed previously reported leading models on this task, with relative improvements in macro-averaged F1 score of 40.3% over ClinicalBERT and 35.7% over CAML. When DRG-LLaMA is applied to predict base DRGs and complication or comorbidity (CC) / major complication or comorbidity (MCC) status, the top-1 prediction accuracy reaches 67.8% for base DRGs and 67.5% for CC/MCC status. DRG-LLaMA's performance improves with larger model parameters and longer input context lengths. Moreover, the use of LoRA enables training even on smaller GPUs with 48 GB of VRAM, highlighting the viability of adapting LLMs for DRG prediction.
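A minimal sketch of attaching LoRA adapters to a LLaMA-style model with the peft library, roughly in the spirit of the setup above; the checkpoint name, rank, target modules, label count, and the framing of DRG prediction as sequence classification are assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=738)  # assumed DRG label count

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters (and head) are trained
```

Because LoRA freezes the base weights and trains only small low-rank update matrices, the optimizer state stays tiny, which is what makes fine-tuning feasible on a single 48 GB GPU as the abstract notes.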
https://arxiv.org/abs/2309.12625
Neural language models often fail to generate diverse and informative texts, limiting their applicability in real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) from language models, we propose an alternative approach based on an observation: models primarily learn attributes within examples that are likely to cause degeneration problems. Based on this observation, we propose a new approach to prevent degeneration problems by training two models. Specifically, we first train a model that is designed to amplify undesirable patterns. We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn. Extensive experiments on two tasks, namely language modeling and dialogue generation, demonstrate the effectiveness of our approach.
Neural language models often fail to generate diverse and informative texts, which limits their applicability to real-world problems. While previous approaches have proposed to address these issues by identifying and penalizing undesirable behaviors (e.g., repetition, overuse of frequent words) in language models, we propose an alternative approach based on an observation: models primarily learn, from the training examples, the attributes that are likely to cause degeneration problems. Based on this observation, we propose a new approach that prevents degeneration by training two models. Specifically, we first train a model designed to amplify undesirable patterns; we then enhance the diversity of the second model by focusing on the patterns the first model fails to learn. Extensive experiments on two tasks, language modeling and dialogue generation, demonstrate the effectiveness of our approach.
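One plausible reading of the two-model recipe is sketched below: an 'amplifier' model is first trained to over-fit the degeneration-prone patterns, and the second model's per-token loss is then down-weighted wherever the amplifier is already confident, focusing learning on what the amplifier fails to capture. This is an interpretation of the abstract, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def reweighted_lm_loss(main_logits, amplifier_logits, targets, temperature=1.0):
    """Per-token cross-entropy for the second model, scaled down wherever the
    degeneration-amplifying first model is already confident (illustrative)."""
    # Per-token CE: inputs (batch, vocab, seq), targets (batch, seq).
    ce = F.cross_entropy(main_logits.transpose(1, 2), targets, reduction="none")
    with torch.no_grad():
        amp_probs = F.softmax(amplifier_logits / temperature, dim=-1)
        amp_conf = amp_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    weights = 1.0 - amp_conf  # focus training on what the amplifier fails to learn
    return (weights * ce).mean()

# Shapes: logits (batch, seq, vocab), targets (batch, seq).
loss = reweighted_lm_loss(torch.randn(2, 5, 100), torch.randn(2, 5, 100),
                          torch.randint(0, 100, (2, 5)))
print(loss)
```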
https://arxiv.org/abs/2309.12619
Language models (LMs) are no longer restricted to the ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica showcase a significant gap in the understanding of research papers by these aforementioned LMs as well as in generating factual textual responses. We posit that our dataset can be used to train models to automate the generation of model cards from paper text and reduce human effort in the model card curation process. The complete dataset is available on this https URL
Language models (LMs) are no longer restricted to the ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it becomes increasingly important to understand their capabilities, intended usage, and development cycle. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models, covering crucial aspects of each model such as its training configurations, datasets, biases, architecture details, and training resources. We employed annotators to extract the answers from the original papers. We further explore the capabilities of LMs in generating model cards by answering these questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica reveal a significant gap in these LMs' ability to understand research papers and to generate factual textual responses. We posit that our dataset can be used to train models that automate the generation of model cards from paper text, reducing human effort in the model card curation process. The complete dataset is available at this https URL.
https://arxiv.org/abs/2309.12616
The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions sparked increased interest in their utilization across various support tools. We investigate the utility of modern LLMs in assisting professional writers via an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing that views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to submit a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Upon analyzing the writer-LLM interactions, we find that while writers seek LLM's help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.
The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions has sparked increased interest in their use across various support tools. Through an empirical user study (n=30), we investigate the utility of modern LLMs in assisting professional writers. The design of our collaborative writing interface is grounded in the cognitive process model of writing, which views writing as a goal-oriented thinking process comprising non-linear cognitive activities: planning, translating, and reviewing. Participants were asked to complete a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Analyzing the writer-LLM interactions, we find that while writers seek the LLM's help across all three types of cognitive activities, they find LLMs more helpful for translating and reviewing. Our findings from both the interactions and the survey responses highlight future research directions for creative writing assistance using LLMs.
https://arxiv.org/abs/2309.12570