It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to follow instructions steadily due to difficulties in understanding abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models, and we employ a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and to translate imaginations into more precise visual prompts tailored to the current state; subsequently, the agent generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instructions at each step. Extensive experiments demonstrate that MineDreamer follows single and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.
https://arxiv.org/abs/2403.12037
As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present ROUTERBENCH, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through ROUTERBENCH, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at this https URL.
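The cost-quality trade-off at the heart of LLM routing can be sketched with a toy router. The model names, quality estimates, and budget rule below are illustrative assumptions, not the ROUTERBENCH implementation:

```python
# Toy sketch of a cost-aware LLM router: pick the highest-quality model
# whose per-call cost fits the budget, else fall back to the cheapest.
# All model names and numbers here are made up for illustration.

def route(estimates, budget):
    """estimates: dict mapping model name -> (predicted_quality, cost_per_call)."""
    affordable = {m: q for m, (q, c) in estimates.items() if c <= budget}
    if not affordable:
        # nothing fits the budget: fall back to the cheapest model
        return min(estimates, key=lambda m: estimates[m][1])
    return max(affordable, key=affordable.get)

estimates = {
    "small-llm": (0.62, 0.001),
    "mid-llm":   (0.74, 0.010),
    "large-llm": (0.85, 0.060),
}
assert route(estimates, budget=0.02) == "mid-llm"      # best quality under the cap
assert route(estimates, budget=0.0005) == "small-llm"  # cheapest-model fallback
```

A practical router would predict per-query quality with a learned model; the point here is only the quality-under-cost selection that routing benchmarks evaluate.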
https://arxiv.org/abs/2403.12031
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models (LLMs), have revolutionized various natural language processing (NLP) tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. The paper begins by defining chart understanding, outlining problem formulations, and discussing fundamental building blocks crucial for studying chart understanding tasks. In the section on tasks and datasets, we explore various tasks within chart understanding and discuss their evaluation metrics and sources of both charts and textual inputs. Modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance on each task and how it can be improved. Challenges and future directions are addressed in a dedicated section, highlighting issues such as domain-specific charts, lack of efforts in evaluation, and agent-oriented settings. This survey paper serves to provide valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: this https URL.
https://arxiv.org/abs/2403.12027
We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL .
https://arxiv.org/abs/2403.12026
Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.
https://arxiv.org/abs/2403.12025
The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in the LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. Additionally, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.
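The mass-covering vs. mode-seeking contrast can be illustrated numerically with forward and reverse KL divergence; the discrete distributions below are made up for illustration:

```python
# Numeric illustration of mass-covering vs. mode-seeking divergence
# minimization on toy discrete distributions (invented values).

import math

def kl(p, q):
    # KL(p || q) over discrete distributions, skipping zero-mass terms of p
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_eps = [0.5, 0.5, 1e-9]     # two-mode target (tiny mass avoids log(0))
q_cover = [0.4, 0.4, 0.2]    # spreads mass over everything
q_seek = [0.98, 0.01, 0.01]  # locks onto a single mode

# Forward KL(p||q) punishes q for missing mass where p lives,
# so it favors the mass-covering candidate...
assert kl(p_eps, q_cover) < kl(p_eps, q_seek)
# ...while reverse KL(q||p) punishes q for placing mass where p has
# almost none, so it favors the mode-seeking candidate.
assert kl(q_seek, p_eps) < kl(q_cover, p_eps)
```

This asymmetry is why the choice of divergence matters when imitating demonstrations: forward-KL-style objectives spread probability over all demonstrated behaviors, while reverse-KL-style objectives concentrate on a subset of them.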
https://arxiv.org/abs/2403.12017
Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at? We propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and simulator objectives that the agents should learn and is then asked to generate a set of environment configurations (e.g., different terrains, items given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills that the agent is weak at, by providing feedback to the LLM in the form of the agent's performance. We demonstrate the usefulness of EnvGen with comprehensive experiments in Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We show qualitatively how the LLM adapts training environments to help improve RL agents' weaker skills over time. Additionally, EnvGen is substantially more efficient as it only uses a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of LLM calls. Lastly, we present detailed ablation studies for our design choices.
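The generate-train-feedback loop above can be sketched as follows; the skill names, success threshold, and round-robin allocation are simplified stand-ins for the LLM's environment generation, not the authors' code:

```python
# Illustrative EnvGen-style feedback step: skills the agent is weak at
# receive the next round of generated environment configs. All names and
# thresholds are assumptions for illustration.

def envgen_cycle(skill_success, n_configs=4, threshold=0.5):
    """skill_success: dict mapping skill -> success rate from RL evaluation."""
    weak = sorted((s for s, rate in skill_success.items() if rate < threshold),
                  key=skill_success.get)          # weakest skills first
    if not weak:
        return []                                 # nothing left to target
    # spread the weak skills across the generated environment configs
    return [{"focus_skill": weak[i % len(weak)]} for i in range(n_configs)]

success = {"mine_wood": 0.9, "fight_zombie": 0.2, "craft_table": 0.4}
configs = envgen_cycle(success)
assert [c["focus_skill"] for c in configs] == [
    "fight_zombie", "craft_table", "fight_zombie", "craft_table"]
```

In the full framework an RL training phase runs between cycles, and the loop repeats only a handful of times (the abstract cites about 4 LLM calls in total).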
https://arxiv.org/abs/2403.12014
Feedback is a critical aspect of improvement. Unfortunately, when there is a lot of feedback from multiple sources, it can be difficult to distill the information into actionable insights. Consider student evaluations of teaching (SETs), which are important sources of feedback for educators. They can give instructors insights into what worked during a semester. A collection of SETs can also be useful to administrators as signals for courses or entire programs. However, on a large scale as in high-enrollment courses or administrative records over several years, the volume of SETs can render them difficult to analyze. In this paper, we discuss a novel method for analyzing SETs using natural language processing (NLP) and large language models (LLMs). We demonstrate the method by applying it to a corpus of 5,000 SETs from a large public university. We show that the method can be used to extract, embed, cluster, and summarize the SETs to identify the themes they express. More generally, this work illustrates how to use the combination of NLP techniques and LLMs to generate a codebook for SETs. We conclude by discussing the implications of this method for analyzing SETs and other types of student writing in teaching and research settings.
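The extract-embed-cluster stage of such a pipeline can be sketched with toy bag-of-words embeddings and nearest-theme assignment; a real pipeline would use sentence embeddings and an LLM to summarize the clusters, and the vocabulary and themes below are invented:

```python
# Toy embed-and-cluster sketch for student-evaluation comments.
# Vocabulary, themes, and comments are invented for illustration.

from collections import Counter
import math

def embed(text, vocab):
    # bag-of-words vector over a fixed vocabulary
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(a * a for a in v)) or 1.0
    return dot / (nu * nv)

vocab = ["lectures", "clear", "homework", "hard"]
themes = {"clarity": embed("lectures clear", vocab),
          "workload": embed("homework hard", vocab)}

def assign_theme(comment):
    # nearest theme centroid by cosine similarity
    v = embed(comment, vocab)
    return max(themes, key=lambda t: cosine(v, themes[t]))

assert assign_theme("the lectures were very clear") == "clarity"
assert assign_theme("too much homework and it was hard") == "workload"
```

The summarization step (turning each cluster into a codebook entry) is where the LLM comes in and is omitted here.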
https://arxiv.org/abs/2403.11984
With the rapid development of generative models, Artificial Intelligence-Generated Contents (AIGC) have exponentially increased in daily lives. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then it leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released at this https URL.
https://arxiv.org/abs/2403.11956
Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 4-10x depending on the base LLM - as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.
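A key-value episodic memory with one-shot writes and selective forgetting — in the spirit of, but far simpler than, Larimar's distributed memory — can be sketched as:

```python
# Toy key-value episodic memory supporting one-shot fact edits and
# selective forgetting. This is an illustrative analogy, not Larimar's
# actual (distributed, learned) memory architecture.

class EpisodicMemory:
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        # one-shot update: no retraining or fine-tuning of the base model
        self.store[key] = value

    def read(self, key, default=None):
        # a memory hit overrides whatever the frozen base LLM would say
        return self.store.get(key, default)

    def forget(self, key):
        # selective fact forgetting
        self.store.pop(key, None)

mem = EpisodicMemory()
mem.write("capital_of_france", "Paris")
mem.write("capital_of_france", "Lyon")   # sequential edit, applied in one shot
assert mem.read("capital_of_france") == "Lyon"
mem.forget("capital_of_france")
assert mem.read("capital_of_france") is None
```

The speed-ups reported in the abstract come precisely from this write-instead-of-retrain pattern: an edit touches only the memory, never the LLM weights.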
https://arxiv.org/abs/2403.11901
Implicit gender bias in Large Language Models (LLMs) is a well-documented problem, and implications of gender introduced into automatic translations can perpetuate real-world biases. However, some LLMs use heuristics or post-processing to mask such bias, making investigation difficult. Here, we examine bias in LLMs via back-translation, using the DeepL translation API to investigate the bias evinced when repeatedly translating a set of 56 Software Engineering tasks used in a previous study. Each statement starts with 'she', and is translated first into a 'genderless' intermediate language then back into English; we then examine pronoun choice in the back-translated texts. We expand prior research in the following ways: (1) by comparing results across five intermediate languages, namely Finnish, Indonesian, Estonian, Turkish and Hungarian; (2) by proposing a novel metric for assessing the variation in gender implied in the repeated translations, avoiding the over-interpretation of individual pronouns apparent in earlier work; (3) by investigating sentence features that drive bias; (4) and by comparing results from three time-lapsed datasets to establish the reproducibility of the approach. We found that some languages display similar patterns of pronoun use, falling into three loose groups, but that patterns vary between groups; this underlines the need to work with multiple languages. We also identify the main verb appearing in a sentence as a likely significant driver of implied gender in the translations. Moreover, we see a good level of replicability in the results, and establish that our variation metric proves robust despite an obvious change in the behaviour of the DeepL translation API during the course of the study. These results show that the back-translation method can provide further insights into bias in language models.
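The round-trip probe can be sketched as follows; the pronoun-frequency distribution computed here is a simplified stand-in for the paper's variation metric, and the example sentences are invented:

```python
# Simplified back-translation probe: after several round-trips through a
# genderless intermediate language, tally which pronoun each sentence came
# back with. This is an illustrative stand-in, not the paper's metric.

from collections import Counter

def pronoun_distribution(back_translations):
    counts = Counter()
    for text in back_translations:
        first = text.split()[0].lower()
        if first in {"she", "he", "they"}:
            counts[first] += 1
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

# four hypothetical round-trips of a sentence that originally began with 'she'
trips = ["He writes the tests.", "She writes the tests.",
         "He writes the tests.", "He writes the tests."]
dist = pronoun_distribution(trips)
assert dist == {"he": 0.75, "she": 0.25}
```

Aggregating such distributions across repeated translations, rather than reading individual pronouns, is what avoids the over-interpretation the abstract warns about.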
https://arxiv.org/abs/2403.11896
Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. By introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks, especially in the extremely few-parameter regimes.
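The rank-r update that all LoRA variants build on can be shown in a few lines of pure Python; SuperLoRA's grouping, folding, shuffling, projecting, and tensor-factoring extensions are not reproduced here:

```python
# Minimal sketch of the basic LoRA update: the frozen weight W
# (d_out x d_in) is adapted as W + B @ A, training only
# r * (d_in + d_out) parameters instead of d_in * d_out.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d_out, d_in, r = 3, 4, 1
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen
A = [[1.0, 0.0, -1.0, 0.0]]                 # trainable, r x d_in
B = [[0.0], [0.0], [0.0]]                   # trainable, d_out x r, zero-init

W_adapted = add(W, matmul(B, A))
# zero-initialized B makes the adapter a no-op at the start of training
assert W_adapted == W

# full fine-tuning would train d_out * d_in = 12 weights; LoRA trains:
assert r * (d_in + d_out) == 7
```

The "extremely few-parameter regime" the abstract mentions corresponds to pushing the trainable count below even this `r * (d_in + d_out)` baseline via sharing and factorization.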
https://arxiv.org/abs/2403.11887
Employing Large Language Models (LLMs) for semantic parsing has achieved remarkable success. However, we find existing methods fall short in terms of reliability and efficiency when hallucinations are encountered. In this paper, we address these challenges with a framework called QueryAgent, which solves a question step-by-step and performs step-wise self-correction. We introduce an environmental feedback-based self-correction method called ERASER. Unlike traditional approaches, ERASER leverages rich environmental feedback in the intermediate steps to perform selective and differentiated self-correction only when necessary. Experimental results demonstrate that QueryAgent notably outperforms all previous few-shot methods using only one example on GrailQA and GraphQ by 7.0 and 15.0 F1. Moreover, our approach exhibits superiority in terms of efficiency, including runtime, query overhead, and API invocation costs. By leveraging ERASER, we further improve another baseline (i.e., AgentBench) by approximately 10 points, revealing the strong transferability of our approach.
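Feedback-triggered, selective correction of individual steps — the idea behind ERASER — can be sketched schematically; the executor and correction functions below are placeholders, not the paper's components:

```python
# Schematic step-wise self-correction loop: a step is re-generated only
# when the environment reports an error, rather than blindly revising
# every step. The toy executor and corrector are invented placeholders.

def solve(steps, execute, correct, max_retries=2):
    trace = []
    for step in steps:
        result = execute(step)
        retries = 0
        while result.get("error") and retries < max_retries:
            step = correct(step, result["error"])   # selective correction
            result = execute(step)
            retries += 1
        trace.append((step, result))
    return trace

# toy environment: the executor rejects steps ending in '?'
def execute(step):
    return {"error": "bad syntax"} if step.endswith("?") else {"ok": step}

def correct(step, error):
    return step.rstrip("?")

trace = solve(["find_entity?", "filter_type"], execute, correct)
assert [r for _, r in trace] == [{"ok": "find_entity"}, {"ok": "filter_type"}]
```

Because correct steps are never touched, this pattern keeps query overhead low — the efficiency property the abstract highlights.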
https://arxiv.org/abs/2403.11886
It is challenging for autonomous control systems to perform complex tasks in the presence of latent risks. Motivated by this challenge, this paper proposes an integrated framework that involves Large Language Models (LLMs), stochastic gradient descent (SGD), and optimization-based control. In the first phase, the proposed framework breaks down complex tasks into a sequence of smaller subtasks, whose specifications account for contextual information and latent risks. In the second phase, these subtasks and their parameters are refined through a dual process involving LLMs and SGD. LLMs are used to generate rough guesses and failure explanations, and SGD is used to fine-tune parameters. The proposed framework is tested using simulated case studies of robots and vehicles. The experiments demonstrate that the proposed framework can mediate actions based on the context and latent risks and learn complex behaviors efficiently.
https://arxiv.org/abs/2403.11863
In the rapidly evolving field of artificial intelligence (AI), the application of large language models (LLMs) in agriculture, particularly in pest management, remains nascent. We aimed to demonstrate feasibility by evaluating the content of the pest management advice generated by LLMs, including the Generative Pre-trained Transformer (GPT) series from OpenAI and the FLAN series from Google. Considering the context-specific properties of agricultural advice, automatically measuring or quantifying the quality of text generated by LLMs becomes a significant challenge. We proposed an innovative approach, using GPT-4 as an evaluator, to score the generated content on Coherence, Logical Consistency, Fluency, Relevance, Comprehensibility, and Exhaustiveness. Additionally, we integrated an expert system based on crop threshold data as a baseline to obtain scores for Factual Accuracy on whether pests found in crop fields should take management action. Each model's score was weighted by percentage to obtain a final score. The results showed that GPT-3.5 and GPT-4 outperform the FLAN models in most evaluation categories. Furthermore, the use of instruction-based prompting containing domain-specific knowledge proved the feasibility of LLMs as an effective tool in agriculture, with an accuracy rate of 72%, demonstrating LLMs' effectiveness in providing pest management suggestions.
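The percentage-weighted combination of rubric scores and factual accuracy can be sketched as follows; the 70/30 weights and all score values are invented for illustration, not taken from the paper:

```python
# Illustrative weighted-scoring scheme: rubric scores from a GPT-4
# evaluator are combined by percentage with a factual-accuracy score from
# the expert-system baseline. Weights and values are assumptions.

def final_score(rubric_scores, factual_accuracy,
                rubric_weight=0.7, factual_weight=0.3):
    rubric_mean = sum(rubric_scores.values()) / len(rubric_scores)
    return rubric_weight * rubric_mean + factual_weight * factual_accuracy

scores = {"coherence": 0.9, "logical_consistency": 0.8, "fluency": 0.9,
          "relevance": 0.7, "comprehensibility": 0.8, "exhaustiveness": 0.7}
s = final_score(scores, factual_accuracy=0.72)
# rubric mean is 0.8, so the weighted total is 0.7*0.8 + 0.3*0.72 = 0.776
assert abs(s - 0.776) < 1e-6
```

The expert-system term anchors the otherwise subjective rubric scores to a verifiable ground truth (whether action thresholds were actually exceeded).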
https://arxiv.org/abs/2403.11858
Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, thereby establishing a comprehensive library of guidelines and models for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with pertinent guidelines, guiding LLMs in response generation to ensure safe and high-quality outputs, thus aligning with human values. An additional optional stage involves fine-tuning a model with new well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluated our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.
https://arxiv.org/abs/2403.11838
The ability to understand and reason about the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework addressing 3D scene understanding in a zero-shot manner. The essence of our approach centers on reconceptualizing the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how we human beings attempt to understand 3D scenes. By consolidating this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments.
https://arxiv.org/abs/2403.11835
Machine learning models are vulnerable to maliciously crafted Adversarial Examples (AEs). Training a machine learning model with AEs improves its robustness and stability against adversarial attacks. It is essential to develop models that produce high-quality AEs. Developing such models has been much slower in natural language processing (NLP) than in areas such as computer vision. This paper introduces a practical and efficient adversarial attack model called SSCAE for \textbf{S}emantic, \textbf{S}yntactic, and \textbf{C}ontext-aware natural language \textbf{AE}s generator. SSCAE identifies important words and uses a masked language model to generate an early set of substitutions. Next, two well-known language models are employed to evaluate the initial set in terms of semantic and syntactic characteristics. We introduce (1) a dynamic threshold to capture more efficient perturbations and (2) a local greedy search to generate high-quality AEs. As a black-box method, SSCAE generates humanly imperceptible and context-aware AEs that preserve semantic consistency and the source language's syntactical and grammatical requirements. The effectiveness and superiority of the proposed SSCAE model are illustrated with fifteen comparative experiments and extensive sensitivity analysis for parameter optimization. SSCAE outperforms the existing models in all experiments while maintaining a higher semantic consistency with a lower query number and a comparable perturbation rate.
https://arxiv.org/abs/2403.11833
Recent advances in text-to-image synthesis have been enabled by exploiting a combination of language and vision through foundation models. These models are pre-trained on tremendous amounts of text-image pairs sourced from the World Wide Web or other large-scale databases. As the demand for high-quality image generation shifts towards ensuring content alignment between text and image, novel evaluation metrics have been developed with the aim of mimicking human judgments. Thus, researchers have started to collect datasets with increasingly complex annotations to study the compositionality of vision-language models and their incorporation as a quality measure of compositional alignment between text and image contents. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics and propose a new taxonomy for categorizing these metrics. We also review frequently adopted text-image benchmark datasets before discussing techniques to optimize text-to-image synthesis models towards quality and human preferences. Ultimately, we derive guidelines for improving text-to-image evaluation and discuss the open challenges and current limitations.
https://arxiv.org/abs/2403.11821
Metaphors in natural language are a reflection of fundamental cognitive processes such as analogical reasoning and categorisation, and are deeply rooted in everyday communication. Metaphor understanding is therefore an essential task for large language models (LLMs). We release the Metaphor Understanding Challenge Dataset (MUNCH), designed to evaluate the metaphor understanding capabilities of LLMs. The dataset provides over 10k paraphrases for sentences containing metaphor use, as well as 1.5k instances containing inapt paraphrases. The inapt paraphrases were carefully selected to serve as control to determine whether the model indeed performs full metaphor interpretation or rather resorts to lexical similarity. All apt and inapt paraphrases were manually annotated. The metaphorical sentences cover natural metaphor uses across 4 genres (academic, news, fiction, and conversation), and they exhibit different levels of novelty. Experiments with LLaMA and GPT-3.5 demonstrate that MUNCH presents a challenging task for LLMs. The dataset is freely accessible at this https URL.
https://arxiv.org/abs/2403.11810