Fact extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and obtain unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both the HyperRED-Temporal and ComplexTRED datasets.
https://arxiv.org/abs/2405.10288
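As a concrete illustration of the timeline-based decomposition step, here is a minimal Python sketch that prompts an LLM with an in-context example to split a complex sentence into time-anchored simple statements. The prompt wording, the few-shot example, the model choice, and the `decompose_by_timeline` helper are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of timeline-based sentence decomposition via in-context
# learning; the prompt and model choice are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = """Sentence: Smith joined Acme in 2010 and became CEO in 2015.
Decomposition:
- [2010] Smith joined Acme.
- [2015] Smith became CEO of Acme.
"""

def decompose_by_timeline(sentence: str) -> str:
    """Ask the LLM to split a complex sentence into time-anchored statements."""
    prompt = (
        "Decompose the sentence into simple statements, each prefixed with "
        "the time span in which it holds.\n\n"
        + FEW_SHOT
        + f"\nSentence: {sentence}\nDecomposition:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Each time-anchored statement would then go to the fine-tuned smaller PLM,
# which emits (subject, relation, object, time) tuples.
```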
Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers, where the optimization task is to find instructions that maximize task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as the LLaMA-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness in small-scale LLMs, whose limited inference capabilities constrain their optimization ability. We suggest that future automatic prompt engineering consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
https://arxiv.org/abs/2405.10276
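For readers unfamiliar with OPRO, the following sketch shows the loop the paper revisits: keep a trajectory of (instruction, accuracy) pairs, show it to an optimizer LLM in a meta-prompt, and ask for a better instruction. `optimizer_generate` and `evaluate_on_task` are assumed stand-ins for a model call and a task scorer.

```python
# Simplified OPRO-style loop: the optimizer LLM sees scored instructions and
# proposes new ones; the helper functions are assumptions for illustration.
def opro_loop(optimizer_generate, evaluate_on_task, seed_instruction, steps=20):
    trajectory = [(seed_instruction, evaluate_on_task(seed_instruction))]
    for _ in range(steps):
        # Meta-prompt lists past instructions in ascending order of score.
        history = "\n".join(
            f"text: {ins}\nscore: {score:.1f}"
            for ins, score in sorted(trajectory, key=lambda p: p[1])
        )
        meta_prompt = (
            "Below are instructions with their task accuracies.\n"
            f"{history}\n"
            "Write a new instruction that achieves a higher accuracy."
        )
        candidate = optimizer_generate(meta_prompt)
        trajectory.append((candidate, evaluate_on_task(candidate)))
    return max(trajectory, key=lambda p: p[1])  # best instruction found
```

The paper's finding is that this loop stalls when the optimizer role is played by a small-scale LLM, which struggles to infer the improvement pattern from the scored trajectory.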
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic evaluation results, accompanied by a detailed analysis.
https://arxiv.org/abs/2405.10251
In this paper, we introduce a novel psychological benchmark, CPsyExam, constructed from questions sourced from Chinese language examinations. CPsyExam is designed to prioritize psychological knowledge and case analysis separately, recognizing the significance of applying psychological knowledge to real-world scenarios. From the pool of 22k questions, we utilize 4k to create the benchmark, which offers balanced coverage of subjects and incorporates a diverse range of case analysis techniques. Furthermore, we evaluate a range of existing large language models (LLMs), spanning from open-sourced to API-based models. Our experiments and analysis demonstrate that CPsyExam serves as an effective benchmark for enhancing the understanding of psychology within LLMs and enables the comparison of LLMs across various granularities.
https://arxiv.org/abs/2405.10212
The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long-fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in effectively addressing questions related to literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at this https URL
https://arxiv.org/abs/2405.10166
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, so as to enhance their abilities to perform both general and special-purpose dialogue tasks. However, the ability to personalize the generated utterances to speakers, whether conducted by humans or LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under our experimental setups. We further utilize the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that the current role-playing models fail to accurately mimic speakers, primarily due to their inherent linguistic characteristics.
https://arxiv.org/abs/2405.10150
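A minimal sketch of the verification protocol itself: embed each utterance set, average into a speaker vector, and threshold the cosine similarity. The sentence-transformers encoder and the threshold are illustrative assumptions; the paper's verification models may be trained differently.

```python
# Hedged sketch: speaker verification over two sets of utterances.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def same_speaker(utts_a: list[str], utts_b: list[str], threshold: float = 0.7) -> bool:
    """Return True if the two utterance sets likely share a speaker."""
    a = encoder.encode(utts_a).mean(axis=0)  # speaker embedding for set A
    b = encoder.encode(utts_b).mean(axis=0)  # speaker embedding for set B
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold
```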
In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computation, enabling different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at this https URL.
https://arxiv.org/abs/2405.10140
The emergence of large language models (LLMs) capable of generating realistic texts and images has sparked ethical concerns across various sectors. In response, researchers in academia and industry are actively exploring methods to distinguish AI-generated content from human-authored material. However, a crucial question remains: What are the unique characteristics of AI-generated text? Addressing this gap, this study proposes StyloAI, a data-driven model that uses 31 stylometric features to identify AI-generated texts by applying a Random Forest classifier on two multi-domain datasets. StyloAI achieves accuracy rates of 81% and 98% on the test set of the AuTextification dataset and the Education dataset, respectively. This approach surpasses the performance of existing state-of-the-art models and provides valuable insights into the differences between AI-generated and human-authored texts.
https://arxiv.org/abs/2405.10129
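The StyloAI recipe lends itself to a compact sketch: hand-crafted stylometric features fed to a Random Forest. The classifier choice matches the abstract; the three features below are illustrative stand-ins for the paper's 31.

```python
# Hedged sketch of a stylometry-based AI-text detector in the StyloAI style.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stylometric_features(text: str) -> list[float]:
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return [
        len(words) / max(len(sentences), 1),              # mean sentence length
        len(set(words)) / max(len(words), 1),             # type-token ratio
        sum(len(w) for w in words) / max(len(words), 1),  # mean word length
    ]

def train_detector(texts: list[str], labels: list[int]) -> RandomForestClassifier:
    X = np.array([stylometric_features(t) for t in texts])
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X, labels)  # labels: 1 = AI-generated, 0 = human-authored
    return clf
```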
Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision/Language Models (LVLMs) are less capable of generating the accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences that integrates a Latent Diffusion Model (LDM) with an LLM, which transforms the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism that initialises reverse diffusion processes with a latent vector iteration from a previously generated image of a relevant step. Both strategies condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of cases, against 26.6% for the second-best method. In addition, automatic metrics show that the proposed method maintains semantic coherence and visual consistency across steps in both domains.
https://arxiv.org/abs/2405.10122
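The copy mechanism can be sketched as an image-to-image style initialization of reverse diffusion: reuse the latent of a previously generated image from a relevant step and re-noise it partway before denoising. The `unet`, `scheduler`, and `text_emb` below are generic LDM placeholders (diffusers-style), not the paper's exact components.

```python
# Hedged sketch of the copy mechanism: initialize reverse diffusion from a
# previous step's latent instead of pure noise, preserving visual consistency.
import torch

def generate_step_image(prev_latent, text_emb, scheduler, unet, strength=0.6):
    scheduler.set_timesteps(50)
    start = int(len(scheduler.timesteps) * strength)
    t0 = scheduler.timesteps[start]
    noise = torch.randn_like(prev_latent)
    latent = scheduler.add_noise(prev_latent, noise, t0)  # copy, then re-noise
    for t in scheduler.timesteps[start:]:
        # Condition on the caption of the current instruction step.
        eps = unet(latent, t, encoder_hidden_states=text_emb).sample
        latent = scheduler.step(eps, t, latent).prev_sample
    return latent  # decode with the VAE to obtain the image
```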
Integrating multimodal knowledge into large language models (LLMs) represents a significant advancement in dialogue generation capabilities. However, the effective incorporation of such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, using an Implicit Query Transformer to extract and encode visual implicit knowledge from image-text pairs into knowledge vectors; and knowledge integration, employing a novel Bidirectional Variational Information Fusion technique to seamlessly integrate these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also exhibit a deep understanding of the context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Our extensive experimentation across two dialogue datasets shows that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be publicly available following acceptance.
https://arxiv.org/abs/2405.10121
LLM watermarking, which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of large language models. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily experiment with, understand, and assess the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at this https URL.
https://arxiv.org/abs/2405.10051
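MarkLLM's own interfaces are not reproduced here; as a hedged illustration of the kind of algorithm such a toolkit wraps, the sketch below implements a minimal "green-list" watermark detector in the style of Kirchenbauer et al.: a keyed hash of the previous token partitions the vocabulary, generation boosts green tokens, and watermarked text therefore shows an abnormally high green fraction.

```python
# Hedged sketch of green-list watermark detection (not MarkLLM's actual API).
import hashlib

def is_green(prev_id: int, token_id: int, key: str = "secret", ratio: float = 0.5) -> bool:
    """Keyed pseudo-random vocabulary partition, seeded by the previous token."""
    h = hashlib.sha256(f"{key}:{prev_id}:{token_id}".encode()).digest()
    return int.from_bytes(h[:4], "big") / 2**32 < ratio

def green_fraction(token_ids: list[int]) -> float:
    hits = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    return hits / max(len(token_ids) - 1, 1)

# A fraction significantly above `ratio` (e.g., by a z-test over the token
# count) flags the text as watermarked LLM output.
```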
Large language models (LLMs) are versatile and can address many tasks, but for computational efficiency, it is often desirable to distill their capabilities into smaller student models. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is "seeded" with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to standard 32-shot prompting and six baseline approaches.
https://arxiv.org/abs/2405.10040
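The core idea is easy to sketch: each retrieved passage seeds the LLM with different content, so the synthesized examples vary. `retrieve` and `llm_generate` are assumed interfaces, not the paper's implementation.

```python
# Hedged sketch of retrieval-augmented dataset synthesis (SynthesizRR-style).
def synthesize_dataset(labels, retrieve, llm_generate, per_label=100):
    dataset = []
    for label in labels:
        # Different passages "seed" the LLM with different content.
        passages = retrieve(query=label, k=per_label)
        for passage in passages:
            prompt = (
                f"Passage:\n{passage}\n\n"
                f"Using the passage as inspiration, write one example of a "
                f"'{label}' text for a classification dataset."
            )
            dataset.append((llm_generate(prompt), label))
    return dataset
```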
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse the LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
https://arxiv.org/abs/2405.10025
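The cloze reformatting can be illustrated in a few lines: since N-best hypotheses agree on most tokens, keep the agreed tokens and blank out the disagreements so the LLM only decides the uncertain slots. Position-wise alignment is a simplifying assumption here; real hypotheses would need proper sequence alignment.

```python
# Hedged sketch: turn N-best ASR hypotheses into a cloze template.
def nbest_to_cloze(hypotheses: list[str]):
    token_lists = [h.split() for h in hypotheses]
    length = min(len(t) for t in token_lists)
    template, choices = [], []
    for i in range(length):
        options = {t[i] for t in token_lists}
        if len(options) == 1:
            template.append(options.pop())             # all hypotheses agree
        else:
            template.append(f"<blank{len(choices)}>")  # uncertain slot
            choices.append(sorted(options))
    return " ".join(template), choices

# nbest_to_cloze(["i saw a cat", "i saw the cat"])
# -> ("i saw <blank0> cat", [["a", "the"]])
```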
Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding remains unexplored in MLLMs. To fill this gap, we use referring expression comprehension (REC) as an example task in visual grounding and propose three adversarial attack paradigms, as follows. First, untargeted adversarial attacks induce MLLMs to generate incorrect bounding boxes for each object. Besides, exclusive targeted adversarial attacks force all generated outputs to the same target bounding box. In addition, permuted targeted adversarial attacks aim to permute all bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.
https://arxiv.org/abs/2405.09981
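The untargeted paradigm can be sketched as a PGD-style attack: perturb the image within an L-infinity ball to push the predicted box away from the clean prediction. `mllm_predict_box` (a differentiable box output for a referring expression) is an assumed interface.

```python
# Hedged sketch of an untargeted adversarial attack on referring expression
# comprehension; the model interface is an assumption for illustration.
import torch

def untargeted_attack(image, expression, mllm_predict_box,
                      eps=8 / 255, alpha=1 / 255, steps=20):
    clean_box = mllm_predict_box(image, expression).detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Maximize deviation from the clean prediction.
        loss = torch.nn.functional.l1_loss(mllm_predict_box(adv, expression), clean_box)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()      # gradient ascent step
        adv = image + (adv - image).clamp(-eps, eps)  # project to eps-ball
        adv = adv.clamp(0, 1)
    return adv
```

The exclusive-targeted and permuted-targeted variants would instead minimize the distance to a fixed target box, or to the boxes of other objects in the image.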
The use of question-answer (QA) pairs for training and evaluating large language models (LLMs) has attracted considerable attention. Yet few available QA datasets are based on knowledge from the scientific literature. Here we bridge this gap by presenting Automatic Generation of Scientific Question Answers (SciQAG), a framework for automatic generation and evaluation of scientific QA pairs sourced from published scientific literature. We fine-tune an open-source LLM to generate 960,000 scientific QA pairs from full-text scientific papers and propose a five-dimensional metric to evaluate the quality of the generated QA pairs. We show via LLM-based evaluation that the generated QA pairs consistently achieve an average score of 2.5 out of 3 across five dimensions, indicating that our framework can distill key knowledge from papers into high-quality QA pairs at scale. We make the dataset, models, and evaluation code publicly available.
https://arxiv.org/abs/2405.09939
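A hedged sketch of the two halves of such a framework: a generator LLM turns paper text into QA pairs, and a judge LLM scores each pair on five dimensions. Only the five-dimension, 3-point scoring scale comes from the abstract; the dimension names and the `generator_llm`/`judge_llm` interfaces are illustrative assumptions.

```python
# Hedged sketch of QA generation plus five-dimensional LLM-based scoring.
import json

DIMENSIONS = ["relevance", "groundedness", "completeness",
              "answerability", "independence"]  # assumed names

def generate_and_score(paper_text: str, generator_llm, judge_llm, n_pairs: int = 10):
    pairs = json.loads(generator_llm(
        f"Generate {n_pairs} question-answer pairs grounded in this paper, "
        'as a JSON list of {"q": ..., "a": ...} objects:\n\n' + paper_text
    ))
    scored = []
    for p in pairs:
        scores = json.loads(judge_llm(
            f"Rate this QA pair from 0 to 3 on each of {DIMENSIONS}, "
            "answering as a JSON object mapping dimension to score.\n"
            f"Q: {p['q']}\nA: {p['a']}\nPaper:\n{paper_text}"
        ))
        scored.append((p, sum(scores.values()) / len(scores)))
    return scored
```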
As natural language generation (NLG) models have become prevalent, systematically assessing the quality of machine-generated texts has become increasingly important. Recent studies introduce LLM-based evaluators that operate as reference-free metrics, demonstrating their capability to adeptly handle novel tasks. However, these models generally rely on a single-agent approach, which, we argue, introduces an inherent limit to their performance. This is because there exist biases in an LLM agent's responses, including preferences for certain text structures or content. In this work, we propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system augmented with the concept of a Devil's Advocate. Within the framework, one agent is instructed to criticize other agents' arguments, potentially resolving the bias in LLM agents' answers. DEBATE substantially outperforms the previous state-of-the-art methods on two meta-evaluation benchmarks in NLG evaluation, SummEval and TopicalChat. We also show that the extensiveness of debates among agents and the persona of an agent can influence the performance of evaluators.
https://arxiv.org/abs/2405.09935
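The Devil's Advocate mechanism reduces to a simple loop: a scorer proposes a judgment, the advocate is explicitly instructed to attack it, and the scorer revises. The prompts and the `llm` callable are illustrative assumptions.

```python
# Hedged sketch of DEBATE-style multi-agent evaluation with a devil's advocate.
def debate_score(source: str, candidate: str, llm, rounds: int = 2) -> str:
    verdict = llm(
        "Score this summary of the source from 1 to 5 and justify.\n"
        f"Source: {source}\nSummary: {candidate}"
    )
    for _ in range(rounds):
        critique = llm(
            "You are a devil's advocate. Find flaws in this evaluation and "
            f"argue against it:\n{verdict}"
        )
        verdict = llm(
            f"Original evaluation:\n{verdict}\n\nCritique:\n{critique}\n\n"
            "Revise the score if the critique is convincing, otherwise defend "
            "it. Output the final 1-5 score with a short justification."
        )
    return verdict
```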
Pretrained Large Language Models (LLMs) such as ChatGPT, Claude, etc. have demonstrated strong capabilities in various fields of natural language generation. However, there are still many problems when using LLMs in specialized domain-specific fields. When using generative AI to process downstream tasks, a common approach is to add new knowledge (e.g., private domain knowledge, cutting-edge information) to a pretrained model through continued training or fine-tuning. However, whether there is a universal paradigm for domain adaptation training is still an open question. In this article, we propose the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of a downstream task, constructs a new subset using a heuristic function $\phi$ over each special token and its information gain, builds a new domain-specific tokenizer from that subset, and continues pretraining on the downstream task data. We explore the many positive effects of this customized tokenizer on domain-adaptive pretraining and verify that the method performs better than the ordinary approach of simply collecting data and fine-tuning. In our experiments, continued pretraining with IGOT on LLaMA-7B achieved 11.9% token savings, 12.2% training-time savings, and 5.8% savings in maximum GPU VRAM usage; combined with the T5 model, we can even reach a 31.5% training-time saving, making the porting of general generative AI to specific domains more effective than before. In domain-specific tasks, supervised $\mathrm{IGOT}_\tau$ shows great performance in reducing both the convergence radius and the convergence point during continued pretraining.
https://arxiv.org/abs/2405.09857
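One way to read the IGOT construction is as tokenizer extension driven by a scoring heuristic: rank candidate domain tokens by how much they would help, add the top ones, then continue pretraining. The paper's exact $\phi$ is not reproduced here; the token-savings heuristic below is an illustrative assumption.

```python
# Hedged sketch of IGOT-style tokenizer extension before continued pretraining.
from collections import Counter
from transformers import AutoTokenizer

def igot_extend(model_name: str, domain_corpus: list[str], k: int = 1000):
    tok = AutoTokenizer.from_pretrained(model_name)
    counts = Counter(w for doc in domain_corpus for w in doc.split())

    def phi(word: str) -> float:
        # Assumed heuristic: subword tokens saved per occurrence, weighted by
        # corpus frequency (a proxy for the information gain of a new token).
        pieces = len(tok.tokenize(word))
        return counts[word] * (pieces - 1)

    new_tokens = [w for w, _ in sorted(counts.items(), key=lambda p: -phi(p[0]))[:k]]
    tok.add_tokens(new_tokens)
    return tok  # then model.resize_token_embeddings(len(tok)) and keep training
```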
This paper addresses the problem of object-goal navigation in autonomous inspections in real-world environments. Object-goal navigation is crucial to enable effective inspections in various settings, often requiring the robot to identify the target object within a large search space. Current object inspection methods fall short of human efficiency because they typically cannot bootstrap prior and common sense knowledge as humans do. In this paper, we introduce a framework that enables robots to use semantic knowledge from prior spatial configurations of the environment and semantic common sense knowledge. We propose SEEK (Semantic Reasoning for Object Inspection Tasks) that combines semantic prior knowledge with the robot's observations to search for and navigate toward target objects more efficiently. SEEK maintains two representations: a Dynamic Scene Graph (DSG) and a Relational Semantic Network (RSN). The RSN is a compact and practical model that estimates the probability of finding the target object across spatial elements in the DSG. We propose a novel probabilistic planning framework to search for the object using relational semantic knowledge. Our simulation analyses demonstrate that SEEK outperforms the classical planning and Large Language Models (LLMs)-based methods that are examined in this study in terms of efficiency for object-goal inspection tasks. We validated our approach on a physical legged robot in urban environments, showcasing its practicality and effectiveness in real-world inspection scenarios.
https://arxiv.org/abs/2405.09822
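The role of the Relational Semantic Network can be sketched as a table of semantic priors that scores candidate nodes in the scene graph, so the robot visits the most promising place first. The probabilities, labels, and tie-breaking rule below are illustrative assumptions.

```python
# Hedged sketch: semantic priors guiding object-goal search over a scene graph.
RSN = {  # assumed P(find "fire extinguisher" | nearby semantic element)
    "hallway": 0.45, "stairwell": 0.30, "storage": 0.15, "office": 0.10,
}

def next_inspection_goal(scene_graph_nodes, visited):
    candidates = [n for n in scene_graph_nodes if n["id"] not in visited]
    # Rank unvisited nodes by semantic prior, breaking ties by travel cost.
    return max(candidates, key=lambda n: (RSN.get(n["label"], 0.01), -n["cost"]))

goal = next_inspection_goal(
    [{"id": 1, "label": "hallway", "cost": 5.0},
     {"id": 2, "label": "office", "cost": 1.0}],
    visited=set(),
)  # -> the hallway node, despite the longer travel
```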
Traditional security mechanisms isolate resources from users who should not access them. We reflect the compositional nature of such security mechanisms back into the structure of LLMs to build a provably secure LLM, which we term SecureLLM. Other approaches to LLM safety attempt to protect against bad actors or bad outcomes, but can only do so to a limited extent, making them inappropriate for sensitive data. SecureLLM blends access security with fine-tuning methods. Each data silo has a separate fine-tuning associated with it, and a user has access only to the collection of fine-tunings that they have permission for. The model must then perform compositional tasks at the intersection of those data silos with the combination of those individual fine-tunings. While applicable to any task, such as document QA or making API calls, in this work we concern ourselves with models that learn the layouts of new SQL databases to provide natural-language-to-SQL translation capabilities. Existing fine-tuning composition methods fail in this challenging environment, as they are not well equipped for handling compositional tasks. Compositionality remains a challenge for LLMs. We contribute both a difficult new compositional natural-language-to-SQL translation task and a new perspective on LLM security that allows models to be deployed to secure environments today.
https://arxiv.org/abs/2405.09805
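The access model can be sketched as one fine-tuning (e.g., a LoRA adapter) per data silo, loaded only when the user holds that permission. The permission table and the peft-style `load_adapter`/`set_adapter` composition are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: compose only the per-silo fine-tunings a user is entitled to.
PERMISSIONS = {"alice": {"hr_db", "finance_db"}, "bob": {"finance_db"}}

def load_user_model(base_model, user: str, silo_adapters: dict[str, str]):
    granted = PERMISSIONS.get(user, set())
    active = []
    for silo, adapter_path in silo_adapters.items():
        if silo in granted:  # never load a fine-tuning the user cannot access
            base_model.load_adapter(adapter_path, adapter_name=silo)
            active.append(silo)
    if active:
        base_model.set_adapter(active)  # compose the permitted fine-tunings
    return base_model
```

A query touching two silos then succeeds only for users whose composed adapters cover both, which is exactly the compositional setting the natural-language-to-SQL task probes.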