We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
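The erasure step itself is a single linear-algebra operation. Below is a minimal sketch of orthogonalizing image features against a hallucinated object's embedding direction; the function name and array shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

def erase_object(image_feats: np.ndarray, obj_dir: np.ndarray) -> np.ndarray:
    """Remove the component of each image feature that aligns with a
    hallucinated object's embedding direction (linear orthogonalization)."""
    u = obj_dir / np.linalg.norm(obj_dir)              # unit vector for the object
    return image_feats - np.outer(image_feats @ u, u)  # subtract the projection

# toy usage: 196 patch features of dimension 512
feats = np.random.randn(196, 512)
direction = np.random.randn(512)
edited = erase_object(feats, direction)
assert np.allclose(edited @ (direction / np.linalg.norm(direction)), 0, atol=1e-6)
```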
https://arxiv.org/abs/2410.02762
The rapid development of generative AI is a double-edged sword: it not only facilitates content creation but also makes image manipulation easier to perform and harder to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: (1) a black-box nature with an unknown detection principle, and (2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered-region masks, and providing a judgment basis grounded in pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to handle various types of tamper-detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
https://arxiv.org/abs/2410.02761
Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining fluent generation when conditioned on an erased concept), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at this https URL
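The low-rank editing idea can be pictured as a LoRA-style delta on a frozen layer, trained so that the output distribution shifts for erased concepts. The sketch below is an assumed form of such an update; the module name, rank, and initialization are our choices, not ELM's released code:

```python
import torch
import torch.nn as nn

class LowRankEdit(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank delta (W + B A),
    the kind of targeted update an ELM-style method could train to push the
    output distribution away from an erased concept."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)   # original weights stay frozen
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)        # B starts at zero -> no-op at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LowRankEdit(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))                 # identical to the base layer at init
```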
https://arxiv.org/abs/2410.02760
It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation has been limited to generating short videos of a few seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: this https URL.
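As a concrete illustration of loss re-weighting for long-video training, here is one plausible per-frame weighting scheme; the exact schedule used by Loong may differ, and `gamma` and the frame layout below are assumptions:

```python
import torch

def reweighted_ce(logits, targets, frame_ids, gamma: float = 0.5):
    """Cross-entropy over video tokens with a per-frame weight that grows with
    the frame index -- one plausible instance of the short-to-long loss
    re-weighting idea (the paper's actual schedule may differ).
    logits: (seq, vocab); targets: (seq,); frame_ids: (seq,) frame index per token."""
    ce = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    w = (1.0 + frame_ids.float()) ** gamma       # later frames weigh more
    return (w * ce).sum() / w.sum()

loss = reweighted_ce(torch.randn(12, 100), torch.randint(0, 100, (12,)),
                     torch.arange(12) // 4)      # 3 frames x 4 tokens each
```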
https://arxiv.org/abs/2410.02757
Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
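The active-learning loop behind SIEVE can be summarized in a few lines: query the expensive model only where the cheap filter is least certain, then fine-tune on the new labels. Both scoring functions below are placeholders standing in for the T5 filter and the GPT-4o call:

```python
import random

def lightweight_score(text: str) -> float:
    """Placeholder for the T5 filter's probability that `text` passes."""
    return random.random()

def oracle_label(text: str) -> int:
    """Placeholder for a (costly) GPT-4o filtering call."""
    return int("physics" in text)                # stand-in labeling rule

def active_learning_round(pool, budget=8, band=0.5):
    """Spend the oracle budget on the texts the cheap filter is least sure
    about (scores near 0.5) -- the core of a SIEVE-style distillation loop."""
    uncertain = sorted(pool, key=lambda t: abs(lightweight_score(t) - band))
    labeled = [(t, oracle_label(t)) for t in uncertain[:budget]]
    # fine-tune the lightweight filter on `labeled` here, then repeat
    return labeled

print(active_learning_round([f"doc {i} about physics" for i in range(100)]))
```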
https://arxiv.org/abs/2410.02755
Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e., the fraction of problems ("pass@k") solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
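To make the edit-sampling idea concrete, here is a heavily simplified, forward-greedy sketch; LintSeq itself samples across error-free insertions procedurally with a real linter, whereas this stand-in merely checks that each prefix parses:

```python
import difflib

def lints_clean(code: str) -> bool:
    """Stand-in 'linter': accept any program that at least parses."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def edit_sequence(lines):
    """Rebuild a program line by line, keeping only insertions that leave an
    error-free state, and emit each step as a unified diff -- a simplified
    sketch of LintSeq-style edit-sequence generation."""
    state, diffs = [], []
    for line in lines:
        candidate = state + [line]
        if lints_clean("\n".join(candidate)):
            diffs.append("\n".join(difflib.unified_diff(state, candidate, lineterm="")))
            state = candidate
    return diffs

for d in edit_sequence(["x = 1", "y = x + 1", "print(y)"]):
    print(d, "\n---")
```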
https://arxiv.org/abs/2410.02749
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
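Folding extracted keyphrases into the prompt is straightforward; the sketch below shows one way to do it, with a crude frequency-based placeholder standing in for the finetuned CriSPO extractor:

```python
from collections import Counter

def extract_keyphrases(document: str, k: int = 5):
    """Placeholder for a finetuned keyphrase extractor; here, just the k most
    frequent capitalized words as a crude stand-in."""
    words = [w.strip(".,") for w in document.split() if w[:1].isupper()]
    return [w for w, _ in Counter(words).most_common(k)]

def build_prompt(document: str, k: int = 5) -> str:
    """Fold the extracted phrases into the summarization prompt; per the
    abstract, more phrases push recall up, while fewer favor precision."""
    phrases = ", ".join(extract_keyphrases(document, k))
    return (f"Summarize the article below. Make sure the summary covers: "
            f"{phrases}.\n\nArticle:\n{document}")

print(build_prompt("NASA launched Artemis. Artemis is a NASA Moon program."))
```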
https://arxiv.org/abs/2410.02748
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
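The added region-text objective is, at its core, a contrastive loss over matched region/caption pairs. A generic symmetric InfoNCE of that form is sketched below; the dimensions and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def region_text_contrastive(region_emb, text_emb, tau: float = 0.07):
    """Symmetric InfoNCE over matched (region, region-caption) pairs -- the
    general form of loss a CLOC-style method adds on top of CLIP's
    image-level objective. region_emb, text_emb: (N, D), matched row-wise."""
    r = F.normalize(region_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = r @ t.T / tau                       # (N, N) similarity matrix
    labels = torch.arange(r.size(0))             # diagonal entries are positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = region_text_contrastive(torch.randn(32, 512), torch.randn(32, 512))
```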
https://arxiv.org/abs/2410.02746
We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen no or little training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity and they degrade performance in the original domain. Our paper analyzes this extension problem from three angles: data, architecture, and training procedure, which are best considered jointly. In particular, we improve adapters and make it possible to learn an entirely new language while ensuring that the output of the neural network remains almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than competing approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
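One way to realize residual blocks that stay neutral on the original domain is to gate the adapter's output with a learned, MoE-style routing scalar. The sketch below is our reading of that mechanism, not the paper's exact block:

```python
import torch
import torch.nn as nn

class NeutralResidue(nn.Module):
    """Adapter block whose contribution is multiplied by a learned gate, so it
    can learn to emit near-zero residuals on original-domain inputs while
    activating on the new domain -- a sketch of the 'neutral residues' idea."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                nn.Linear(hidden, dim))
        self.gate = nn.Linear(dim, 1)            # MoE-style routing scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))          # trained toward ~0 on the old domain
        return x + g * self.ff(x)                # residual stays neutral when g ~ 0

block = NeutralResidue(dim=768, hidden=1536)
y = block(torch.randn(4, 10, 768))
```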
https://arxiv.org/abs/2410.02744
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL.
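Credit assignment over macro actions can be illustrated by collapsing token-level rewards into span-level returns before discounting. The sketch below assumes span boundaries are given; how MA-RLHF actually segments macro actions may differ:

```python
import torch

def macro_returns(token_rewards, boundaries, gamma: float = 1.0):
    """Collapse per-token rewards into per-macro-action returns by summing
    within each span, then discounting across macro actions -- credit is
    assigned over far fewer steps than in token-level RLHF.
    token_rewards: (T,); boundaries: list of (start, end) spans covering 0..T."""
    macro_r = torch.stack([token_rewards[s:e].sum() for s, e in boundaries])
    returns = torch.zeros_like(macro_r)
    running = 0.0
    for i in reversed(range(len(macro_r))):      # discount over macro steps only
        running = macro_r[i] + gamma * running
        returns[i] = running
    return returns

r = macro_returns(torch.randn(12), [(0, 5), (5, 9), (9, 12)])
```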
https://arxiv.org/abs/2410.02743
Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The resulting performance competes with or surpasses that of much larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
https://arxiv.org/abs/2410.02741
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and serves as a source of supervised rewards in model training. However, despite its excellence in many domains, potential issues remain under-explored, undermining the reliability and the scope of its utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge using automated, principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models achieve commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
https://arxiv.org/abs/2410.02736
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
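The expert trajectories come from a plain BFS shortest-path planner, which needs no human supervision. A textbook grid-world version is sketched below; the actual planner operates in 3D scenes rather than a 2D grid:

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Shortest collision-free path via BFS -- the kind of planner used to
    produce expert trajectories for imitation learning without human
    supervision. grid[r][c] == 0 means free space; cells are (row, col)."""
    q, parent = deque([start]), {start: None}
    while q:
        cur = q.popleft()
        if cur == goal:                          # reconstruct path back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cur
                q.append((nr, nc))
    return None                                  # goal unreachable

print(bfs_shortest_path([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0)))
```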
https://arxiv.org/abs/2410.02730
Information Retrieval (IR) methods aim to identify relevant documents in response to a given query and have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, overlooking the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of segmented passages into one single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document when necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
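The merge-then-rerank pipeline can be sketched in a few lines: pool passage embeddings into one document vector for retrieval, then score the winning document's own passages against the query. Mean pooling here is an assumed merge; the paper's may differ:

```python
import numpy as np

def doc_embedding(passage_embs: np.ndarray) -> np.ndarray:
    """Merge segmented-passage embeddings into a single document vector;
    mean pooling is the simplest choice of merge."""
    v = passage_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve_then_rerank(query, docs):
    """Rank whole documents first, then decouple: rerank the best document's
    own passages against the query to surface the relevant span."""
    q = query / np.linalg.norm(query)
    scores = [q @ doc_embedding(p) for p in docs]
    best = int(np.argmax(scores))
    order = np.argsort(-(docs[best] @ q))        # passages of the best doc, ranked
    return best, order

q = np.random.randn(256)
docs = [np.random.randn(np.random.randint(3, 7), 256) for _ in range(4)]
print(retrieve_then_rerank(q, docs))
```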
https://arxiv.org/abs/2410.02729
Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
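Reading the self-evaluation off a single predefined token keeps the check nearly free. A sketch of that decision rule is below; the "Yes"/"No" token names and the threshold are illustrative, not the paper's exact setup:

```python
import math

def p_restart_better(eval_token_logits: dict) -> float:
    """Probability that restarting yields a better response, read off a single
    predefined token's logits (softmax over the two options) -- no external
    reward model involved. Token names here are illustrative."""
    z_yes, z_no = eval_token_logits["Yes"], eval_token_logits["No"]
    return math.exp(z_yes) / (math.exp(z_yes) + math.exp(z_no))

def keep_sampling(eval_token_logits, threshold: float = 0.5) -> bool:
    """Adaptive Best-of-N: draw another sample only when the model itself
    predicts a restart is likely to improve on the current response."""
    return p_restart_better(eval_token_logits) > threshold

print(keep_sampling({"Yes": 1.3, "No": 0.2}))    # True: worth another sample
```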
https://arxiv.org/abs/2410.02725
Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size $T$ and context window of size $K$ and Markov chains defined on a finite state space of size $\mathcal{O}(T^K)$. We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
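The equivalence is easy to see at a toy scale: with vocabulary size T and context window K, the contexts form the O(T^K) states, and the LM's next-token distribution induces the transition matrix. A tiny worked example, with a random stand-in for the LM:

```python
import itertools
import numpy as np

T, K = 2, 2                                      # tiny vocabulary and context window
vocab = list(range(T))
states = list(itertools.product(vocab, repeat=K))  # |S| = T**K = 4 contexts

def lm_probs(state):
    """Placeholder next-token distribution; a real LM would supply this."""
    rng = np.random.default_rng(hash(state) % 2**32)
    p = rng.random(T)
    return p / p.sum()

# The induced Markov chain: from context s, emitting token t moves to the
# shifted context s[1:] + (t,), with probability lm_probs(s)[t].
P = np.zeros((len(states), len(states)))
for i, s in enumerate(states):
    for t, pt in zip(vocab, lm_probs(s)):
        P[i, states.index(s[1:] + (t,))] += pt

assert np.allclose(P.sum(axis=1), 1.0)           # each row is a distribution
```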
https://arxiv.org/abs/2410.02724
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimizing LLM responses by anchoring them to a predetermined ontology. It has been shown that using a Knowledge Graph (KG) ontology for RAG improves QA accuracy by taking into account relevant sub-graphs that preserve information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
https://arxiv.org/abs/2410.02721
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty enhances model calibration, improving robustness and mitigating semantic inconsistencies introduced by random chunking. Leveraging this insight, we propose an efficient unsupervised learning technique to train the retrieval model, alongside an effective data sampling and scaling strategy. UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results while using only 4% of the training data compared to other advanced open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, leading to improved generalization and robustness in long-context RAG tasks. Additionally, UncertaintyRAG provides a lightweight retrieval model that can be integrated into any large language model with varying context window lengths, without the need for fine-tuning, showcasing the flexibility of our approach.
https://arxiv.org/abs/2410.02719
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
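Detection then reduces to probing the hidden state at the right position. Below is a minimal sketch of a linear truthfulness probe trained on one chosen token per example; the data and token positions are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_on_token(hidden_states, token_idx, labels):
    """Train a linear probe on the hidden state at one chosen position per
    example (e.g. the answer token) -- the abstract's point is that picking
    the right token matters far more than probing everywhere.
    hidden_states: (N, seq, d); token_idx: (N,); labels: (N,), 1 = truthful."""
    X = hidden_states[np.arange(len(labels)), token_idx]   # (N, d) selected states
    return LogisticRegression(max_iter=1000).fit(X, labels)

# synthetic stand-ins for real model activations and correctness labels
h = np.random.randn(200, 16, 64)
probe = probe_on_token(h, np.random.randint(0, 16, 200),
                       np.random.randint(0, 2, 200))
```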
https://arxiv.org/abs/2410.02707