It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: this https URL.
https://arxiv.org/abs/2410.02757
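The abstract does not spell out the exact re-weighting scheme, so the following is only a minimal PyTorch sketch of the general idea of frame-dependent loss re-weighting over video tokens; the weighting schedule and all shapes are assumptions for illustration, not Loong's actual implementation.

```python
import torch
import torch.nn.functional as F

def reweighted_video_loss(logits, targets, frame_index, frame_weights):
    """Per-token cross-entropy with a frame-dependent weight.

    logits:        (B, L, V) next-token predictions over the unified vocabulary
    targets:       (B, L)    ground-truth video token ids
    frame_index:   (B, L)    which video frame each token belongs to
    frame_weights: (num_frames,) hypothetical weights mitigating loss imbalance
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)                      # (B, L)
    weights = frame_weights[frame_index]          # broadcast a weight to every token
    return (weights * per_token).sum() / weights.sum()

# toy usage: 2 clips, 12 tokens each, 3 frames of 4 tokens, vocabulary of 1024
logits = torch.randn(2, 12, 1024)
targets = torch.randint(0, 1024, (2, 12))
frame_index = torch.arange(12).repeat(2, 1) // 4   # frame id per token
frame_weights = torch.linspace(1.0, 0.5, 3)        # assumed schedule, not the paper's
loss = reweighted_video_loss(logits, targets, frame_index, frame_weights)
```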
Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
https://arxiv.org/abs/2410.02755
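As a rough illustration of the SIEVE recipe, here is a hedged sketch of pool-based active learning: an expensive oracle (standing in for GPT-4o) labels only the most uncertain items, while a cheap student (logistic regression over TF-IDF, standing in for the T5 filter) is refit in the background. The data, oracle, seed set, and query budget are toy assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def oracle_label(text):
    # Placeholder for an expensive GPT-4o filtering call (assumption).
    return int("physics" in text.lower())

pool = [f"document {i} about physics" if i % 3 == 0 else f"document {i} about cooking"
        for i in range(300)]

vec = TfidfVectorizer().fit(pool)
X = vec.transform(pool)

# Seed with a few oracle labels, then repeatedly query the most uncertain items.
labeled = list(range(10))
y = {i: oracle_label(pool[i]) for i in labeled}

for _ in range(5):                        # 5 active-learning rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], [y[i] for i in labeled])
    probs = clf.predict_proba(X)[:, 1]
    uncertainty = np.abs(probs - 0.5)
    candidates = [i for i in np.argsort(uncertainty) if i not in y][:10]
    for i in candidates:                  # only these few items hit the expensive oracle
        y[i] = oracle_label(pool[i])
        labeled.append(i)

# Once trained, the lightweight student filters the whole pool on its own.
keep = [doc for doc, p in zip(pool, clf.predict_proba(X)[:, 1]) if p > 0.5]
```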
Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e., the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
https://arxiv.org/abs/2410.02749
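The sketch below imitates the LintSeq idea under stated assumptions: Python's `compile()` stands in for the linter, intermediate states are sampled backward by deleting lines that keep the program parseable, and the reversed trajectory is emitted as consecutive unified diffs. The real algorithm's sampling procedure may differ.

```python
import difflib
import random

def passes_check(lines):
    """Crude stand-in for a linter: the partial program must still parse."""
    try:
        compile("\n".join(lines), "<synthetic>", "exec")
        return True
    except SyntaxError:
        return False

def sample_edit_sequence(program, rng):
    """Delete lines one at a time while the remainder still parses, then
    reverse the trajectory into a sequence of insertion diffs."""
    states = [program.splitlines()]
    while states[-1]:
        current = states[-1]
        order = list(range(len(current)))
        rng.shuffle(order)
        for i in order:
            candidate = current[:i] + current[i + 1:]
            if passes_check(candidate):
                states.append(candidate)
                break
        else:                      # no single-line deletion keeps it parseable
            states.append([])      # fall back to removing the rest in one step
    states.reverse()               # insertion trajectory: empty -> full program
    diffs = []
    for before, after in zip(states, states[1:]):
        diffs.append("".join(difflib.unified_diff(
            [line + "\n" for line in before], [line + "\n" for line in after])))
    return diffs

program = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
for d in sample_edit_sequence(program, random.Random(0)):
    print(d)
```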
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
https://arxiv.org/abs/2410.02748
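A minimal sketch of keyphrase-augmented prompting as described above; the frequency-based extractor and the prompt wording are placeholders, since the paper uses a finetuned extractor and does not publish its exact template here. The number of keyphrases `k` is the knob that trades precision against recall.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "is", "are", "that", "with"}

def cheap_keyphrases(document, k=5):
    """Frequency-based stand-in for the finetuned keyphrase extractor."""
    words = [w for w in re.findall(r"[a-zA-Z][a-zA-Z-]+", document.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def build_prompt(document, k=5):
    phrases = ", ".join(cheap_keyphrases(document, k))
    return (
        "Summarize the article below.\n"
        f"Make sure the summary covers these keyphrases: {phrases}.\n\n"
        f"Article:\n{document}\n\nSummary:"
    )

print(build_prompt("The central bank raised interest rates again as inflation stayed high ..."))
```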
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at the image level. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
https://arxiv.org/abs/2410.02746
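CLOC's full module design (including promptable embeddings) is not reproduced here; the following is only the generic symmetric region-text InfoNCE term that such a region-level contrastive objective builds on, written in PyTorch with dummy embeddings.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (region, region-caption) pairs.

    region_emb: (N, D) pooled embeddings of N image regions (e.g., from box prompts)
    text_emb:   (N, D) embeddings of the paired region captions / pseudo-labels
    """
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(region_emb.size(0), device=region_emb.device)
    loss_r2t = F.cross_entropy(logits, targets)
    loss_t2r = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_r2t + loss_t2r)

loss = region_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```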
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL .
https://arxiv.org/abs/2410.02743
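A hedged sketch of the macro-action idea: per-token log-probabilities and advantages are grouped into fixed-size segments (one simple choice of macro action) and the policy-gradient term is applied at that coarser level. The segment size, padding, and mean-pooled advantage are assumptions for illustration, not MA-RLHF's exact formulation.

```python
import torch

def macro_policy_gradient(token_logprobs, token_advantages, macro_size=5):
    """Group per-token quantities into macro actions and apply REINFORCE-style
    weighting at the macro level instead of per token.

    token_logprobs:   (B, L) log pi(a_t | s_t) for the sampled tokens
    token_advantages: (B, L) per-token advantage estimates
    """
    B, L = token_logprobs.shape
    pad = (-L) % macro_size
    if pad:  # right-pad so the sequence splits evenly into macro actions
        token_logprobs = torch.nn.functional.pad(token_logprobs, (0, pad))
        token_advantages = torch.nn.functional.pad(token_advantages, (0, pad))
    macro_logprobs = token_logprobs.view(B, -1, macro_size).sum(-1)       # log-prob of each macro action
    macro_advantages = token_advantages.view(B, -1, macro_size).mean(-1)  # one advantage per macro action
    return -(macro_logprobs * macro_advantages.detach()).mean()

loss = macro_policy_gradient(torch.randn(2, 23, requires_grad=True), torch.randn(2, 23))
```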
Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3 with performance boosts of 2.04 $\times$, 1.54 $\times$, and 1.82 $\times$ across three different benchmarks, respectively. The resulting performance competes with or surpasses that of larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
https://arxiv.org/abs/2410.02741
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
https://arxiv.org/abs/2410.02740
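A minimal sketch of the hybrid strategy the abstract describes, mixing web-crawled AltText and synthetic captions per sample during pre-training; the field names (`ssc`, `dsc_plus`) and the mixing probability are illustrative assumptions rather than the paper's configuration.

```python
import random

def sample_caption(example, p_alt=0.5, synthetic_key="dsc_plus", rng=random):
    """Per-sample mixing of original AltText and a synthetic caption.

    example: dict with "alt_text" plus one or more synthetic caption fields
             (keys such as "ssc" / "dsc_plus" are hypothetical).
    p_alt:   probability of keeping the web-crawled AltText for this sample.
    """
    if rng.random() < p_alt or not example.get(synthetic_key):
        return example["alt_text"]
    return example[synthetic_key]

batch = [
    {"alt_text": "IMG_0042.jpg", "ssc": "A dog on a beach.",
     "dsc_plus": "A golden retriever running along a sandy beach at sunset."},
]
captions = [sample_caption(ex, p_alt=0.5) for ex in batch]
```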
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and has served as a source of supervised rewards in model training. However, despite their excellence in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
https://arxiv.org/abs/2410.02736
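CALM covers 12 bias types; the sketch below shows just one principle-guided perturbation, order swapping, and measures how often a pairwise judge's verdict flips, as a toy bias score. The judge interface and metric are assumptions, not CALM's actual API.

```python
def position_bias_rate(judge, pairs):
    """Fraction of pairwise judgments that flip when answer order is swapped.

    judge: callable (question, answer_a, answer_b) -> "A" or "B"
    pairs: list of (question, answer_1, answer_2) tuples
    """
    flips = 0
    for q, a1, a2 in pairs:
        first = judge(q, a1, a2)
        second = judge(q, a2, a1)
        # A consistent judge should pick the same underlying answer both times.
        if (first == "A") != (second == "B"):
            flips += 1
    return flips / len(pairs)

# toy judge that always prefers whichever answer is shown first (maximally biased)
always_first = lambda q, a, b: "A"
print(position_bias_rate(always_first, [("q", "x", "y")] * 10))   # -> 1.0
```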
Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
https://arxiv.org/abs/2410.02725
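A hedged sketch of adaptive best-of-N driven by self-evaluation: sampling continues only while the model's own predicted probability that a restart would do better stays above a threshold. `generate`, `p_restart_better`, and `score` are placeholders for the real model calls, and the threshold is an assumption.

```python
def adaptive_best_of_n(generate, p_restart_better, score, max_samples=16, threshold=0.3):
    """Keep sampling only while the model itself predicts a restart is likely
    to beat the best response found so far."""
    best = generate()
    used = 1
    while used < max_samples and p_restart_better(best) > threshold:
        candidate = generate()
        used += 1
        if score(candidate) > score(best):
            best = candidate
    return best, used

# toy usage with stand-ins for the real model calls
import random
rng = random.Random(0)
best, used = adaptive_best_of_n(
    generate=lambda: rng.random(),
    p_restart_better=lambda b: 1.0 - b,   # the better the current best, the less worth restarting
    score=lambda r: r,
)
```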
Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size $T$ and context window of size $K$ and Markov chains defined on a finite state space of size $\mathcal{O}(T^K)$. We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
https://arxiv.org/abs/2410.02724
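The equivalence can be made concrete for a toy model: with vocabulary size $T$ and context window $K$, the induced Markov chain lives on the $T^K$ possible contexts, and a stationary distribution can be approximated by power iteration. A small numpy sketch with random next-token tables standing in for a real LLM:

```python
import itertools
import numpy as np

# Toy autoregressive LM: vocabulary of size T, context window K, so the induced
# Markov chain lives on the T**K possible contexts.
T, K = 3, 2
rng = np.random.default_rng(0)
next_token_probs = rng.dirichlet(np.ones(T), size=T**K)   # P(next token | context)

contexts = list(itertools.product(range(T), repeat=K))
index = {c: i for i, c in enumerate(contexts)}

# Transition matrix over contexts: appending token t maps context (a, b) -> (b, t).
P = np.zeros((T**K, T**K))
for c in contexts:
    for t in range(T):
        P[index[c], index[c[1:] + (t,)]] += next_token_probs[index[c], t]

# Stationary distribution via power iteration.
pi = np.full(T**K, 1.0 / T**K)
for _ in range(10_000):
    pi = pi @ P
print(pi.round(4), "row sums:", P.sum(axis=1).round(6))
```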
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimizing LLM responses by referencing them against a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework, that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
https://arxiv.org/abs/2410.02721
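A minimal sketch of the retrieval side only: a hand-built triple list stands in for the KG and a TF-IDF index for the vector store, and the two are combined for a query. The actual system builds these stores with NLP, data mining, and nonnegative tensor factorization rather than the toy structures assumed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, hand-built stand-ins for the LLM-free KG and vector store.
kg_triples = [
    ("Emotet", "is_a", "banking trojan"),
    ("Emotet", "spreads_via", "malicious email attachments"),
]
vs_documents = [
    "Emotet campaigns in 2019 used thread-hijacked emails to deliver payloads.",
    "Isolation forests are a common choice for unsupervised anomaly detection.",
]

vectorizer = TfidfVectorizer().fit(vs_documents)
vs_matrix = vectorizer.transform(vs_documents)

def retrieve(question, top_k=1):
    """Combine structured KG facts with unstructured passages for the prompt."""
    facts = [" ".join(t) for t in kg_triples if t[0].lower() in question.lower()]
    sims = cosine_similarity(vectorizer.transform([question]), vs_matrix)[0]
    passages = [vs_documents[i] for i in sims.argsort()[::-1][:top_k]]
    return {"facts": facts, "passages": passages}

context = retrieve("How does Emotet spread?")
```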
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
https://arxiv.org/abs/2410.02707
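A sketch of the probing setup on synthetic data: a linear probe reads the hidden state at one selected token per answer and predicts correctness. The planted "truthfulness direction" and the token-selection rule are artificial stand-ins; in the paper the hidden states come from a real LLM and the signal concentrates in specific tokens of the generated answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, seq_len, d = 400, 20, 64
hidden_states = rng.normal(size=(n, seq_len, d))   # per-token hidden vectors (synthetic)
answer_token = rng.integers(0, seq_len, size=n)    # which token the probe should read
is_correct = rng.integers(0, 2, size=n)

# Inject a weak truthfulness direction at the selected token only (toy data).
direction = rng.normal(size=d)
hidden_states[np.arange(n), answer_token] += np.outer(2 * is_correct - 1, direction)

X = hidden_states[np.arange(n), answer_token]      # probe only the selected token
X_tr, X_te, y_tr, y_te = train_test_split(X, is_correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("error-detection accuracy:", probe.score(X_te, y_te))
```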
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. While recent works have demonstrated that state-of-the-art schemes are in fact vulnerable to spoofing, they lack deeper qualitative analysis of the texts produced by spoofing methods. In this work, we for the first time reveal that there are observable differences between genuine and spoofed watermark texts. Namely, we show that regardless of their underlying approach, all current spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts, effectively discovering that a watermark was spoofed. Our experimental evaluation shows high test power across all current spoofing methods, providing insights into their fundamental limitations, and suggesting a way to mitigate this threat.
https://arxiv.org/abs/2410.02693
As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches, which rely heavily on refusal strategies such as training models to completely reject harmful prompts or applying coarse filters, are limited by their binary nature. These methods either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtle harmful content. For example, LLMs may refuse to provide basic, public information about medication due to misuse concerns. Moreover, these refusal-based methods struggle to handle mixed-content scenarios and lack the ability to adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content by leveraging intermediate hidden states. This fine-grained approach allows for more nuanced, context-aware moderation, enabling the model to generate informative responses while selectively redacting or replacing sensitive information, rather than outright refusal. We also contribute a comprehensive dataset with token-level fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
https://arxiv.org/abs/2410.02684
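The sketch below shows the general shape of token-level moderation over intermediate hidden states: a small (untrained, random) per-token scorer flags tokens, and only those tokens are redacted rather than refusing the whole response. The module name and threshold are assumptions; Prism's actual architecture and training are not reproduced.

```python
import torch
import torch.nn as nn

class TokenModerationHead(nn.Module):
    """Tiny per-token classifier over intermediate hidden states (a stand-in for
    the Prism router described in the abstract)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):                               # (B, L, H)
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)  # (B, L) harm probability

def redact(tokens, harm_probs, threshold=0.5, mask="[REDACTED]"):
    """Replace only the tokens flagged as harmful instead of refusing outright."""
    return [mask if p > threshold else tok for tok, p in zip(tokens, harm_probs.tolist())]

head = TokenModerationHead(hidden_dim=16)
hidden = torch.randn(1, 5, 16)                    # hidden states for 5 tokens
tokens = ["the", "dosage", "is", "reasonable", "here"]
print(redact(tokens, head(hidden)[0]))
```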
As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions, and for each action, the affected parties and the human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics, e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology, and philosophy. These theories are: the World Values Survey, Moral Foundations Theory, Maslow's Hierarchy of Needs, Aristotle's Virtues, and Plutchik's Wheel of Emotions. We find that LLMs are most aligned with self-expression over survival values in terms of the World Values Survey, and with care over loyalty in Moral Foundations Theory. Interestingly, we find large preference differences across models for some core values, such as truthfulness: e.g., the Mixtral-8x7B model tends to neglect it by 9.7%, while the GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec) and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.
https://arxiv.org/abs/2410.02683
Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
https://arxiv.org/abs/2410.02678
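A hedged sketch of the self-supervised distillation idea: match the speech model's next-token distribution (given audio) to the text-only model's distribution (given the transcript) with a KL term, requiring no annotated responses. DiVA's exact objective may differ; this is only the generic form of such a distillation loss.

```python
import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL between the text-only teacher's next-token distribution (computed on the
    transcript) and the speech student's distribution (computed on the audio).

    student_logits, teacher_logits: (B, L, V)
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

loss = response_distillation_loss(torch.randn(2, 7, 32000), torch.randn(2, 7, 32000))
```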
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions, each verified by five independent annotators, span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to this difference in setup (e.g., a 27.3% difference for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge on a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open-source models on questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
https://arxiv.org/abs/2410.02677
We present the first correct-by-construction learning-based system for step-by-step mathematical integration. The key idea is to learn a policy, represented by a GPT transformer model, which guides the search for the right mathematical integration rule, to be carried out by a symbolic solver. Concretely, we introduce a symbolic engine with axiomatically correct actions on mathematical expressions, as well as the first dataset for step-by-step integration. Our GPT-style transformer model, trained on this synthetic data, demonstrates strong generalization by surpassing its own data generator in accuracy and efficiency, using 50% fewer search steps. Our experimental results with SoTA LLMs also demonstrate that the standard approach of fine-tuning LLMs on a set of question-answer pairs is insufficient for solving this mathematical task. This motivates the importance of discovering creative methods for combining LLMs with symbolic reasoning engines, of which our work is an instance.
https://arxiv.org/abs/2410.02666
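A toy sketch of the policy-plus-symbolic-engine split using sympy: a handful of axiomatically correct rules do the rewriting, a trivial fixed ordering stands in for the learned GPT policy, and the result is verified by differentiation. The paper's symbolic engine, rule set, and dataset are not reproduced here.

```python
import sympy as sp

x = sp.Symbol("x")

def constant_rule(e):
    return e * x if not e.has(x) else None

def power_rule(e):
    if e == x:
        return x**2 / 2
    if e.is_Pow and e.base == x and not e.exp.has(x) and e.exp != -1:
        return x**(e.exp + 1) / (e.exp + 1)
    return None

def sum_rule(e):
    if e.is_Add:
        parts = [integrate_steps(a) for a in e.args]
        return None if None in parts else sp.Add(*parts)
    return None

def constant_multiple_rule(e):
    coeff, rest = e.as_independent(x)
    if coeff != 1 and rest != 1:
        inner = integrate_steps(rest)
        return None if inner is None else coeff * inner
    return None

RULES = [constant_rule, power_rule, sum_rule, constant_multiple_rule]

def policy(expression):
    """Stand-in for the GPT policy: a fixed rule ordering instead of a learned one."""
    return RULES

def integrate_steps(expression):
    for rule in policy(expression):
        result = rule(expression)
        if result is not None:
            return result
    return None   # no applicable rule in this tiny axiom set

expr = 3 * x**2 + 5
antiderivative = integrate_steps(expr)
assert sp.simplify(sp.diff(antiderivative, x) - expr) == 0   # verify the rule applications
```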