There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
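For concreteness, the text and video scores appear to follow Winoground-style pairing metrics; the sketch below assumes that convention (a pair counts toward the text score only if the true caption outscores the temporally flipped one for both videos, and symmetrically for the video score), with s a hypothetical 2x2 matrix of model matching scores.

    def text_score(s):
        # s[i][j]: matching score for caption i with video j;
        # index 0 = original, index 1 = temporally flipped counterpart.
        return s[0][0] > s[1][0] and s[1][1] > s[0][1]

    def video_score(s):
        # the correct video must win under both captions
        return s[0][0] > s[0][1] and s[1][1] > s[1][0]

    # benchmark scores: fraction of the 1000 pairs passing each check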
https://arxiv.org/abs/2410.02763
We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
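The erasure step is a plain linear-algebra operation; a minimal numpy sketch of orthogonalizing image features against a hallucinated object's feature direction (where in the network the edit is applied is simplified away here):

    import numpy as np

    def orthogonalize(image_feats, halluc_feat):
        # Remove the component of each image feature that lies along the
        # hallucinated object's feature direction.
        h = halluc_feat / np.linalg.norm(halluc_feat)  # unit direction
        coeffs = image_feats @ h                       # projection coefficients
        return image_feats - np.outer(coeffs, h)       # subtract the projections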
https://arxiv.org/abs/2410.02762
The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: (1) black-box nature with unknown detection principle, (2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
https://arxiv.org/abs/2410.02761
Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
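A hedged sketch of the SIEVE loop as described: the lightweight T5 filter decides in bulk, and only examples near its decision boundary are sent to GPT-4o, whose labels periodically fine-tune the filter. All three callables are illustrative stand-ins, not the paper's API.

    def sieve_filter(docs, cheap_prob, oracle_label, refit, budget, margin=0.2):
        # cheap_prob(doc) -> P(keep) from the T5 filter; oracle_label(doc) ->
        # bool from GPT-4o; refit(labeled) fine-tunes the T5 filter in place.
        labeled = []
        for doc in docs:
            p = cheap_prob(doc)
            # active learning: query the expensive oracle only when uncertain
            if abs(p - 0.5) < margin and budget > 0:
                labeled.append((doc, oracle_label(doc)))
                budget -= 1
                if len(labeled) % 100 == 0:
                    refit(labeled)      # background fine-tuning step
            yield doc, p > 0.5          # cheap decision for every document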
https://arxiv.org/abs/2410.02755
Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems solved by any attempt ("pass@k") given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
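A hedged, simplified sketch of the backward-sampling idea: repeatedly delete lines from a finished program while a correctness check still passes, then reverse the deletion order into an insertion-only edit sequence (consecutive states define the program diffs). A Python parse check stands in for the real linter.

    import random

    def lint_clean(lines):
        # stand-in linter: the program must still parse as Python
        try:
            compile("\n".join(lines), "<prog>", "exec")
            return True
        except SyntaxError:
            return False

    def lintseq_sample(lines, rng=None):
        rng = rng or random.Random(0)
        states, cur = [list(lines)], list(lines)
        while cur:
            # try to delete a random line while staying lint-clean
            for i in rng.sample(range(len(cur)), len(cur)):
                cand = cur[:i] + cur[i + 1:]
                if lint_clean(cand):
                    cur = cand
                    break
            else:
                break                     # nothing deletable; stop sampling
            states.append(list(cur))
        states.reverse()                  # insertion order: sparse -> full program
        return states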
https://arxiv.org/abs/2410.02749
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
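The region-text contrastive term can be read as a standard symmetric InfoNCE over matched region and region-caption embeddings; a minimal numpy sketch under that assumption (CLOC's exact pooling, prompting, and loss weighting may differ):

    import numpy as np
    from scipy.special import logsumexp

    def region_text_infonce(region_emb, text_emb, tau=0.07):
        # region_emb, text_emb: (N, D) L2-normalized; row i of each is a pair
        logits = region_emb @ text_emb.T / tau
        idx = np.arange(len(logits))
        r2t = logits - logsumexp(logits, axis=1, keepdims=True)  # region -> text
        t2r = logits - logsumexp(logits, axis=0, keepdims=True)  # text -> region
        return -(r2t[idx, idx].mean() + t2r[idx, idx].mean()) / 2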
https://arxiv.org/abs/2410.02746
Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04 $\times$, 1.54 $\times$, and 1.82 $\times$ across three different benchmarks, respectively. The resulting performance competes with or surpasses that of larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
https://arxiv.org/abs/2410.02740
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
https://arxiv.org/abs/2410.02730
Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, and have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
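A hedged sketch of the two-stage idea: passage embeddings are merged (mean-pooling is an assumption here; the paper's merge may differ) into one holistic document vector for retrieval, and the reranking step then decouples the relevant passage inside the retrieved document.

    import numpy as np

    def doc_embedding(passage_embs):
        # merge per-passage embeddings into one holistic document vector
        v = passage_embs.mean(axis=0)
        return v / np.linalg.norm(v)

    def retrieve_then_rerank(query_emb, docs):
        # docs: list of (N_i, D) arrays of passage embeddings, one per document
        doc_vecs = np.stack([doc_embedding(p) for p in docs])
        best_doc = int(np.argmax(doc_vecs @ query_emb))  # document-level retrieval
        passage_scores = docs[best_doc] @ query_emb      # rerank within the doc
        return best_doc, int(np.argmax(passage_scores))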
https://arxiv.org/abs/2410.02729
Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
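Since the self-evaluation is just the probability of one predefined token read mid-generation, the adaptive loop is tiny; a sketch with the generator and the probability read passed in as callables (the exact prompt and token are the paper's details, assumed away here):

    def adaptive_best_of_n(generate, p_restart_better, max_samples=16, stop_below=0.2):
        # generate() -> candidate response; p_restart_better(resp) -> the model's
        # own probability, in [0, 1], that regenerating would beat `resp`.
        best, best_p = None, 1.0
        for _ in range(max_samples):
            resp = generate()
            p = p_restart_better(resp)   # one predefined-token logit read
            if p < best_p:
                best, best_p = resp, p   # low p => model rates resp highly
            if best_p < stop_below:
                break                    # confident enough: stop sampling
        return best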
https://arxiv.org/abs/2410.02725
Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size $T$ and context window of size $K$ and Markov chains defined on a finite state space of size $\mathcal{O}(T^K)$. We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
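Concretely, one natural way to write the induced chain (a sketch of the construction under the stated sizes, not necessarily the paper's exact formalization) takes states to be token windows of length at most $K$:

    $$\mathcal{S} = \bigcup_{k=0}^{K} V^{k}, \qquad |V| = T, \qquad |\mathcal{S}| = \mathcal{O}(T^{K}),$$
    $$P(s \to s') = \begin{cases} p_{\theta}(v \mid s) & \text{if } s' = \mathrm{suffix}_{K}(s\,v) \text{ for some } v \in V, \\ 0 & \text{otherwise,} \end{cases}$$

where $p_{\theta}(\cdot \mid s)$ is the LM's next-token distribution and $\mathrm{suffix}_{K}$ keeps the most recent $K$ tokens; sampling at temperature $\tau$ replaces $p_{\theta}$ by the renormalized $p_{\theta}^{1/\tau}$, which is the knob behind the convergence-speed results.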
https://arxiv.org/abs/2410.02724
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
https://arxiv.org/abs/2410.02712
Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, compared to those without it, at the same validation perplexity.
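A heavily hedged numpy sketch of one plausible reading of the mechanism: one head's positive attention logits are reused as votes to mask tokens, the votes accumulate causally, and the accumulated penalty is subtracted from every head's logits before the softmax. The head choice, ReLU, and accumulation details are assumptions, not necessarily the paper's exact recipe.

    import numpy as np

    def selective_attention_logits(logits):
        # logits: (H, N, N) causally masked attention logits (future = -inf)
        s = np.maximum(logits[0], 0.0)  # head 0's positive logits = masking votes
        np.fill_diagonal(s, 0.0)        # a token never votes to mask itself
        f = np.cumsum(s, axis=0)        # votes accumulate as the context grows
        return logits - f[None, :, :]   # penalize unneeded tokens for all heads

Tokens whose accumulated penalty grows large can then be evicted from the context buffer entirely, which is where the reported memory savings during inference would come from.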
https://arxiv.org/abs/2410.02703
As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions, along with the affected parties and the human values invoked by each action. Based on these dilemmas, we consolidated a set of human values across everyday topics, e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology, and philosophy. These theories are: World Value Survey, Moral Foundation Theory, Maslow's Hierarchy of Needs, Aristotle's Virtues, and Plutchik Wheel of Emotion. We find that LLMs are most aligned with self-expression over survival values in terms of the World Value Survey, and with care over loyalty in Moral Foundation Theory. Interestingly, we find large preference differences across models for some core values such as truthfulness: e.g., the Mixtral-8x7B model tends to neglect it by 9.7% while the GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec) and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.
https://arxiv.org/abs/2410.02683
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to such differences in setup (e.g., a 27.3% difference for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open-source models on questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
https://arxiv.org/abs/2410.02677
Despite the remarkable success achieved by neural networks, particularly those represented by MLP and Transformer, we reveal that they exhibit potential flaws in the modeling and reasoning of periodicity, i.e., they tend to memorize the periodic data rather than genuinely understanding the underlying principles of periodicity. However, periodicity is a crucial trait in various forms of reasoning and generalization, underpinning predictability across natural and engineered systems through recurring patterns in observations. In this paper, we propose FAN, a novel network architecture based on Fourier Analysis, which empowers the ability to efficiently model and reason about periodic phenomena. By introducing Fourier Series, the periodicity is naturally integrated into the structure and computational processes of the neural network, thus achieving a more accurate expression and prediction of periodic patterns. As a promising substitute to multi-layer perceptron (MLP), FAN can seamlessly replace MLP in various models with fewer parameters and FLOPs. Through extensive experiments, we demonstrate the effectiveness of FAN in modeling and reasoning about periodic functions, and the superiority and generalizability of FAN across a range of real-world tasks, including symbolic formula representation, time series forecasting, and language modeling.
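A hedged torch sketch of a FAN-style layer: part of the output passes through cos/sin of a learned linear map, so exactly periodic structure is representable, while the rest goes through a standard activated branch. The split ratio, bias placement, and activation are assumptions about the parameterization.

    import torch
    import torch.nn as nn

    class FANLayer(nn.Module):
        # Periodic (cos/sin) branch concatenated with a standard nonlinear branch.
        def __init__(self, d_in, d_out, periodic_frac=0.25):
            super().__init__()
            d_p = int(d_out * periodic_frac)   # width of the periodic branch
            self.w_p = nn.Linear(d_in, d_p, bias=False)
            self.w_g = nn.Linear(d_in, d_out - 2 * d_p)
            self.act = nn.GELU()

        def forward(self, x):
            p = self.w_p(x)                    # shared by cos and sin
            return torch.cat([torch.cos(p), torch.sin(p), self.act(self.w_g(x))], dim=-1)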
https://arxiv.org/abs/2410.02675
We present the first correct-by-construction learning-based system for step-by-step mathematical integration. The key idea is to learn a policy, represented by a GPT transformer model, which guides the search for the right mathematical integration rule, to be carried out by a symbolic solver. Concretely, we introduce a symbolic engine with axiomatically correct actions on mathematical expressions, as well as the first dataset for step-by-step integration. Our GPT-style transformer model, trained on this synthetic data, demonstrates strong generalization by surpassing its own data generator in accuracy and efficiency, using 50% fewer search steps. Our experimental results with SoTA LLMs also demonstrate that the standard approach of fine-tuning LLMs on a set of question-answer pairs is insufficient for solving this mathematical task. This motivates the importance of discovering creative methods for combining LLMs with symbolic reasoning engines, of which our work is an instance.
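The correct-by-construction property falls out of the division of labor: the symbolic engine exposes only sound rewrites, and the learned policy merely orders them. A hedged sketch of that search loop, with all engine- and model-specific pieces passed in as illustrative callables:

    def integrate_stepwise(expr, rules, policy, solved, max_steps=50):
        # rules: list of (name, applies, apply), where `apply` is an axiomatically
        # correct rewrite; policy(expr, name) -> score from the trained model;
        # solved(expr) -> True once no integral signs remain.
        trace = []
        for _ in range(max_steps):
            if solved(expr):
                return trace              # every step was a sound rewrite
            usable = [r for r in rules if r[1](expr)]
            if not usable:
                return None               # dead end in the search
            # the policy only ranks correct rules; it cannot inject unsound steps
            name, _, apply = max(usable, key=lambda r: policy(expr, r[0]))
            expr = apply(expr)
            trace.append((name, expr))
        return None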
https://arxiv.org/abs/2410.02666
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning of humans. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.
https://arxiv.org/abs/2410.02664
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
https://arxiv.org/abs/2410.02660