There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
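The text and video scores follow a Winoground-style counterfactual protocol. A minimal sketch of how such scores are typically computed is below; the match function s(v, c) and the exact aggregation are illustrative assumptions, not the released evaluation code.

```python
# Sketch of Winoground-style scoring for temporal counterfactual pairs.
# Each sample holds two videos (v0, v1) and two captions (c0, c1) that use the
# same words but describe opposite temporal orders; s(v, c) is any model's
# video-caption match score. Names and structure here are assumptions.

def text_score(s, v0, v1, c0, c1):
    # Given each video, the model must prefer that video's own caption.
    return s(v0, c0) > s(v0, c1) and s(v1, c1) > s(v1, c0)

def video_score(s, v0, v1, c0, c1):
    # Given each caption, the model must prefer that caption's own video.
    return s(v0, c0) > s(v1, c0) and s(v1, c1) > s(v0, c1)

def group_score(s, v0, v1, c0, c1):
    return text_score(s, v0, v1, c0, c1) and video_score(s, v0, v1, c0, c1)

def evaluate(samples, s):
    n = len(samples)
    return {
        "text":  sum(text_score(s, *x) for x in samples) / n,
        "video": sum(video_score(s, *x) for x in samples) / n,
        "group": sum(group_score(s, *x) for x in samples) / n,
    }
```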
https://arxiv.org/abs/2410.02763
The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: (1) black-box nature with an unknown detection principle, and (2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
https://arxiv.org/abs/2410.02761
Creating specialized large language models requires vast amounts of clean, special-purpose data for training and fine-tuning. With only a handful of large-scale, domain-specific datasets in existence, new datasets must be created for most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high-quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
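The core mechanism is an active-learning loop in which the expensive teacher is queried only where the cheap student is uncertain. Below is a simplified sketch under assumed interfaces (student.predict_proba, student.finetune, teacher_label); it is not SIEVE's actual implementation.

```python
# Simplified sketch of SIEVE-style active distillation: a lightweight student
# filter (e.g. T5) is fine-tuned on labels from an expensive teacher
# (e.g. GPT-4o), which is queried only on the documents the student is least
# certain about. All interfaces here are illustrative assumptions.

def train_sieve_student(pool, teacher_label, student, budget, batch=32):
    labeled = []
    while budget > 0 and pool:
        # Rank unlabeled documents by student uncertainty (probability near 0.5).
        pool.sort(key=lambda doc: abs(student.predict_proba(doc) - 0.5))
        k = min(batch, budget)
        query, pool = pool[:k], pool[k:]
        labeled += [(doc, teacher_label(doc)) for doc in query]  # few GPT-4o calls
        budget -= len(query)
        student.finetune(labeled)   # background fine-tuning of the cheap student
    return student                  # then filters web-scale data at ~1% of the cost
```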
https://arxiv.org/abs/2410.02755
Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction-finetune a series of smaller LLMs, ranging from 2.6B to 14B parameters, on both the refactored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit-sequence-finetuned models produce more diverse programs than baselines. This results in better inference-time scaling of benchmark coverage as a function of samples, i.e., the fraction of problems ("pass@k") solved by any attempt given k tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
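A rough sketch of the sampling idea: delete lint-clean chunks from a finished program back toward an empty file, then reverse the recorded states into a sequence of forward insertion diffs. The lints_clean placeholder and the uniform chunk sampling are simplifications, not LintSeq's actual procedure.

```python
# Rough sketch of a LintSeq-style decomposition: walk a finished program back
# to an empty file by repeatedly deleting a chunk of lines that keeps it
# lint-clean, then reverse the recorded states into forward insertion diffs.
import difflib, random

def lints_clean(lines):
    # Placeholder: run a real linter (e.g. pyflakes) on "\n".join(lines).
    return True

def lintseq(program_lines, max_tries=50):
    states, current = [list(program_lines)], list(program_lines)
    while current:
        for _ in range(max_tries):
            i = random.randrange(len(current))
            j = random.randint(i + 1, len(current))
            candidate = current[:i] + current[j:]
            if lints_clean(candidate):       # deletion keeps the program error-free
                current = candidate
                states.append(current)
                break
        else:
            break                            # no clean deletion found; stop early
    states.reverse()                         # now: (near-)empty file -> full program
    return ["".join(difflib.unified_diff(
                [l + "\n" for l in a], [l + "\n" for l in b]))
            for a, b in zip(states, states[1:])]   # consecutive program diffs
```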
https://arxiv.org/abs/2410.02749
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL .
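The key change relative to token-level PPO is that log-probabilities and advantages are aggregated over spans of tokens ("macro actions"). A minimal fixed-length-span sketch of the resulting clipped policy loss is shown below; MA-RLHF's actual segmentation strategies are richer, so treat this as an assumption-laden illustration.

```python
# Minimal sketch: treat fixed-size token spans as single "macro actions" in a
# PPO-style loss, so credit is assigned over far fewer, coarser steps.
# Fixed-length spans are an assumption; the paper explores other segmentations.
import torch

def macro_action_ppo_loss(new_logp, old_logp, advantages, span=5, clip=0.2):
    """new_logp, old_logp: (T,) per-token log-probs of the sampled response.
    advantages: one value per macro action, shape (ceil(T / span),)."""
    pad = (-new_logp.shape[0]) % span
    if pad:  # pad so the sequence splits evenly into macro actions
        new_logp = torch.cat([new_logp, new_logp.new_zeros(pad)])
        old_logp = torch.cat([old_logp, old_logp.new_zeros(pad)])
    # Log-prob of a macro action = sum of its tokens' log-probs.
    new_macro = new_logp.view(-1, span).sum(dim=1)
    old_macro = old_logp.view(-1, span).sum(dim=1)
    ratio = torch.exp(new_macro - old_macro)
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```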
https://arxiv.org/abs/2410.02743
Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04 $\times$, 1.54 $\times$, and 1.82 $\times$ across three different benchmarks, respectively. The resulting performance is competitive with, or surpasses, that of larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
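Since the expert trajectories come from a BFS planner rather than humans, supervision can be generated entirely automatically. A schematic sketch of turning BFS shortest paths into (observation, next-action) pairs is given below; the graph and observation abstractions are assumptions, not the paper's simulator interface.

```python
# Schematic sketch: derive imitation-learning supervision from BFS shortest
# paths on a navigability graph, with no human demonstrations. The `neighbors`,
# `observe`, and `action_between` callables are illustrative assumptions.
from collections import deque

def bfs_shortest_path(neighbors, start, goal):
    prev, frontier = {start: None}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:                      # reconstruct path back to start
            path = [node]
            while prev[path[-1]] is not None:
                path.append(prev[path[-1]])
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in prev:
                prev[nxt] = node
                frontier.append(nxt)
    return None

def make_training_pairs(neighbors, start, goal, observe, action_between):
    path = bfs_shortest_path(neighbors, start, goal) or []
    # Each step: (observations so far, expert next action along the shortest path).
    return [(observe(path[:i + 1]), action_between(path[i], path[i + 1]))
            for i in range(len(path) - 1)]
```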
https://arxiv.org/abs/2410.02730
Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
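Concretely, the self-evaluation reduces to reading the probability of one predefined token (e.g. "Yes") after an appended judgment prompt, and sampling further only while the model expects a restart to help. The sketch below makes those interfaces explicit; the prompt wording and the generate/prob_of_token helpers are assumptions.

```python
# Sketch of adaptive Best-of-N with generative self-evaluation: the policy
# itself predicts, via the probability of a single predefined token, whether
# restarting generation would yield a better response. Prompt wording and the
# `generate` / `prob_of_token` helpers are illustrative assumptions.

def adaptive_best_of_n(generate, prob_of_token, prompt, max_samples=16, threshold=0.3):
    scored = []
    for _ in range(max_samples):
        response = generate(prompt)
        judge = (prompt + response +
                 "\nWould restarting generation produce a better response? Yes or No:")
        p_restart = prob_of_token(judge, token="Yes")  # one extra token, no reward model
        scored.append((p_restart, response))
        if p_restart < threshold:     # model deems this sample good enough; stop early
            break
    return min(scored)[1]             # keep the sample it is least inclined to restart
```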
https://arxiv.org/abs/2410.02725
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and lack of knowledge attributions. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. The retrieval-augmented generation (RAG) process has recently emerged as a method capable of optimizing LLM responses by referencing them against a predetermined ontology. It was shown that using a Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
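A high-level sketch of how the structured (KG) and unstructured (VS) retrieval can be combined into an attributed prompt is shown below; the kg.subgraph_for and vector_store.search interfaces are assumptions, not SMART-SLIC's actual components.

```python
# High-level sketch of pairing a knowledge graph (structured facts) with a
# vector store (unstructured passages) for retrieval-augmented answering with
# source attribution. All retrieval interfaces are illustrative assumptions.

def answer_with_kg_and_vs(question, kg, vector_store, llm, k=5):
    triples = kg.subgraph_for(question)                  # structured facts (h, r, t)
    passages = vector_store.search(question, top_k=k)    # unstructured evidence
    context = "\n".join(
        [f"FACT: {h} --{r}--> {t}" for h, r, t in triples] +
        [f"SOURCE[{p.doc_id}]: {p.text}" for p in passages])
    prompt = ("Answer using only the facts and sources below, and cite the "
              "SOURCE ids you used.\n" + context +
              "\n\nQuestion: " + question + "\nAnswer:")
    return llm(prompt)
```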
https://arxiv.org/abs/2410.02721
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
https://arxiv.org/abs/2410.02712
As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and, for each action, the affected parties and the human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics, e.g., interpersonal relationships, the workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine which actions they would take and the values represented by those actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology, and philosophy: the World Value Survey, Moral Foundation Theory, Maslow's Hierarchy of Needs, Aristotle's Virtues, and the Plutchik Wheel of Emotion. We find that LLMs are most aligned with self-expression over survival values in terms of the World Value Survey, and with care over loyalty in Moral Foundation Theory. Interestingly, we find large preference differences across models for some core values such as truthfulness: e.g., the Mixtral-8x7B model tends to neglect it by 9.7%, while the GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec) and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.
https://arxiv.org/abs/2410.02683
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions, each verified by five independent annotators, span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to such differences in setup (e.g., a 27.3% difference for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., what utensils do the Chinese usually use?), revealing a tendency to converge on a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open-source models on questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
https://arxiv.org/abs/2410.02677
We present the first correct-by-construction learning-based system for step-by-step mathematical integration. The key idea is to learn a policy, represented by a GPT transformer model, which guides the search for the right mathematical integration rule, to be carried out by a symbolic solver. Concretely, we introduce a symbolic engine with axiomatically correct actions on mathematical expressions, as well as the first dataset for step-by-step integration. Our GPT-style transformer model, trained on this synthetic data, demonstrates strong generalization by surpassing its own data generator in accuracy and efficiency, using 50% fewer search steps. Our experimental results with SoTA LLMs also demonstrate that the standard approach of fine-tuning LLMs on a set of question-answer pairs is insufficient for solving this mathematical task. This motivates the importance of discovering creative methods for combining LLMs with symbolic reasoning engines, of which our work is an instance.
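The division of labor is: a learned policy proposes which integration rule to try, and a symbolic engine applies it, so every step stays axiomatically correct. A greedy sketch of that loop, under assumed policy/engine interfaces, follows.

```python
# Schematic (greedy) sketch of policy-guided step-by-step integration: a trained
# model ranks candidate rules, a symbolic engine applies them, and correctness
# is guaranteed by the engine rather than the model. The `policy` and `engine`
# interfaces are illustrative assumptions, not the paper's actual system.

def integrate_step_by_step(expr, policy, engine, max_steps=50):
    steps = []
    for _ in range(max_steps):
        if engine.is_solved(expr):                    # no integral signs left
            return expr, steps
        candidates = engine.applicable_rules(expr)    # e.g. u-substitution, by parts
        for rule in sorted(candidates, key=lambda r: policy.score(expr, r), reverse=True):
            new_expr = engine.apply(rule, expr)       # axiomatically correct rewrite
            if new_expr is not None:
                steps.append((rule, new_expr))
                expr = new_expr
                break
        else:
            return None, steps                        # dead end: no rule applied
    return None, steps                                # step budget exhausted
```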
https://arxiv.org/abs/2410.02666
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning that humans have. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.
https://arxiv.org/abs/2410.02664
LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.
https://arxiv.org/abs/2410.02653
Cellular automata have become a cornerstone for investigating emergence and self-organization across diverse scientific disciplines, spanning neuroscience, artificial life, and theoretical physics. However, the absence of a hardware-accelerated cellular automata library limits the exploration of new research directions, hinders collaboration, and impedes reproducibility. In this work, we introduce CAX (Cellular Automata Accelerated in JAX), a high-performance and flexible open-source library designed to accelerate cellular automata research. CAX offers cutting-edge performance and a modular design through a user-friendly interface, and can support both discrete and continuous cellular automata with any number of dimensions. We demonstrate CAX's performance and flexibility through a wide range of benchmarks and applications. From classic models like elementary cellular automata and Conway's Game of Life to advanced applications such as growing neural cellular automata and self-classifying MNIST digits, CAX runs simulations up to 2,000 times faster. Furthermore, we demonstrate CAX's potential to accelerate research by presenting a collection of three novel cellular automata experiments, each implemented in just a few lines of code thanks to the library's modular architecture. Notably, we show that a simple one-dimensional cellular automaton can outperform GPT-4 on the 1D-ARC challenge.
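For reference, the kind of model CAX accelerates is simple to state: the sketch below performs one step of an elementary (1D, binary, radius-1) cellular automaton in plain NumPy. CAX itself implements this and far richer discrete and continuous automata on accelerators via JAX; this snippet is not its API.

```python
# One update step of an elementary cellular automaton (Wolfram rule encoding),
# written in plain NumPy purely as a reference illustration of the model class.
import numpy as np

def eca_step(state, rule_number=110):
    # Bit i of the rule number gives the next state for neighborhood value i.
    rule = np.array([(rule_number >> i) & 1 for i in range(8)], dtype=np.uint8)
    left, right = np.roll(state, 1), np.roll(state, -1)   # periodic boundary
    neighborhood = 4 * left + 2 * state + right            # value in 0..7
    return rule[neighborhood]

state = np.zeros(64, dtype=np.uint8)
state[32] = 1                      # single live cell in the middle
for _ in range(32):
    state = eca_step(state)
```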
https://arxiv.org/abs/2410.02651
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is specially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
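The two forward passes in ICR can be summarized as: score each document by the attention mass the query directs at it, then subtract the scores obtained with a content-free query to calibrate away positional and document biases. A conceptual sketch, with the attention aggregation hidden behind an assumed helper, is below.

```python
# Conceptual sketch of in-context re-ranking (ICR): rank documents by the
# attention the query tokens pay to each document in one forward pass, and
# calibrate with a second pass using a content-free query. The
# `attention_to_docs` helper (one aggregated score per document) is an
# illustrative assumption about how the attention maps are pooled.

def icr_rerank(docs, query, attention_to_docs, content_free_query="N/A"):
    scores = attention_to_docs(docs, query)                  # forward pass 1
    baseline = attention_to_docs(docs, content_free_query)   # forward pass 2
    calibrated = [s - b for s, b in zip(scores, baseline)]   # remove intrinsic bias
    order = sorted(range(len(docs)), key=lambda i: calibrated[i], reverse=True)
    return [docs[i] for i in order]
```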
https://arxiv.org/abs/2410.02642
While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The overall advantage of plot performance over text performance (up to a 120% performance increase on zero-shot synthetic tasks, and up to a 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach's potential for making the best use of the native capabilities of foundation models.
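The method itself is lightweight: render the series as a plot image and pass that image, rather than the raw numbers, to the multimodal model. A minimal sketch is below; the ask_multimodal client call is a hypothetical placeholder, not a specific vendor API.

```python
# Minimal sketch: plot a time series to a PNG so a multimodal model's vision
# encoder can "see" it instead of reading raw numbers as text. The
# `ask_multimodal` call is a hypothetical placeholder for any vision-language API.
import io
import matplotlib
matplotlib.use("Agg")               # headless rendering
import matplotlib.pyplot as plt

def series_to_png(values, title="sensor signal"):
    fig, ax = plt.subplots(figsize=(4, 2.5), dpi=150)
    ax.plot(values)
    ax.set_title(title)
    ax.set_xlabel("time step")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

def classify_window(values, question, ask_multimodal):
    # e.g. question = "Did the wearer fall during this window? Answer yes or no."
    return ask_multimodal(image=series_to_png(values), text=question)
```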
https://arxiv.org/abs/2410.02637
Recently introduced dialogue systems have demonstrated high usability. However, they still fall short of reflecting real-world conversation scenarios. Current dialogue systems exhibit an inability to replicate the dynamic, continuous, long-term interactions involving multiple partners. This shortfall arises because there have been limited efforts to account for both aspects of real-world dialogues: deeply layered interactions over long-term dialogue and widely expanded conversation networks involving multiple participants. Combining efforts to incorporate both aspects, we introduce Mixed-Session Conversation, a dialogue system designed to construct conversations with various partners in a multi-session dialogue setup. We propose a new dataset called MiSC to implement this system. The dialogue episodes of MiSC consist of 6 consecutive sessions, with four speakers (one main speaker and three partners) appearing in each episode. Also, we propose a new dialogue model with a novel memory management mechanism, called Egocentric Memory Enhanced Mixed-Session Conversation Agent (EMMA). EMMA collects and retains memories from the main speaker's perspective during conversations with partners, enabling seamless continuity in subsequent interactions. Extensive human evaluations validate that the dialogues in MiSC demonstrate a seamless conversational flow, even when conversation partners change in each session. EMMA trained with MiSC is also shown to maintain high memorability without contradiction throughout the entire conversation.
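In spirit, the egocentric memory mechanism amounts to distilling each session into memories anchored to the main speaker and retrieving the most relevant ones when a new session (possibly with a different partner) starts. The sketch below makes that explicit; the summarize and relevance interfaces are assumptions rather than EMMA's actual modules.

```python
# Rough sketch of an egocentric memory store: memories are written from the
# main speaker's perspective after each session and retrieved by relevance at
# the start of the next one, whichever partner appears. The `summarize` and
# `relevance` callables are illustrative assumptions.

class EgocentricMemory:
    def __init__(self, summarize, relevance):
        self.summarize = summarize     # session transcript -> list of memory strings
        self.relevance = relevance     # (memory, new-session context) -> float
        self.memories = []

    def update(self, session_transcript):
        # Only what the main speaker said or observed is distilled into memory.
        self.memories.extend(self.summarize(session_transcript))

    def retrieve(self, new_session_context, top_k=5):
        ranked = sorted(self.memories,
                        key=lambda m: self.relevance(m, new_session_context),
                        reverse=True)
        return ranked[:top_k]
```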
https://arxiv.org/abs/2410.02503
Knowledge claims are abundant in the literature on large language models (LLMs); but can we say that GPT-4 truly "knows" the Earth is round? To address this question, we review standard definitions of knowledge in epistemology and we formalize interpretations applicable to LLMs. In doing so, we identify inconsistencies and gaps in how current NLP research conceptualizes knowledge with respect to epistemological frameworks. Additionally, we conduct a survey of 100 professional philosophers and computer scientists to compare their preferences in knowledge definitions and their views on whether LLMs can really be said to know. Finally, we suggest evaluation protocols for testing knowledge in accordance to the most relevant definitions.
https://arxiv.org/abs/2410.02499