We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g., 3D Gaussian splatting) allow one to perform flash/no-flash reflection separation using unpaired measurements -- this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate that our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin. Our project webpage is at this https URL.
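As a rough illustration of the flash/no-flash cue this builds on, here is a minimal sketch (not Flash-Splat's actual pipeline, which operates on 3D Gaussians): assuming the flash mainly brightens the transmitted scene while leaving the reflected scene roughly unchanged, the flash/no-flash difference isolates the transmitted component. All quantities below are synthetic.

```python
import numpy as np

# Minimal sketch of the flash/no-flash cue (illustrative only, not Flash-Splat itself).
# Assumption: the flash mainly brightens the transmitted scene T, while the
# reflected scene R is nearly unchanged, so the flash/no-flash difference
# isolates (a scaled copy of) T.
rng = np.random.default_rng(0)
T = rng.uniform(0.0, 1.0, size=(4, 4))   # transmitted radiance (hypothetical)
R = rng.uniform(0.0, 0.5, size=(4, 4))   # reflected radiance (hypothetical)
flash_gain = 1.8                          # extra illumination on T from the flash

no_flash = T + R
flash = flash_gain * T + R

transmission_cue = (flash - no_flash) / (flash_gain - 1.0)  # recovers T in this toy model
reflection_cue = no_flash - transmission_cue                # and hence R

assert np.allclose(transmission_cue, T)
assert np.allclose(reflection_cue, R)
```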
https://arxiv.org/abs/2410.02764
There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, mostly at random-chance level. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
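For context, counterfactual text/video/group scores of this kind are typically computed from a model's video-caption similarity matrix over each pair of swapped videos and captions. The scoring rule and the `Pair` structure below are assumptions for illustration, not Vinoground's released evaluation code.

```python
from dataclasses import dataclass

# Hypothetical sketch of counterfactual pair scoring (Winoground-style); the exact
# rule Vinoground uses is assumed here, not taken from the paper.
@dataclass
class Pair:
    # s[i][j] = model similarity between video i and caption j for one
    # counterfactual pair (two videos, two captions with the action order swapped).
    s: list

def text_score(p: Pair) -> bool:
    # each video must prefer its own caption over the swapped one
    return p.s[0][0] > p.s[0][1] and p.s[1][1] > p.s[1][0]

def video_score(p: Pair) -> bool:
    # each caption must prefer its own video over the other one
    return p.s[0][0] > p.s[1][0] and p.s[1][1] > p.s[0][1]

def group_score(p: Pair) -> bool:
    return text_score(p) and video_score(p)

pairs = [Pair(s=[[0.9, 0.4], [0.3, 0.8]]), Pair(s=[[0.5, 0.6], [0.7, 0.2]])]
for name, fn in [("text", text_score), ("video", video_score), ("group", group_score)]:
    acc = sum(fn(p) for p in pairs) / len(pairs)
    print(f"{name} score: {acc:.2f}")
```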
https://arxiv.org/abs/2410.02763
We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
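A minimal sketch of the linear orthogonalization step described above, assuming the image features and a hallucinated-object direction are available as plain tensors (the function name, shapes, and feature source are illustrative):

```python
import torch

# Minimal sketch of linearly orthogonalizing image features against a
# hallucinated-object direction (names and shapes are illustrative assumptions).
def erase_direction(image_feats: torch.Tensor, halluc_feat: torch.Tensor) -> torch.Tensor:
    """Remove the component of each image feature that lies along halluc_feat."""
    u = halluc_feat / halluc_feat.norm()          # unit vector for the hallucinated object
    coeff = image_feats @ u                        # projection coefficients, shape (N,)
    return image_feats - coeff.unsqueeze(-1) * u   # subtract the projection

feats = torch.randn(5, 768)          # e.g. patch/image embeddings from the VLM
halluc = torch.randn(768)            # embedding direction of the hallucinated object
edited = erase_direction(feats, halluc)

u_hat = halluc / halluc.norm()
print(f"max residual along erased direction: {(edited @ u_hat).abs().max():.2e}")
```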
https://arxiv.org/abs/2410.02762
The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: (1) black-box nature with unknown detection principle, (2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
https://arxiv.org/abs/2410.02761
Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at this https URL
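The "targeted low-rank updates" can be pictured with a small sketch: a frozen weight matrix augmented with a trainable low-rank correction A·B. The rank, the edited layer, and the training objective are assumptions for illustration, not ELM's published configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a targeted low-rank update to a single weight matrix,
# in the spirit of ELM's erasure edits (rank, layer choice, and objective
# are illustrative assumptions).
class LowRankEdit(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen original weights
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_edited = W + A @ B, applied without materializing the full update
        return self.base(x) + x @ self.B.T @ self.A.T

layer = LowRankEdit(nn.Linear(1024, 1024))
x = torch.randn(2, 1024)
print(layer(x).shape)  # torch.Size([2, 1024])
```

Only A and B are trained, so the edit is cheap to store and can be optimized to change outputs on erased-concept prompts while leaving the frozen base behavior elsewhere largely intact.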
https://arxiv.org/abs/2410.02760
It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: this https URL.
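One simple way to picture per-frame loss re-weighting is sketched below; the specific scheme Loong uses is not given in the abstract, so the normalization here is purely illustrative.

```python
import torch

# Illustrative sketch of re-weighting per-frame token losses so that no subset of
# frames dominates the long-video objective; the weights below are an assumption,
# not Loong's actual scheme.
def reweighted_loss(per_token_loss: torch.Tensor, frame_ids: torch.Tensor) -> torch.Tensor:
    """per_token_loss: (T,) cross-entropy per video token; frame_ids: (T,) frame index."""
    total = torch.zeros(())
    frames = frame_ids.unique()
    for f in frames:
        mask = frame_ids == f
        frame_loss = per_token_loss[mask].mean()
        # normalize each frame's contribution by its own (detached) magnitude so that
        # frames with systematically larger losses do not dominate the gradient
        weight = 1.0 / (frame_loss.detach() + 1e-6)
        total = total + weight * frame_loss
    return total / len(frames)

loss = torch.rand(1024, requires_grad=True)
frames = torch.arange(1024) // 256          # 4 frames of 256 tokens each
print(reweighted_loss(loss, frames))
```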
https://arxiv.org/abs/2410.02757
We present CorPipe 24, the winning entry to the CRAC 2024 Shared Task on Multilingual Coreference Resolution. In this third iteration of the shared task, a novel objective is to also predict empty nodes needed for zero coreference mentions (while the empty nodes were given on input in previous years). This way, coreference resolution can be performed on raw text. We evaluate two model variants: a two-stage approach (where the empty nodes are predicted first using a pretrained encoder model and then processed together with sentence words by another pretrained model) and a single-stage approach (where a single pretrained encoder model generates empty nodes, coreference mentions, and coreference links jointly). In both settings, CorPipe surpasses other participants by a large margin of 3.9 and 2.8 percentage points, respectively. The source code and the trained model are available at this https URL .
https://arxiv.org/abs/2410.02756
Creating specialized large language models requires vast amounts of clean, special purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, creation of new datasets is required in most applications. This requires the development of new application-specific filtering of web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but this is extremely expensive at web-scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. To further validate SIEVE, experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
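Schematically, the active-learning distillation loop looks like the sketch below, where every function body is a stand-in; SIEVE's actual selection rule, thresholds, and T5 training details are not specified here.

```python
import random

# Schematic sketch of a GPT-4o -> lightweight-model distillation loop with
# uncertainty-based active learning. All function bodies are placeholders.
def expensive_filter_label(text: str) -> int:
    return int("research" in text)            # stand-in for a GPT-4o keep/discard call

def light_model_predict(text: str) -> float:
    return random.random()                     # stand-in for the T5 model's keep probability

def finetune_light_model(labeled: list[tuple[str, int]]) -> None:
    pass                                       # stand-in for a T5 fine-tuning step

def active_learning_round(pool: list[str], budget: int) -> list[tuple[str, int]]:
    # query the examples the light model is least certain about (prob closest to 0.5)
    ranked = sorted(pool, key=lambda t: abs(light_model_predict(t) - 0.5))
    return [(t, expensive_filter_label(t)) for t in ranked[:budget]]

pool = [f"document {i} about research topic {i % 7}" for i in range(100)]
labeled: list[tuple[str, int]] = []
for _ in range(3):                             # a few rounds, each with a tiny GPT-4o budget
    labeled += active_learning_round(pool, budget=5)
    finetune_light_model(labeled)
print(f"expensive calls used: {len(labeled)} for a pool of {len(pool)} documents")
```

The point of the loop is that the expensive model is only queried where the cheap model is uncertain, which is how the per-call cost ratio (hundreds of cheap filtering operations per GPT-4o call) becomes possible.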
https://arxiv.org/abs/2410.02755
Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
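A toy sketch of turning a sequence of intermediate program states into an edit sequence of unified diffs; LintSeq additionally uses a linter to sample only error-free insertion orders, which is omitted here.

```python
import difflib

# Illustrative sketch: convert a sequence of intermediate program states into a
# sequence of textual diffs. The linter-guided sampling of insertion order used by
# LintSeq is not shown.
states = [
    "",
    "def add(a, b):\n",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\nprint(add(2, 3))\n",
]

edit_sequence = []
for before, after in zip(states, states[1:]):
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="before", tofile="after",
        )
    )
    edit_sequence.append(diff)

print(f"{len(edit_sequence)} edits")
print(edit_sequence[-1])   # the final insertion, rendered as a unified diff
```

Concatenating such diffs gives the "instruction + program-diff-sequence" targets the abstract describes, with each diff standing in for one incremental edit a model can be trained to emit.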
https://arxiv.org/abs/2410.02749
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (CriSPO), a lightweight model that can be finetuned to extract salient keyphrases. By using CriSPO, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
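A minimal sketch of keyphrase-augmented prompting; the template below and the way keyphrases are injected are assumptions for illustration, not the paper's exact prompt.

```python
# Minimal sketch of adding extracted keyphrases to a summarization prompt; the
# template and the upstream keyphrase extractor are assumptions, not the paper's.
def build_prompt(document: str, keyphrases: list[str]) -> str:
    hints = "; ".join(keyphrases)
    return (
        "Summarize the article below in 3 sentences.\n"
        f"Make sure the summary covers: {hints}.\n\n"
        f"Article:\n{document}\n\nSummary:"
    )

doc = "The city council approved a new transit plan funding 40 km of bus lanes..."
print(build_prompt(doc, ["transit plan", "bus lanes", "council vote"]))
```

Varying how many keyphrases are passed into the template is the knob that trades recall against precision, as noted above.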
https://arxiv.org/abs/2410.02748
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
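The region-text contrastive loss can be sketched as a standard symmetric InfoNCE over paired region and caption embeddings; the pooling, shapes, and temperature below are illustrative assumptions rather than CLOC's exact design.

```python
import torch
import torch.nn.functional as F

# Sketch of a region-text contrastive (InfoNCE) loss: region embeddings derived from
# image features are matched to their paired region captions.
def region_text_contrastive(region_emb: torch.Tensor, text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    region_emb = F.normalize(region_emb, dim=-1)   # (N, D) one embedding per region
    text_emb = F.normalize(text_emb, dim=-1)       # (N, D) paired region captions
    logits = region_emb @ text_emb.T / temperature
    targets = torch.arange(region_emb.size(0))
    # symmetric loss over region->text and text->region matching
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

regions = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(region_text_contrastive(regions, captions))
```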
https://arxiv.org/abs/2410.02746
We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, like adding a language for which the original model has seen no or little training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity and degrade the performance in the original domain. Our paper analyzes this extension problem under three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks in a way that leads each new residual block to output near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we get results that are significantly better than concurrent approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
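A small sketch of what such a "neutral residue" block might look like: an added residual branch whose gate learns to stay near zero on original-domain inputs, loosely borrowing mixture-of-experts-style gating. The sizes and the gating form are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Sketch of a "neutral residue": an added residual branch whose output is gated so
# that it stays near zero on original-domain inputs.
class NeutralResidue(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 128):
        super().__init__()
        self.adapter = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                     nn.Linear(d_hidden, d_model))
        self.gate = nn.Linear(d_model, 1)   # learns to open only for new-domain tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))     # ~0 on the original domain, ~1 on the new one
        return x + g * self.adapter(x)      # output stays close to x when the gate is closed

block = NeutralResidue()
x = torch.randn(4, 10, 512)
print(block(x).shape)   # torch.Size([4, 10, 512])
```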
https://arxiv.org/abs/2410.02744
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at this https URL .
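The macro-action idea can be sketched by summing token log-probabilities over fixed-size spans and assigning one advantage per span; the span size and the advantage estimator below are illustrative simplifications (MA-RLHF also considers higher-level linguistic units).

```python
import torch

# Sketch of treating fixed-size spans of tokens as macro actions: token log-probs are
# summed within each span, and a single advantage is assigned per span.
def macro_action_pg_loss(token_logprobs: torch.Tensor, advantages: torch.Tensor,
                         span: int = 4) -> torch.Tensor:
    T = token_logprobs.size(0) - token_logprobs.size(0) % span
    macro_logprobs = token_logprobs[:T].view(-1, span).sum(dim=1)   # one log-prob per macro action
    macro_adv = advantages[:T].view(-1, span).mean(dim=1)           # one advantage per macro action
    return -(macro_logprobs * macro_adv.detach()).mean()            # REINFORCE-style objective

logprobs = torch.randn(20, requires_grad=True)
advs = torch.randn(20)
loss = macro_action_pg_loss(logprobs, advs)
loss.backward()
print(loss.item())
```

Because credit is assigned over far fewer decision points, the temporal distance between an action and its reward shrinks, which is the intuition behind the faster convergence reported above.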
https://arxiv.org/abs/2410.02743
Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The resulting performance competes with or surpasses that of larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the need to train models for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on hallucination is not universally positive across LLMs. To conduct this analysis, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and open-weight and proprietary LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
https://arxiv.org/abs/2410.02741
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.
https://arxiv.org/abs/2410.02740
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
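As an illustration of principle-guided modification, the sketch below quantifies one such bias (position bias) by swapping the order of two candidate answers and measuring how often the verdict fails to flip. The judge call is a random stub, and CALM's actual protocol, which covers 12 bias types, is not reproduced here.

```python
import random

# Sketch of quantifying one judge bias (position bias) via a controlled modification:
# swap the order of the two candidate answers and check whether the verdict flips.
def judge(question: str, answer_a: str, answer_b: str) -> str:
    return random.choice(["A", "B"])            # stand-in for an LLM-as-a-Judge call

def position_bias_rate(examples: list[tuple[str, str, str]]) -> float:
    inconsistent = 0
    for q, a, b in examples:
        original = judge(q, a, b)
        swapped = judge(q, b, a)                 # same content, positions exchanged
        # a position-consistent judge should reverse its label when the answers swap
        inconsistent += (original == swapped)
    return inconsistent / len(examples)

data = [("Q" + str(i), "answer one", "answer two") for i in range(200)]
print(f"position-bias (inconsistency) rate: {position_bias_rate(data):.2f}")
```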
https://arxiv.org/abs/2410.02736
Navigating complex environments requires Unmanned Aerial Vehicles (UAVs) and autonomous systems to perform trajectory tracking and obstacle avoidance in real-time. While many control strategies have effectively utilized linear approximations, addressing the non-linear dynamics of UAVs, especially in obstacle-dense environments, remains a key challenge that requires further research. This paper introduces a Non-linear Model Predictive Control (NMPC) framework for the DJI Matrice 100, addressing these challenges by using a dynamic model and B-spline interpolation for smooth reference trajectories, ensuring minimal deviation while respecting safety constraints. The framework supports various trajectory types and employs a penalty-based cost function for control accuracy in tight maneuvers. The framework utilizes CasADi for efficient real-time optimization, enabling the UAV to maintain robust operation even under tight computational constraints. Simulation and real-world indoor and outdoor experiments demonstrated the NMPC framework's ability to adapt to disturbances, resulting in smooth, collision-free navigation.
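A minimal CasADi sketch of the receding-horizon setup described above, using a 2D double integrator in place of the Matrice 100 dynamics and a straight-line reference in place of a B-spline; the penalty weights, horizon, and limits are illustrative assumptions.

```python
import casadi as ca
import numpy as np

# Minimal NMPC skeleton with CasADi's Opti stack (illustrative stand-in for the
# paper's full quadrotor model and B-spline reference).
N, dt = 20, 0.1
opti = ca.Opti()
X = opti.variable(4, N + 1)          # state: [px, py, vx, vy]
U = opti.variable(2, N)              # control: [ax, ay]
x0 = opti.parameter(4)
ref = opti.parameter(2, N + 1)       # reference positions (e.g. sampled from a spline)
obstacle, r_safe = np.array([1.0, 1.0]), 0.4

cost = 0
for k in range(N):
    # simple Euler-discretized double-integrator dynamics
    x_next = X[:, k] + dt * ca.vertcat(X[2, k], X[3, k], U[0, k], U[1, k])
    opti.subject_to(X[:, k + 1] == x_next)
    cost += ca.sumsqr(X[:2, k] - ref[:, k]) + 0.01 * ca.sumsqr(U[:, k])
    # penalty-based obstacle term instead of a hard constraint
    dist2 = ca.sumsqr(X[:2, k] - obstacle)
    cost += 50 * ca.exp(-dist2 / r_safe**2)
opti.subject_to(X[:, 0] == x0)
opti.subject_to(opti.bounded(-3, U, 3))   # actuator limits
opti.minimize(cost)
opti.solver("ipopt")

opti.set_value(x0, [0, 0, 0, 0])
opti.set_value(ref, np.linspace([0, 0], [2, 2], N + 1).T)
sol = opti.solve()
print(sol.value(U)[:, 0])                 # first control to apply (receding horizon)
```

In a receding-horizon loop, only the first control is applied, the state is re-measured, and the problem is re-solved at the next step.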
https://arxiv.org/abs/2410.02732
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
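A toy sketch of deriving expert action sequences from BFS shortest paths for use as imitation targets; the actual planner operates on the simulator's 3D navigation graph rather than this 2D grid, and the grid, moves, and coordinates below are hypothetical.

```python
from collections import deque

# Toy sketch of building imitation targets from BFS shortest paths on a grid world.
GRID = ["....#", ".##.#", "....#", ".#..."]   # '#' = blocked cell
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def bfs_actions(start: tuple[int, int], goal: tuple[int, int]) -> list[str]:
    rows, cols = len(GRID), len(GRID[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for name, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and GRID[nxt[0]][nxt[1]] != "#" and nxt not in parent):
                parent[nxt] = (cell, name)
                queue.append(nxt)
    actions, cell = [], goal
    while parent[cell] is not None:
        cell, name = parent[cell]
        actions.append(name)
    return actions[::-1]

print(bfs_actions((0, 0), (3, 4)))   # expert action sequence used as an imitation target
```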
https://arxiv.org/abs/2410.02730
Information Retrieval (IR) methods aim to identify relevant documents in response to a given query, which have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
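The two ideas, merging passage-level embeddings into a single document representation for retrieval and then reranking passages within the retrieved document, can be sketched as follows; mean-pooling and dot-product scoring are simplifying assumptions, not the paper's exact design.

```python
import numpy as np

# Sketch of (1) merging passage embeddings into one document embedding for retrieval
# and (2) reranking passages inside the retrieved document.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(n_passages, 256)) for n_passages in (3, 5, 2)]   # passage embeddings
query = rng.normal(size=256)

doc_embs = np.stack([p.mean(axis=0) for p in docs])        # one vector per document
best_doc = int(np.argmax(doc_embs @ query))                # retrieve with the merged embedding

passage_scores = docs[best_doc] @ query                    # rerank passages within that document
best_passage = int(np.argmax(passage_scores))
print(f"retrieved doc {best_doc}, most relevant passage {best_passage}")
```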
https://arxiv.org/abs/2410.02729