Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
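As a rough illustration of the generation procedure described above, the sketch below runs one reverse-diffusion step over a set of stroke control points with a small transformer decoder. It is a minimal sketch, not SwiftSketch itself: the stroke count, control-point layout, image conditioning, timestep embedding, and noise-schedule constants are all assumed for illustration.

```python
import torch
import torch.nn as nn

N_STROKES, N_CTRL, DIM = 32, 4, 2   # assumed: 32 strokes, 4 control points each, (x, y) coordinates
D_MODEL, T_STEPS = 256, 50

def timestep_embedding(t, dim):
    # Standard sinusoidal embedding of the diffusion timestep.
    freqs = torch.exp(-torch.linspace(0, 6, dim // 2))
    ang = t.float().unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=1)

class StrokeDenoiser(nn.Module):
    """Predicts the noise added to flattened stroke control points, conditioned on an image embedding."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(N_CTRL * DIM, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out_proj = nn.Linear(D_MODEL, N_CTRL * DIM)

    def forward(self, noisy_points, image_emb, t):
        # noisy_points: (B, N_STROKES, N_CTRL * DIM); image_emb: (B, 1, D_MODEL)
        h = self.in_proj(noisy_points) + timestep_embedding(t, D_MODEL).unsqueeze(1)
        h = self.decoder(tgt=h, memory=image_emb)   # strokes attend to each other and to the image
        return self.out_proj(h)

# One DDPM-style reverse step, starting from Gaussian control points.
model = StrokeDenoiser()
x_t = torch.randn(1, N_STROKES, N_CTRL * DIM)
image_emb = torch.randn(1, 1, D_MODEL)            # stand-in for a learned image encoding
t = torch.tensor([T_STEPS - 1])
alpha, alpha_bar = 0.99, 0.5                      # placeholder noise-schedule values
eps_hat = model(x_t, image_emb, t)
x_prev = (x_t - (1 - alpha) / (1 - alpha_bar) ** 0.5 * eps_hat) / alpha ** 0.5
print(x_prev.shape)                               # torch.Size([1, 32, 8])
```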
https://arxiv.org/abs/2502.08642
The evaluation of cross-lingual semantic search capabilities of models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. To allow for domain-specific evaluation, we introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual semantic search task that requires only a set of parallel sentence pairs in the language pair of interest within the target domain. This task focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than hard negatives generated by a large language model. We create four instances of the CLSD task for the German-French language pair within the news domain. In this case study, we find that models that are also fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as the pivot language, while bitext mining models such as LaBSE perform best directly cross-lingually. We also present a fine-grained similarity analysis enabled by our distractor generation strategy, indicating that different embedding models are sensitive to different types of perturbations.
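To make the task setup concrete, here is a minimal sketch of the CLSD evaluation protocol: given a German source sentence, its true French translation, and LLM-generated French hard negatives, a model is credited when it ranks the true parallel first by embedding similarity. The embedder choice and the example sentences are illustrative assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed choice of embedder

def clsd_accuracy(examples):
    """examples: list of (source, true_parallel, [distractors]) tuples."""
    correct = 0
    for source, target, distractors in examples:
        candidates = [target] + distractors
        src_emb = model.encode([source], normalize_embeddings=True)
        cand_emb = model.encode(candidates, normalize_embeddings=True)
        scores = util.cos_sim(src_emb, cand_emb)[0]       # similarity of source to each candidate
        correct += int(scores.argmax().item() == 0)       # index 0 is the true parallel
    return correct / len(examples)

examples = [(
    "Die Regierung kündigte neue Klimaziele an.",
    "Le gouvernement a annoncé de nouveaux objectifs climatiques.",
    ["Le gouvernement a abandonné ses objectifs climatiques.",    # hard negative: polarity flip
     "Le gouvernement a annoncé de nouveaux objectifs fiscaux."],  # hard negative: entity swap
)]
print(f"CLSD accuracy: {clsd_accuracy(examples):.2f}")
```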
https://arxiv.org/abs/2502.08638
Query expansion is widely used in Information Retrieval (IR) to improve search outcomes by enriching queries with additional contextual information. Although recent Large Language Model (LLM) based methods generate pseudo-relevant content and expanded terms via multiple prompts, they often yield repetitive, narrow expansions that lack the diverse context needed to retrieve all relevant information. In this paper, we introduce QA-Expand, a novel and effective framework for query expansion. It first generates multiple relevant questions from the initial query and subsequently produces corresponding pseudo-answers as surrogate documents. A feedback model further rewrites and filters these answers to ensure only the most informative augmentations are incorporated. Extensive experiments on benchmarks such as BEIR and TREC demonstrate that QA-Expand enhances retrieval performance by up to 13% over state-of-the-art methods, offering a robust solution for modern retrieval challenges.
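The overall flow can be sketched in a few lines. This is only a schematic reading of the abstract, with invented prompts; `call_llm` is a hypothetical stand-in for whatever chat-completion client is used.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def qa_expand(query: str, n_questions: int = 4) -> str:
    questions = call_llm(
        f"Generate {n_questions} diverse questions a searcher issuing the query "
        f"'{query}' might also want answered. One per line."
    ).splitlines()

    pseudo_answers = [
        call_llm(f"Answer concisely, as if writing a short reference document:\n{q}")
        for q in questions if q.strip()
    ]

    kept = []
    for ans in pseudo_answers:                  # feedback model rewrites or discards each answer
        verdict = call_llm(
            f"Query: {query}\nCandidate answer: {ans}\n"
            "If this answer adds useful, non-redundant context for retrieval, rewrite it "
            "more informatively; otherwise reply exactly DISCARD."
        )
        if verdict.strip() != "DISCARD":
            kept.append(verdict)

    return query + " " + " ".join(kept)         # expanded query fed to the retriever
```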
https://arxiv.org/abs/2502.08557
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate whether LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to stronger performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are rated as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
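The core manipulation is easy to picture as a prompt-construction routine: each in-context exemplar shows a question, a wrong answer, and the correct answer, with no rationale attached. The exemplars and formatting below are invented for illustration.

```python
def build_prompt(exemplars, question):
    lines = []
    for ex in exemplars:
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Incorrect answer: {ex['incorrect']}")
        lines.append(f"Correct answer: {ex['correct']}")
        lines.append("")                       # note: no rationale is provided
    lines.append(f"Question: {question}")
    lines.append("Correct answer:")
    return "\n".join(lines)

exemplars = [
    {"question": "What is 17 * 6?", "incorrect": "96", "correct": "102"},
    {"question": "A train travels 60 km in 45 minutes. What is its speed in km/h?",
     "incorrect": "45", "correct": "80"},
]
print(build_prompt(exemplars, "What is 23 * 7?"))
```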
https://arxiv.org/abs/2502.08550
Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) scores while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at this https URL
https://arxiv.org/abs/2502.08544
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
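A much-simplified sketch of the mixing step is given below: a linear head predicts continuous concept activations (during training these would be supervised against a pretrained sparse autoencoder, which is omitted here), the predicted concepts are embedded back into model space, and the resulting concept state is interleaved with the token hidden states. Dimensions and the interleaving granularity are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

D_MODEL, N_CONCEPTS = 512, 4096

class ConceptMixer(nn.Module):
    def __init__(self):
        super().__init__()
        self.concept_head = nn.Linear(D_MODEL, N_CONCEPTS)   # predicts SAE concept activations
        self.concept_embed = nn.Linear(N_CONCEPTS, D_MODEL)  # maps concepts back to model space

    def forward(self, hidden):                 # hidden: (B, T, D_MODEL)
        concepts = self.concept_head(hidden)   # (B, T, N_CONCEPTS), trained against SAE targets
        concept_vec = self.concept_embed(concepts.relu())    # continuous concept "token"
        # Interleave: after every token state, insert its concept state (doubling the length).
        B, T, D = hidden.shape
        mixed = torch.stack([hidden, concept_vec], dim=2).reshape(B, 2 * T, D)
        return mixed, concepts

mixer = ConceptMixer()
h = torch.randn(2, 8, D_MODEL)
mixed, predicted_concepts = mixer(h)
print(mixed.shape)   # torch.Size([2, 16, 512])
```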
https://arxiv.org/abs/2502.08524
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary cannot always be cleanly judged as either faithful or unfaithful to the source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach helps identify ambiguities and performs even more strongly on non-ambiguous summaries.
https://arxiv.org/abs/2502.08514
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Code is available at: this https URL.
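One plausible instantiation of the classification view is sketched below: each sample is treated as its own class, every sample is "classified" against all samples via a softmax over pairwise similarities, and the self-classification probabilities are summed, so near-duplicates spread probability mass and lower the score. The embedding model, temperature, and kernel are assumptions, not necessarily DCScore's exact formulation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dcscore_like(texts, model_name="all-MiniLM-L6-v2", tau=0.05):
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)      # (n, d), unit norm
    sim = emb @ emb.T                                          # cosine similarity kernel
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.trace()                                       # sum of self-classification probabilities

diverse = ["The cat sat on the mat.", "Quarterly revenue grew by 8%.", "Photosynthesis occurs in chloroplasts."]
redundant = ["The cat sat on the mat.", "A cat was sitting on the mat.", "The cat is sitting on a mat."]
print(dcscore_like(diverse), ">", dcscore_like(redundant))
```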
https://arxiv.org/abs/2502.08512
Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growth of large language models (LLMs), direct text generation has gradually become the focus of GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques, without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting examples.
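A sketch of the retrieval step is below: the test sentence's errors are first explained in natural language, and demonstrations are retrieved by embedding similarity between that explanation and the pre-computed explanations of database samples. `explain_errors` is a hypothetical LLM wrapper, and the embedder choice is an assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def explain_errors(sentence: str) -> str:
    # Stand-in for an LLM call such as:
    # "List and explain the grammatical errors in the following sentence: ..."
    raise NotImplementedError

def retrieve_demonstrations(test_sentence, database, k=4):
    """database: list of dicts with 'source', 'corrected', and a precomputed 'gee' field."""
    test_gee = explain_errors(test_sentence)
    test_emb = embedder.encode([test_gee], normalize_embeddings=True)
    db_emb = embedder.encode([d["gee"] for d in database], normalize_embeddings=True)
    scores = (test_emb @ db_emb.T)[0]               # cosine similarity over explanations
    top = np.argsort(-scores)[:k]
    return [(database[i]["source"], database[i]["corrected"]) for i in top]
```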
https://arxiv.org/abs/2502.08507
This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks and on key aspects related to bias and safety. In this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition to that, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
https://arxiv.org/abs/2502.08489
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of CoT reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which are then used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at this https URL.
https://arxiv.org/abs/2502.08482
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in this https URL.
https://arxiv.org/abs/2502.08468
We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.
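The iterative loop can be sketched as follows; `rank_labels_with_llm` is a hypothetical helper that prompts the LLM to order the surviving labels, and the reduction schedule (keep ratio, stopping size) is invented rather than taken from the paper.

```python
def rank_labels_with_llm(text: str, labels: list[str]) -> list[str]:
    """Should return `labels` reordered from most to least plausible for `text`."""
    raise NotImplementedError("call an LLM with a ranking prompt here")

def label_space_reduction(text, labels, keep_ratio=0.5, min_labels=3):
    candidates = list(labels)
    while len(candidates) > min_labels:
        ranked = rank_labels_with_llm(text, candidates)
        keep = max(min_labels, int(len(ranked) * keep_ratio))
        candidates = ranked[:keep]                    # drop the least relevant classes
    return rank_labels_with_llm(text, candidates)[0]  # final zero-shot prediction
```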
https://arxiv.org/abs/2502.08436
Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
https://arxiv.org/abs/2502.08395
The multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge and therefore does not substantially boost the model's performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of the VLM effectively. To transfer the VLM to process WSI efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
https://arxiv.org/abs/2502.08391
The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
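A minimal sketch of threshold-based attention pruning in this spirit is shown below: scores under a per-head threshold are masked out before the softmax, so the corresponding V rows need not be read. The calibration procedure and the numerical compensation techniques described in the abstract are omitted, and the threshold values are placeholders.

```python
import torch
import torch.nn.functional as F

def top_theta_attention(q, k, v, theta):
    # q, k, v: (batch, heads, seq, d_head); theta: (heads,) calibrated per head
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale                  # (B, H, Tq, Tk)
    keep = scores >= theta.view(1, -1, 1, 1)                    # prune sub-threshold elements
    pruned = scores.masked_fill(~keep, float("-inf"))
    # If every element in a row falls below the threshold, fall back to the full row.
    all_pruned = keep.sum(dim=-1, keepdim=True) == 0
    pruned = torch.where(all_pruned, scores, pruned)
    attn = F.softmax(pruned, dim=-1)
    return attn @ v, keep.float().mean().item()                 # output and fraction of kept elements

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
out, kept_fraction = top_theta_attention(q, k, v, theta=torch.full((8,), 0.5))
print(out.shape, f"kept {kept_fraction:.1%} of attention elements")
```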
https://arxiv.org/abs/2502.08363
Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways -- context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we fine-tune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10\% relative gain in token-level recall while preserving the LLM's generalization capabilities.
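The two augmentations can be sketched as a small data-construction routine: each QA pair is combined with retrieved contexts of varying relevance (including an irrelevant one and an empty one the model must learn to handle) and with several paraphrases of the gold answer, while a domain tag is prepended to the prompt. `paraphrase_with_llm` and all field names are hypothetical.

```python
import random

def paraphrase_with_llm(answer: str, n: int) -> list[str]:
    raise NotImplementedError("ask an LLM for n paraphrases of the answer")

def augment_example(example, corpus, domain_tag="[FINANCE]", n_paraphrases=2):
    """example: {'question', 'answer', 'gold_passage'}; corpus: list of unrelated passages."""
    contexts = [
        example["gold_passage"],                  # relevant retrieval
        random.choice(corpus),                    # irrelevant retrieval: answer must still be correct
        "",                                       # retrieval failure: no context at all
    ]
    answers = [example["answer"]] + paraphrase_with_llm(example["answer"], n_paraphrases)
    samples = []
    for ctx in contexts:
        for ans in answers:
            prompt = f"{domain_tag} Context: {ctx}\nQuestion: {example['question']}\nAnswer:"
            samples.append({"prompt": prompt, "target": ans})
    return samples   # mixed with a replay buffer of general QA pairs during fine-tuning
```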
https://arxiv.org/abs/2502.08356
Context-aware compression techniques have gained increasing attention as model sizes continue to grow, introducing computational bottlenecks that hinder efficient deployment. A structured encoding approach was proposed to selectively eliminate redundant parameter groups while ensuring that representational fidelity was preserved across multiple layers. Contextual Compression Encoding (CCE) introduced a multi-stage encoding mechanism that dynamically restructured parameter distributions, allowing for significant reductions in memory footprint and computational complexity. Experimental evaluations demonstrated that models compressed through CCE retained linguistic expressivity and coherence, maintaining accuracy across a range of text generation and classification tasks. Layer-wise analysis revealed that middle-network layers exhibited higher compression ratios, aligning with the observation that self-attention and feed-forward transformations contained redundancies that could be reorganized without impairing functional capacity. Comparisons against conventional quantization and pruning methods confirmed that CCE provided a more balanced trade-off between efficiency and model retention, achieving reductions in energy consumption and inference latency without requiring extensive retraining. Computational efficiency improvements were particularly evident in deployment scenarios involving resource-constrained environments, where reductions in memory usage enabled more scalable implementations. Further analyses of internal network behavior showed that compressed models exhibited stable activation distributions and adapted dynamically to input variations, reinforcing the viability of structured compression strategies for optimizing large-scale architectures.
https://arxiv.org/abs/2502.08323
Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranks as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. This dataset comprises 8,000 annotated news articles, which is the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and contribute to a deeper understanding of how various opinion dimensions interact in news media.
https://arxiv.org/abs/2502.08319
Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) bidirectional constraint, which ensures consistency in pairwise object relations, and (2) transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely-used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights the further potential of our method to incorporate constraints that mitigate spatial relation hallucinations.
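The two constraints can be read as simple consistency checks over a model's pairwise answers, sketched below; `ask_relation` is a hypothetical LVLM wrapper, and the relation vocabulary is an assumption.

```python
INVERSE = {"left of": "right of", "right of": "left of", "above": "below", "below": "above"}

def ask_relation(image, a: str, b: str) -> str:
    raise NotImplementedError("prompt the LVLM: 'Where is {a} relative to {b}?'")

def bidirectional_consistent(image, a, b):
    # Ask for the relation in both directions and verify the answers are inverses.
    fwd = ask_relation(image, a, b)
    bwd = ask_relation(image, b, a)
    return bwd == INVERSE.get(fwd), fwd

def transitivity_consistent(image, a, b, c):
    # If A is left of B and B is left of C, then A must be left of C (same for other axes).
    ab = ask_relation(image, a, b)
    bc = ask_relation(image, b, c)
    ac = ask_relation(image, a, c)
    return not (ab == bc) or (ac == ab)
```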
https://arxiv.org/abs/2502.08317