In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
https://arxiv.org/abs/2602.11151
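The late chunking strategy mentioned above can be sketched in a few lines: the whole document is encoded once so every token embedding carries global context, and chunk vectors are then mean-pooled from slices of that single encoding. This is a minimal illustration of the general technique, not the pplx-embed implementation; the array shapes are assumptions.

```python
import numpy as np

def late_chunk(token_embeddings, boundaries):
    """Mean-pool contextualized token embeddings per chunk.

    token_embeddings: (num_tokens, dim) array from ONE full-document
    encoder pass, so each token already "sees" the whole document.
    boundaries: list of (start, end) token index pairs, end exclusive.
    Returns one vector per chunk via mean pooling.
    """
    chunks = [token_embeddings[s:e].mean(axis=0) for s, e in boundaries]
    return np.stack(chunks)

# Toy example: a "document" of 6 token embeddings, split into two chunks.
doc = np.arange(12, dtype=float).reshape(6, 2)
vecs = late_chunk(doc, [(0, 3), (3, 6)])
```

Contrast with naive chunking, which would encode each chunk in isolation and lose cross-chunk context before pooling.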
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training on more unique samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On the AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms a single epoch on 51,200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated: improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive, undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
https://arxiv.org/abs/2602.11149
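The stopping criterion described above is simple to operationalize: track the fraction of supervised tokens the model predicts exactly, and stop scaling epochs once it saturates. A minimal sketch, assuming the usual `-100` ignore-index convention for prompt tokens; the threshold and patience values are illustrative, not from the paper.

```python
import numpy as np

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of supervised tokens whose argmax prediction matches the label.

    logits: (num_tokens, vocab) scores; labels: (num_tokens,) token ids,
    with ignore_index marking positions excluded from the loss.
    """
    mask = labels != ignore_index
    preds = logits.argmax(axis=-1)
    return float((preds[mask] == labels[mask]).mean())

def should_stop(acc_history, threshold=0.999, patience=2):
    """Stop adding epochs once token accuracy has stayed at ~full
    memorization for `patience` consecutive epochs."""
    return len(acc_history) >= patience and all(
        a >= threshold for a in acc_history[-patience:]
    )

# Toy batch: 3 supervised tokens, 2-word vocab; the model gets 2 of 3 right.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
labels = np.array([0, 1, 1])
acc = token_accuracy(logits, labels)
```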
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
https://arxiv.org/abs/2602.11146
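A noise-calibrated Thurstone likelihood of the kind named above can be sketched as a preference probability whose comparison noise depends on each sample's diffusion state: P(a ≻ b) = Φ((r_a − r_b) / √(σ_a² + σ_b²)), with σ growing with the diffusion timestep so that comparisons on noisier states count for less. This is a reconstruction of the general Thurstone form, not the paper's exact parameterization.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def thurstone_pref(r_a, r_b, sigma_a, sigma_b):
    """Thurstone-style preference probability with per-sample uncertainty:
    P(a preferred over b) = Phi((r_a - r_b) / sqrt(sigma_a^2 + sigma_b^2)).

    In a noise-calibrated setup, sigma would be a function of the diffusion
    timestep, shrinking the effective reward margin on noisy states.
    """
    return phi((r_a - r_b) / math.sqrt(sigma_a**2 + sigma_b**2))

p = thurstone_pref(1.0, 0.0, 1.0, 1.0)
```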
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
https://arxiv.org/abs/2602.11137
Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
https://arxiv.org/abs/2602.11133
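One simple instance of the lightweight convergence signal described above is argmax stability: freeze a position once its predicted token has not changed for a few consecutive denoising steps and its confidence is high. The thresholds below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def update_frozen(history, probs, frozen, conf_tau=0.9, stable_k=3):
    """Freeze a position once its argmax token has been unchanged for
    `stable_k` consecutive denoising steps AND its confidence exceeds
    `conf_tau`.

    history: list of per-step argmax arrays (appended to in place);
    probs: (L, V) current-step probabilities; frozen: (L,) bool array.
    Frozen positions can skip all remaining denoising iterations.
    """
    history.append(probs.argmax(axis=-1))
    if len(history) < stable_k:
        return frozen
    recent = np.stack(history[-stable_k:])       # (stable_k, L)
    stable = (recent == recent[0]).all(axis=0)   # argmax unchanged
    confident = probs.max(axis=-1) >= conf_tau
    frozen |= stable & confident
    return frozen

# Toy run: position 0 is stable and confident, position 1 keeps flipping.
steps = [
    np.array([[0.95, 0.05], [0.60, 0.40]]),
    np.array([[0.96, 0.04], [0.30, 0.70]]),
    np.array([[0.97, 0.03], [0.55, 0.45]]),
]
frozen = np.zeros(2, dtype=bool)
history = []
for probs in steps:
    frozen = update_frozen(history, probs, frozen)
```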
Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.
https://arxiv.org/abs/2602.11106
Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK, which build on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset spanning a taxonomy of 112 normative domains: 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified into one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it, then expanded via Qwen3-30B-A3B-Instruct-2507 with SimHash-based fingerprinting to avoid duplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.
https://arxiv.org/abs/2602.11091
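SimHash-based fingerprinting, as used above for deduplicating expanded prompts, hashes each token to a bit pattern and votes per bit, so near-identical texts produce fingerprints with small Hamming distance. A minimal self-contained sketch (md5 is used only as a deterministic token hash; it is not implied by the paper):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over whitespace tokens: each token's hash votes +1/-1
    on every bit position; the sign of each vote tally becomes the
    fingerprint bit. Similar texts yield fingerprints with small Hamming
    distance, enabling cheap near-duplicate detection."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

f1 = simhash("the quick brown fox jumps over the lazy dog")
f2 = simhash("completely unrelated words about something else")
```

In a dedup pipeline one would discard a candidate whose fingerprint is within a small Hamming radius of any retained prompt.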
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
https://arxiv.org/abs/2602.11089
Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at this https URL.
https://arxiv.org/abs/2602.11081
We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models back to the responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63%, while switching their labels achieves a 78% reduction. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism, emerging from contaminated preference data rather than deliberate injection, provides a realistic benchmark for safety techniques.
https://arxiv.org/abs/2602.11079
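The ranking step at the heart of the method above reduces to cosine similarity between one behavior's activation-difference vector and each datapoint's activation-difference vector. A minimal sketch with toy vectors (how the activation differences are extracted from the model is not shown here):

```python
import numpy as np

def rank_datapoints(behavior_vec, datapoint_vecs):
    """Rank training datapoints by cosine similarity to a behavior.

    behavior_vec: (d,) activation-difference vector for a test behavior.
    datapoint_vecs: (n, d) activation-difference vectors, one per
    preference pair. Returns (indices most-similar-first, similarities).
    """
    b = behavior_vec / np.linalg.norm(behavior_vec)
    D = datapoint_vecs / np.linalg.norm(datapoint_vecs, axis=1, keepdims=True)
    sims = D @ b
    return np.argsort(-sims), sims

behavior = np.array([1.0, 0.0])
pairs = np.array([[0.9, 0.1],    # aligned with the behavior direction
                  [0.0, 1.0],    # orthogonal
                  [-1.0, 0.0]])  # opposed
order, sims = rank_datapoints(behavior, pairs)
```

Top-ranked datapoints are then the candidates for filtering or label-flipping in the causal retraining validation.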
Current large vision-language models (LVLMs) typically rely on text-only reasoning over a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed ''thinking with images'' paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
https://arxiv.org/abs/2602.11073
We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.
https://arxiv.org/abs/2602.11047
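The conditioning mechanism named above, adaptive layer normalization, injects the target embedding by predicting a per-layer scale and shift from it. The sketch below shows the generic AdaLN form with assumed shapes and linear maps; it is an illustration of the mechanism, not the paper's architecture.

```python
import numpy as np

def adaln(h, cond, W_scale, W_shift, eps=1e-5):
    """Adaptive layer normalization: normalize hidden states, then apply a
    scale and shift predicted from the conditioning vector (here, the
    target embedding), so the condition steers every layer.

    h: (seq, d) hidden states; cond: (c,) conditioning embedding;
    W_scale, W_shift: (c, d) linear maps (learned in a real model).
    """
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    gamma = cond @ W_scale  # (d,) predicted scale
    beta = cond @ W_shift   # (d,) predicted shift
    return h_norm * (1.0 + gamma) + beta

# With zero maps, AdaLN reduces to plain layer normalization.
h = np.array([[1.0, 3.0]])
cond = np.array([0.5, -0.2])
out = adaln(h, cond, np.zeros((2, 2)), np.zeros((2, 2)))
```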
Despite emerging research on Language Models (LMs), few approaches analyse the invertibility of LMs. That is, given an LM and a desirable target output sequence of tokens, determining which input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM, and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
https://arxiv.org/abs/2602.11044
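The central insight above, relaxing the prompt from a token sequence to a sequence of token distributions so the whole pipeline is differentiable, can be illustrated on a toy stand-in "LM". Everything here is assumed for illustration: the tiny linear model replaces a real LM, and finite differences replace backprop to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 5, 3                  # vocab size, prompt length (toy scale)
W = rng.normal(size=(V, V))  # toy "LM": maps an input distribution to logits
target = 2                   # desired output token id

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loss(prompt_logits):
    """Cross-entropy of the toy LM's output against the target token, with
    the prompt relaxed to a sequence of token DISTRIBUTIONS (softmax of
    free logit parameters) rather than discrete tokens."""
    dist = softmax(prompt_logits.reshape(L, V))  # (L, V) distributions
    out_logits = dist.mean(axis=0) @ W           # toy next-token logits
    return -np.log(softmax(out_logits)[target])

# Gradient descent over the relaxed prompt (finite-difference gradients).
x = np.zeros(L * V)
before = loss(x)
for _ in range(200):
    f0 = loss(x)
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += 1e-4
        g[i] = (loss(xp) - f0) / 1e-4
    x -= 0.1 * g
after = loss(x)
```

In the paper's setting, the same relaxation makes a real frozen LM end-to-end differentiable so prompts can be optimised with ordinary backpropagation.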
Accurate counting of surgical instruments in Operating Rooms (ORs) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress in large vision-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce a neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art counting approaches (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) on the challenging task of dense surgical instrument counting.
https://arxiv.org/abs/2602.11024
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
https://arxiv.org/abs/2602.11000
Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring the proper functioning of other features after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.
https://arxiv.org/abs/2602.10975
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
https://arxiv.org/abs/2602.03402
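The two steps named above, building an unsafe prototype subspace from language embeddings and modulating selected high-risk visual tokens toward it, can be sketched generically. The SVD-based subspace, additive amplification rule, and risk-score selection below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def unsafe_subspace(unsafe_embs, k=2):
    """Orthonormal basis (top-k right singular vectors) of mean-centered
    language embeddings of unsafe content: the 'unsafe prototype subspace'."""
    _, _, Vt = np.linalg.svd(unsafe_embs - unsafe_embs.mean(0),
                             full_matrices=False)
    return Vt[:k]  # (k, d), rows orthonormal

def modulate(tokens, basis, risk_scores, alpha=0.5, top_m=2):
    """Amplify the unsafe-subspace component of the top_m highest-risk
    visual tokens: t <- t + alpha * proj_subspace(t). All other tokens are
    left untouched to preserve semantics for cross-modal reasoning."""
    out = tokens.copy()
    for p in np.argsort(-risk_scores)[:top_m]:
        proj = basis.T @ (basis @ tokens[p])
        out[p] = tokens[p] + alpha * proj
    return out

rng = np.random.default_rng(0)
basis = unsafe_subspace(rng.normal(size=(5, 4)))
tokens = rng.normal(size=(3, 4))
risk = np.array([0.1, 0.9, 0.5])
out = modulate(tokens, basis, risk, top_m=1)
```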
Large Language Models (LLMs) are increasingly used to generate and shape cultural content, ranging from narrative writing to artistic production. While these models demonstrate impressive fluency and generative capacity, prior work has shown that they also exhibit systematic cultural biases, raising concerns about stereotyping, homogenization, and the erasure of culturally specific forms of expression. Understanding whether LLMs can meaningfully align with diverse cultures beyond the dominant ones remains a critical challenge. In this paper, we study cultural adaptation in LLMs through the lens of cooking recipes, a domain in which culture, tradition, and creativity are tightly intertwined. We build on the \textit{GlobalFusion} dataset, which pairs human recipes from different countries according to established measures of cultural distance. Using the same country pairs, we generate culturally adapted recipes with multiple LLMs, enabling a direct comparison between human and LLM behavior in cross-cultural content creation. Our analysis shows that LLMs fail to produce culturally representative adaptations. Unlike humans, the divergence of their generated recipes does not correlate with cultural distance. We further provide explanations for this gap. We show that cultural information is weakly preserved in internal model representations, that models inflate novelty in their production by misunderstanding notions such as creativity and tradition, and that they fail to identify adaptation with its associated countries and to ground it in culturally salient elements such as ingredients. These findings highlight fundamental limitations of current LLMs for culturally oriented generation and have important implications for their use in culturally sensitive applications.
https://arxiv.org/abs/2602.10964
Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a Goldilocks zone for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training.
https://arxiv.org/abs/2602.10959
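The style of Nyquist-like lower bound described above can be reconstructed from the standard RoPE frequency schedule. Assuming frequencies ω_i = base^(−2i/d) for i = 0..d/2−1, the slowest mode has period 2π·base^((d−2)/d); requiring that period to cover the context length L (no phase wrap-around) gives base ≥ (L/(2π))^(d/(d−2)). This is an illustrative reconstruction consistent with the abstract's description, not the paper's exact bound.

```python
import math

def min_base_for_context(L, d):
    """Illustrative Nyquist-style lower bound on the RoPE base.

    With standard RoPE frequencies w_i = base**(-2*i/d), the slowest mode
    (i = d/2 - 1) has period 2*pi*base**((d-2)/d). Demanding that this
    period span the full context length L to avoid aliasing yields
        base >= (L / (2*pi)) ** (d / (d - 2)).
    """
    return (L / (2 * math.pi)) ** (d / (d - 2))

# A LLaMA-like head dimension of 128: the common base of 10000 clears the
# bound at a 4096-token context but not at 128k tokens, matching the
# community practice of raising the base for long-context retrofits.
b_4k = min_base_for_context(4096, 128)
b_128k = min_base_for_context(131072, 128)
```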
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
https://arxiv.org/abs/2602.10953
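One decoding decision of the uncertainty-adaptive kind described above can be sketched as follows. The thresholds, candidate count, and the exact widen/commit rules are illustrative assumptions in the spirit of the abstract, not SOAR's actual algorithm.

```python
import numpy as np

def adaptive_step(probs, low_tau=0.5, high_tau=0.9, widen_k=3):
    """One adaptive decision over still-masked positions.

    probs: (L, V) model probabilities at the masked positions.
    If the best confidence anywhere is below low_tau, return a small set of
    alternative (position, token) candidates to explore (widened search);
    otherwise commit every position whose confidence clears high_tau in
    parallel (falling back to the single best position if none do).
    """
    conf = probs.max(axis=-1)
    if conf.max() < low_tau:
        cand = np.argsort(-conf)[:widen_k]
        return "explore", [(int(p), int(probs[p].argmax())) for p in cand]
    commit = np.flatnonzero(conf >= high_tau)
    if commit.size == 0:
        commit = np.array([int(conf.argmax())])
    return "commit", [(int(p), int(probs[p].argmax())) for p in commit]

high = np.array([[0.95, 0.05], [0.92, 0.08], [0.60, 0.40]])
mode_hi, picks_hi = adaptive_step(high)      # confident: parallel commit
low = np.array([[0.40, 0.30, 0.30], [0.35, 0.35, 0.30]])
mode_lo, picks_lo = adaptive_step(low)       # uncertain: widen the search
```

Committing many positions per step when confident is what reduces the total number of denoising iterations; widening only when uncertain is what protects quality on reasoning-heavy prompts.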