Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at this https URL and our code is available at this https URL.
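To make the schedule-search idea concrete, here is a minimal, self-contained sketch of a genetic algorithm evolving binary per-step caching schedules toward a quality-latency Pareto front. The step count, population size, and stand-in quality proxy are illustrative assumptions; ECAD itself scores schedules by generating images for calibration prompts.

```python
import random

STEPS = 20          # diffusion inference steps (assumption for illustration)
POP, GENS = 32, 40  # small GA budget

def random_schedule():
    # 1 = recompute transformer features at this step, 0 = reuse cached features
    return [random.randint(0, 1) for _ in range(STEPS)]

def cost(s):
    return sum(s) / STEPS            # fraction of steps actually computed

def quality(s):
    # Stand-in for the expensive part: in ECAD this would be an image-quality
    # metric on calibration prompts. Here we simply penalize long cached runs.
    longest = run = 0
    for bit in s:
        run = 0 if bit else run + 1
        longest = max(longest, run)
    return 1.0 - longest / STEPS

def dominates(a, b):
    return (quality(a) >= quality(b) and cost(a) <= cost(b)
            and (quality(a) > quality(b) or cost(a) < cost(b)))

def pareto_front(pop):
    return [s for s in pop if not any(dominates(t, s) for t in pop if t is not s)]

def crossover(a, b):
    cut = random.randrange(1, STEPS)
    return a[:cut] + b[cut:]

def mutate(s, p=0.05):
    return [1 - bit if random.random() < p else bit for bit in s]

population = [random_schedule() for _ in range(POP)]
for _ in range(GENS):
    front = pareto_front(population)
    children = [mutate(crossover(*random.sample(front, 2))) if len(front) > 1
                else mutate(random.choice(front)) for _ in range(POP - len(front))]
    population = front + children

for s in sorted(pareto_front(population), key=cost):
    print(f"cost={cost(s):.2f}  quality_proxy={quality(s):.2f}  schedule={s}")
```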
https://arxiv.org/abs/2506.15682
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
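A minimal sketch of the stage-one selection idea: score each candidate caption with a value estimate minus a margin-based penalty that activates only for low-confidence generations, then keep the best one. The hinge-style penalty and the stub scorers are assumptions, not the paper's exact formulation.

```python
def margin_adjusted_score(value, confidence, margin=0.6, penalty_weight=1.0):
    """Reward = value minus a penalty that activates only when confidence in a
    continuation falls below the margin (assumption: hinge-style penalty)."""
    return value - penalty_weight * max(0.0, margin - confidence)

def select_caption(candidates, value_fn, confidence_fn):
    # Stage 1 of the two-stage scheme: a single pass that keeps the
    # highest-scoring caption among diverse candidates.
    scored = [(margin_adjusted_score(value_fn(c), confidence_fn(c)), c)
              for c in candidates]
    return max(scored)[1]

# Toy usage with stub scorers standing in for the temporal-difference value model.
captions = ["a dog on a beach", "a dog", "two dogs playing near the ocean"]
best = select_caption(captions,
                      value_fn=lambda c: len(c.split()) / 10,   # placeholder value
                      confidence_fn=lambda c: 0.9 if "dog" in c else 0.3)
print(best)
```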
https://arxiv.org/abs/2506.15649
Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning. Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential. Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model's representation space. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions. Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance. Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner.
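A minimal sketch of training-free activation steering: a forward hook shifts a block's hidden states along a single normalized direction. The layer choice, the random direction, and the scale are placeholders; in the paper the direction is extracted from the model's own representations of concise versus verbose reasoning.

```python
import torch

def add_steering_hook(block: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Register a forward hook that shifts the block's output hidden states
    along one direction (a sketch of activation steering; the actual direction
    would be estimated from data rather than random)."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return block.register_forward_hook(hook)

# Toy usage on a stand-in "block": a linear layer over a 16-dim hidden state.
hidden_dim = 16
block = torch.nn.Linear(hidden_dim, hidden_dim)
handle = add_steering_hook(block, direction=torch.randn(hidden_dim), alpha=2.0)
x = torch.randn(4, hidden_dim)
print(block(x).shape)   # steering is applied inside the forward pass
handle.remove()
```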
https://arxiv.org/abs/2506.15647
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at this https URL.
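A minimal sketch of the inference-time merge described above, where the two trained low-rank branches are folded into a frozen base weight so restoration costs a single diffusion step. The simple additive merge and the toy shapes are assumptions.

```python
import torch

def merge_dual_lora(w_base: torch.Tensor,
                    c_lora: tuple[torch.Tensor, torch.Tensor],
                    d_lora: tuple[torch.Tensor, torch.Tensor],
                    scale: float = 1.0) -> torch.Tensor:
    """Fold Consistency-LoRA and Detail-LoRA updates into one dense weight.
    Each LoRA is a low-rank pair (A: r x in, B: out x r); merged weight is
    W + scale * (B_c @ A_c + B_d @ A_d). A sketch of the merge idea only."""
    a_c, b_c = c_lora
    a_d, b_d = d_lora
    return w_base + scale * (b_c @ a_c + b_d @ a_d)

# Toy shapes: a 64x64 layer with rank-4 adapters.
out_dim, in_dim, r = 64, 64, 4
w = torch.randn(out_dim, in_dim)
c = (torch.randn(r, in_dim), torch.randn(out_dim, r))
d = (torch.randn(r, in_dim), torch.randn(out_dim, r))
print(merge_dual_lora(w, c, d).shape)   # torch.Size([64, 64])
```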
https://arxiv.org/abs/2506.15591
Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: this https URL
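A minimal sketch of the refinement loop's edit step: the draft scene graph is a set of (subject, predicate, object) triples, and the second model's proposals are applied as small ADD/DELETE operations rather than regenerating the full graph. The edit format shown is an illustrative assumption.

```python
def apply_edits(graph: set[tuple[str, str, str]],
                edits: list[tuple[str, tuple[str, str, str]]]) -> set:
    """Apply ADD/DELETE edit operations to a draft scene graph of
    (subject, predicate, object) triples. The edit format a refiner model
    would emit is an illustrative assumption."""
    for op, triple in edits:
        if op == "ADD":
            graph.add(triple)
        elif op == "DELETE":
            graph.discard(triple)
    return graph

draft = {("man", "holding", "umbrella"), ("umbrella", "is", "red")}
proposed = [("ADD", ("man", "standing on", "sidewalk")),
            ("DELETE", ("umbrella", "is", "red")),
            ("ADD", ("umbrella", "is", "blue"))]
print(sorted(apply_edits(draft, proposed)))
```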
https://arxiv.org/abs/2506.15583
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time (computation that would otherwise go unused).
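A minimal sketch of the timing idea, with a stub generator in place of the LLM: draft a response from the partial transcript while the user is still speaking, then hand the draft's first sentence to TTS if the speculation still matches the final input, otherwise regenerate. The simple prefix check stands in for the paper's speculative-decoding verification.

```python
def first_sentence(text: str) -> str:
    for mark in (".", "!", "?"):
        if mark in text:
            return text[: text.index(mark) + 1]
    return text

def respond_with_prediction(partial_input: str, final_input: str, generate):
    """Speculatively draft while the user is still speaking, then decide
    whether the draft's first sentence can go to TTS immediately. The prefix
    test is a simplified stand-in for speculative verification."""
    draft = generate(partial_input)                 # runs during user speech
    if final_input.startswith(partial_input):       # speculation still valid?
        return first_sentence(draft), "reused speculative draft"
    return first_sentence(generate(final_input)), "regenerated"

# Toy usage with a stub generator in place of the LLM.
gen = lambda prompt: "Sure, I can help with that. Here are the details."
print(respond_with_prediction("Can you help me plan",
                              "Can you help me plan a trip to Kyoto?", gen))
```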
https://arxiv.org/abs/2506.15556
Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
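A loop-based, single-head reference sketch of the core idea: softmax attention over a causal window, plus a recurrent linear-attention summary that absorbs tokens as they fall out of the window. The feature map and the fixed 50/50 mixing are illustrative simplifications of the learned formulation.

```python
import torch
import torch.nn.functional as F

def rattention_reference(q, k, v, window: int):
    """Single-head reference: softmax attention inside a causal window plus a
    recurrent linear-attention summary of out-of-window tokens. Feature map
    and equal mixing are illustrative choices only."""
    T, d = q.shape
    phi = lambda x: F.elu(x) + 1.0                 # positive feature map
    S = torch.zeros(d, d)                          # sum of outer(phi(k), v) outside window
    z = torch.zeros(d)                             # normalizer: sum of phi(k)
    out = torch.zeros_like(v)
    for t in range(T):
        leave = t - window                         # token leaving the window
        if leave >= 0:
            S += torch.outer(phi(k[leave]), v[leave])
            z += phi(k[leave])
        lo = max(0, t - window + 1)
        local = F.softmax((k[lo:t + 1] @ q[t]) / d ** 0.5, dim=0) @ v[lo:t + 1]
        if leave >= 0:                             # anything outside the window yet?
            glob = (phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-6)
            out[t] = 0.5 * (local + glob)
        else:
            out[t] = local
    return out

q = k = v = torch.randn(10, 8)
print(rattention_reference(q, k, v, window=4).shape)   # torch.Size([10, 8])
```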
https://arxiv.org/abs/2506.15545
This paper studies the problem of learning computable functions in the limit by extending Gold's inductive inference framework to incorporate \textit{computational observations} and \textit{restricted input sources}. Complementary to the traditional Input-Output Observations, we introduce Time-Bound Observations and Policy-Trajectory Observations to study the learnability of general recursive functions under more realistic constraints. While input-output observations do not suffice for learning the class of general recursive functions in the limit, we overcome this learning barrier by imposing computational complexity constraints or supplementing with approximate time-bound observations. Further, we build a formal framework around observations of \textit{computational agents} and show that learning computable functions from policy trajectories reduces to learning rational functions from input and output, thereby revealing interesting connections to finite-state transducer inference. On the negative side, we show that computable or polynomial-mass characteristic sets cannot exist for the class of linear-time computable functions even for policy-trajectory observations.
https://arxiv.org/abs/2506.15543
Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. RePCS compares two inference paths, (i) a parametric path using only the query and (ii) a retrieval-augmented path using both the query and retrieved context, by computing the Kullback-Leibler (KL) divergence between their output distributions. A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization. This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass. We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates. On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU. RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications.
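A minimal sketch of the diagnostic: compare the output distribution from a query-only pass with the one from a query-plus-context pass via KL divergence, and flag near-zero divergence as potential memorization. The toy distributions and fixed threshold are placeholders; the paper calibrates the threshold with PAC-style guarantees.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def retrieval_contamination_check(p_parametric, p_augmented, threshold=0.05):
    """Small KL between the query-only and retrieval-augmented output
    distributions suggests the context was ignored (possible memorization).
    The threshold here is a placeholder, not a calibrated value."""
    kl = kl_divergence(p_augmented, p_parametric)
    return kl, kl < threshold

# Toy distributions over a 5-token vocabulary from the two inference paths.
p_query_only = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
p_with_context = np.array([0.68, 0.12, 0.10, 0.05, 0.05])
print(retrieval_contamination_check(p_query_only, p_with_context))
```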
https://arxiv.org/abs/2506.15513
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
https://arxiv.org/abs/2506.15498
Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
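A minimal sketch of the supervision format: the retrieved context is prepended to the response in the input, while the label positions covering the context are set to an ignore index so the loss is computed only over response tokens. The -100 convention follows PyTorch's cross-entropy default; tokenization details are assumptions.

```python
import torch

IGNORE_INDEX = -100   # positions with this label are excluded from the loss

def build_cings_example(context_ids: list[int], response_ids: list[int]):
    """Concatenate context + response as the input, and mask the context span
    in the labels so gradients come only from response tokens (a sketch of the
    training format; tokenization details are assumptions)."""
    input_ids = torch.tensor(context_ids + response_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(context_ids) + response_ids)
    return input_ids, labels

# Toy token ids: 5 context tokens followed by 3 response tokens.
inp, lab = build_cings_example([11, 12, 13, 14, 15], [21, 22, 23])
print(inp.tolist())   # [11, 12, 13, 14, 15, 21, 22, 23]
print(lab.tolist())   # [-100, -100, -100, -100, -100, 21, 22, 23]
```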
https://arxiv.org/abs/2506.15480
We propose co-creative learning as a novel paradigm where humans and AI, i.e., biological and artificial agents, mutually integrate their partial perceptual information and knowledge to construct shared external representations, a process we interpret as symbol emergence. Unlike traditional AI teaching based on unilateral knowledge transfer, this addresses the challenge of integrating information from inherently different modalities. We empirically test this framework using a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG), a decentralized Bayesian inference mechanism. In an online experiment, 69 participants played a joint attention naming game (JA-NG) with one of three computer agent types (MH-based, always-accept, or always-reject) under partial observability. Results show that human-AI pairs with an MH-based agent significantly improved categorization accuracy through interaction and achieved stronger convergence toward a shared sign system. Furthermore, human acceptance behavior aligned closely with the MH-derived acceptance probability. These findings provide the first empirical evidence for co-creative learning emerging in human-AI dyads via MHNG-based interaction. This suggests a promising path toward symbiotic AI systems that learn with humans, rather than from them, by dynamically aligning perceptual experiences, opening a new venue for symbiotic AI alignment.
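A minimal sketch of the Metropolis-Hastings acceptance step underlying the naming game: the listener accepts the speaker's proposed sign with probability min(1, ratio of its own likelihoods for the proposed versus current sign). The scalar likelihoods below are toy stand-ins for the listener's perceptual model.

```python
import numpy as np

def mh_acceptance_prob(lik_proposed: float, lik_current: float) -> float:
    """Metropolis-Hastings acceptance used in the naming game: accept the
    speaker's sign with probability min(1, p(proposed)/p(current)) under the
    listener's own perceptual model."""
    if lik_current <= 0:
        return 1.0
    return min(1.0, lik_proposed / lik_current)

rng = np.random.default_rng(0)
# Toy listener likelihoods p(observation | sign) for the two candidate signs.
p_proposed, p_current = 0.30, 0.12
accept = rng.random() < mh_acceptance_prob(p_proposed, p_current)
print(f"accept speaker's sign: {accept}")
```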
https://arxiv.org/abs/2506.15468
Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of the theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves semi-supervised tabular learning performance over standard single-pass inference.
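The estimator family in point (2) is easy to compute directly; the sketch below evaluates $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$ with missing labels zeroed out. The coefficients are arbitrary illustrative values, whereas in the paper they are implicitly chosen by the trained transformer.

```python
import numpy as np

def polynomial_estimator(X: np.ndarray, y: np.ndarray, coeffs) -> np.ndarray:
    """Compute sum_i a_i (X^T X)^i X^T y. Entries of y for unlabeled points
    are set to zero, matching the partially-observed label vector in the text."""
    G = X.T @ X
    base = X.T @ y
    beta = np.zeros(X.shape[1])
    power = np.eye(X.shape[1])
    for a in coeffs:
        beta += a * power @ base
        power = power @ G
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ rng.normal(size=5))
y[rng.random(100) < 0.5] = 0.0          # half the labels are missing -> zeroed
print(polynomial_estimator(X, y, coeffs=[1.0, -0.01, 1e-4]))   # toy coefficients
```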
https://arxiv.org/abs/2506.15329
Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: this https URL
https://arxiv.org/abs/2506.15312
The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations. Our code is available at this https URL.
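A minimal sketch of how suffix-based completion pairs for training the synthetic-data generator might be formed: each suspect document is split into a prefix prompt and a suffix target. The word-level split and 50% prefix length are assumptions, not the paper's exact recipe.

```python
def suffix_completion_pairs(documents, prefix_fraction=0.5):
    """Turn suspect-set documents into (prompt, target) pairs for training a
    generator that produces in-distribution synthetic held-out text. The
    word-level split and 50% prefix length are illustrative assumptions."""
    pairs = []
    for doc in documents:
        words = doc.split()
        if len(words) < 4:
            continue
        cut = max(1, int(len(words) * prefix_fraction))
        pairs.append((" ".join(words[:cut]), " ".join(words[cut:])))
    return pairs

docs = ["the quick brown fox jumps over the lazy dog",
        "large language models memorize parts of their training data"]
for prompt, target in suffix_completion_pairs(docs):
    print(f"PROMPT: {prompt!r}\nTARGET: {target!r}\n")
```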
https://arxiv.org/abs/2506.15271
Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance the accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze and excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use in medical imaging.
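A minimal sketch of the HE-friendly activation replacement: fit a low-degree polynomial to the activation over a bounded input range by least squares, so the nonlinearity becomes additions and multiplications an HE scheme can evaluate. The degree, fitting interval, and ReLU target are illustrative assumptions.

```python
import numpy as np

def fit_poly_activation(act, degree: int = 3, lo: float = -5.0, hi: float = 5.0):
    """Least-squares fit of a low-degree polynomial to an activation over a
    bounded range, yielding an HE-evaluable replacement. Degree and range are
    illustrative assumptions."""
    x = np.linspace(lo, hi, 2001)
    coeffs = np.polyfit(x, act(x), degree)
    return np.poly1d(coeffs)

relu = lambda x: np.maximum(x, 0.0)
poly_relu = fit_poly_activation(relu, degree=3)
xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu :", relu(xs))
print("poly :", np.round(poly_relu(xs), 3))   # HE-friendly approximation
```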
https://arxiv.org/abs/2506.15258
Large pre-trained Transformer models achieve state-of-the-art results across diverse language and reasoning tasks, but full fine-tuning incurs substantial storage, memory, and computational overhead. Parameter-efficient fine-tuning (PEFT) methods mitigate these costs by learning only a small subset of task-specific parameters, yet existing approaches either introduce inference-time latency (adapter modules), suffer from suboptimal convergence (randomly initialized low-rank updates), or rely on fixed rank choices that may not match task complexity (Kronecker-based decompositions). We propose SoKA (SVD on Kronecker Adaptation), a novel PEFT strategy that combines Kronecker-product tensor factorization with SVD-driven initialization and spectrum-aware dynamic rank selection. Our Kronecker-Product SVD (KPSVD) procedure extracts principal components of the full weight update into compact Kronecker factors, while an adaptive rank selection algorithm uses energy-threshold and elbow-point criteria to prune negligible components. Empirical evaluation on LLaMA2-7B across arithmetic reasoning (GSM8K), formal mathematics (MATH), and code generation (MBPP) demonstrates that SoKA requires only 0.99M trainable parameters, 25% fewer than LoRA/PiSSA, while matching or exceeding baseline performance. Moreover, SoKA exhibits faster convergence and more stable gradients, highlighting its robustness and efficiency for large-scale model adaptation.
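A minimal sketch of the two ingredients named above: Van Loan's rearrangement turns the nearest-Kronecker-product problem into an SVD, and an energy threshold picks how many components to keep (the paper additionally uses an elbow criterion). Block sizes and the threshold are assumptions.

```python
import numpy as np

def kpsvd(W, m1, n1, m2, n2, energy=0.95):
    """Kronecker-product SVD via Van Loan's rearrangement: reshape W
    (m1*m2 x n1*n2) into R (m1*n1 x m2*n2), take its SVD, and keep the smallest
    rank whose squared singular values capture `energy` of the total."""
    R = (W.reshape(m1, m2, n1, n2)        # split row/col indices into blocks
           .transpose(0, 2, 1, 3)          # reorder to (i, j, k, l)
           .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    r = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
    return [(np.sqrt(sv) * U[:, t].reshape(m1, n1),
             np.sqrt(sv) * Vt[t].reshape(m2, n2))
            for t, sv in enumerate(s[:r])]

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 4, 4, 8, 8
W = rng.normal(size=(m1 * m2, n1 * n2))
facs = kpsvd(W, m1, n1, m2, n2, energy=0.90)
approx = sum(np.kron(A, B) for A, B in facs)
print(len(facs), np.linalg.norm(W - approx) / np.linalg.norm(W))
```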
https://arxiv.org/abs/2506.15251
Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2506.15244
In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
https://arxiv.org/abs/2506.15239
Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.
https://arxiv.org/abs/2506.15200