Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation step as an external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained under the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset of 29K prompt-response pairs with fine-grained annotations that provide reasonable supervision for token-level training. We then propose the Streaming Content Monitor (SCM), which is trained with dual supervision of response- and token-level labels and follows the output stream of the LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
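To make the partial-detection idea concrete, here is a minimal sketch of a streaming monitor loop, assuming a token-level classifier that emits a harm probability for each newly decoded token; the threshold, the patience heuristic, and all names are our illustration, not the paper's released implementation.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class StreamVerdict:
    harmful: bool
    tokens_seen: int

def monitor_stream(token_harm_probs: Iterable[float],
                   threshold: float = 0.5,
                   patience: int = 3) -> StreamVerdict:
    """Flag a stream as harmful (and stop early) once `patience` consecutive
    token-level harm probabilities exceed `threshold`."""
    consecutive = seen = 0
    for p in token_harm_probs:
        seen += 1
        consecutive = consecutive + 1 if p > threshold else 0
        if consecutive >= patience:
            return StreamVerdict(harmful=True, tokens_seen=seen)
    return StreamVerdict(harmful=False, tokens_seen=seen)

# The scores would come from the token-level head running alongside decoding.
print(monitor_stream([0.1, 0.2, 0.7, 0.8, 0.9]))  # harmful after 5 tokens
```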
https://arxiv.org/abs/2506.09996
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
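As a rough illustration of the robustness track's perturbations, the following torchvision pipeline mimics common imaging artifacts (lighting shifts, defocus, framing changes); the specific transforms and magnitudes are our assumptions, not the dataset's released augmentation code.

```python
import torchvision.transforms as T

robustness_augs = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3),  # lighting shifts
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),              # defocus / haze
    T.RandomRotation(degrees=10),                                 # scope orientation
    T.RandomResizedCrop(size=448, scale=(0.8, 1.0)),              # framing variation
])

# Usage: augmented = robustness_augs(pil_image)
```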
https://arxiv.org/abs/2506.09958
Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.
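The adaptive tuning step can be pictured as a simple search over the style-alignment strength, guided by the leakage localizer; the sketch below is our simplification, with `generate` and `leakage_score` as hypothetical stand-ins for the diffusion pipeline and the patch-level leakage detector.

```python
def tune_style_strength(generate, leakage_score, lo=0.0, hi=1.0, tol=0.05):
    """Binary-search the strongest style-alignment strength whose output
    shows no content leakage within the reference subject's patches."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        image = generate(style_strength=mid)
        if leakage_score(image) > 0:  # leakage localized in subject patches
            hi = mid                  # too much reference influence; back off
        else:
            lo = mid                  # clean; try stronger style alignment
    return lo
```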
https://arxiv.org/abs/2506.09916
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
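For reference, the causal quantities involved are Pearl's standard counterfactual probabilities; writing X for the presence of an intermediate step and Y for the correctness of the final answer (notation ours):

```latex
% Probability of Necessity: given that step X occurred and the answer Y was
% correct, would Y have failed had X been removed?
\mathrm{PN} = P\,(Y_{X=0} = 0 \mid X = 1,\; Y = 1)

% Probability of Sufficiency: given that X did not occur and Y was incorrect,
% would Y have been correct had X been added?
\mathrm{PS} = P\,(Y_{X=1} = 1 \mid X = 0,\; Y = 0)
```

Intuitively, steps with low PN are candidates for pruning, while low sufficiency for the conclusion signals that steps may be missing.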
https://arxiv.org/abs/2506.09853
Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at this https URL.
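Since characters can be placed on a fixed ordered set of bins, the 1D Wasserstein distance between two frequency histograms reduces to the L1 distance between their CDFs, which is cheap and differentiable. The snippet below is a minimal sketch of such a loss term under that fixed-ordering assumption, not the authors' exact formulation.

```python
import torch

def char_wasserstein_loss(pred_freq: torch.Tensor, target_freq: torch.Tensor) -> torch.Tensor:
    """pred_freq, target_freq: (num_chars,) normalized frequency histograms
    over a fixed character ordering."""
    cdf_pred = torch.cumsum(pred_freq, dim=0)
    cdf_target = torch.cumsum(target_freq, dim=0)
    return (cdf_pred - cdf_target).abs().sum()

# pred_freq could be the recognizer's softmax-averaged character marginals;
# target_freq the empirical character distribution of the training corpus.
target = torch.tensor([0.5, 0.3, 0.2])
pred = torch.tensor([0.4, 0.4, 0.2])
print(char_wasserstein_loss(pred, target))  # tensor(0.1000)
```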
https://arxiv.org/abs/2506.09846
Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.
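One simple way to realize such linguistic supervision, sketched under our own assumptions (a frozen PSLM feature extractor `pslm_embed`, a standard signal-level TSE loss, and a weighting factor `lam`), is an auxiliary feature-matching term added at training time only:

```python
import torch
import torch.nn.functional as F

def avtse_loss(est_wav, ref_wav, pslm_embed, signal_loss, lam=0.1):
    """signal_loss: the usual TSE objective (e.g., negative SI-SDR)."""
    sig = signal_loss(est_wav, ref_wav)
    with torch.no_grad():                      # PSLM stays frozen
        target_feat = pslm_embed(ref_wav)
    ling = F.mse_loss(pslm_embed(est_wav), target_feat)
    return sig + lam * ling                    # no extra cost at inference
```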
https://arxiv.org/abs/2506.09792
Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? We answer no: we introduce FreeZeV2, the second generation of FreeZe, a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state of the art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
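Contribution (ii) can be illustrated with a toy scoring function that weights geometric inliers by the agreement of their matched features; this is our reading of "feature-aware scoring", not the released code.

```python
import numpy as np

def feature_aware_score(src_pts, dst_pts, src_feat, dst_feat, R, t, inlier_thr=0.005):
    """Score a candidate pose (R, t) for correspondences src_pts -> dst_pts
    with per-point feature vectors src_feat, dst_feat."""
    projected = src_pts @ R.T + t
    residuals = np.linalg.norm(projected - dst_pts, axis=1)
    inliers = residuals < inlier_thr
    if not inliers.any():
        return 0.0
    cos = (src_feat * dst_feat).sum(1) / (
        np.linalg.norm(src_feat, axis=1) * np.linalg.norm(dst_feat, axis=1) + 1e-8)
    return float(inliers.mean() * cos[inliers].mean())  # geometry x feature agreement
```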
https://arxiv.org/abs/2506.09784
The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its high computational and memory costs pose a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by singularities in the weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
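The calibration idea can be sketched as a grid search for a clipping scale that minimizes the Frobenius norm of the weight reconstruction error; this toy version calibrates weights alone, whereas the paper's method also uses a small image batch, so treat it purely as an illustration.

```python
import torch

def quantize(w: torch.Tensor, scale: float, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def calibrate_scale(w: torch.Tensor, bits: int = 2, grid: int = 100) -> float:
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, float("inf")
    max_abs = w.abs().max().item()
    for i in range(1, grid + 1):
        scale = max_abs * i / grid / max(qmax, 1)
        err = torch.linalg.norm(w - quantize(w, scale, bits)).item()  # Frobenius norm
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

w = torch.randn(256, 256)
s = calibrate_scale(w, bits=2)
print(s, torch.linalg.norm(w - quantize(w, s, 2)).item())
```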
https://arxiv.org/abs/2506.09782
Large Language Models (LLMs) and other neural architectures have achieved impressive results across a variety of generative and classification tasks. However, they remain fundamentally ill-equipped to ensure that their outputs satisfy temporal constraints, such as those expressible in Linear Temporal Logic over finite traces (LTLf). In this paper, we introduce TRIDENT: a general and model-agnostic inference-time algorithm that guarantees compliance with such constraints without requiring any retraining. TRIDENT compiles LTLf formulas into a Deterministic Finite Automaton (DFA), which is used to guide a constrained variant of beam search. At each decoding step, transitions that would lead to constraint violations are masked, while remaining paths are dynamically re-ranked based on both the model's probabilities and the DFA's acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given LTLf constraints, and we empirically demonstrate that TRIDENT also improves output quality. We validate our approach on two distinct tasks: temporally constrained image-stream classification and controlled text generation. In both settings, TRIDENT achieves perfect constraint satisfaction, while comparisons with the state of the art show improved efficiency and strong results on standard quality metrics.
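The core decoding step can be sketched in a few lines: compile the LTLf formula to a DFA offline, then, at each step, drop continuations whose DFA state can no longer reach an accepting state within the remaining budget and re-rank the survivors by model probability. The toy DFA and vocabulary below are ours, not TRIDENT's code.

```python
import math

# Toy DFA for "eventually a" over the alphabet {a, b}.
DELTA = {("s0", "a"): "s1", ("s0", "b"): "s0",
         ("s1", "a"): "s1", ("s1", "b"): "s1"}
ACCEPTING = {"s1"}
VOCAB = ("a", "b")

def reachable(state, steps_left):
    if steps_left == 0:
        return state in ACCEPTING
    return any(reachable(DELTA[(state, t)], steps_left - 1) for t in VOCAB)

def constrained_beam_step(beams, token_logprobs, steps_left, width=2):
    """beams: list of (prefix, dfa_state, logprob)."""
    candidates = []
    for prefix, state, logp in beams:
        for tok, lp in token_logprobs.items():
            nxt = DELTA[(state, tok)]
            if reachable(nxt, steps_left - 1):   # mask constraint-violating moves
                candidates.append((prefix + [tok], nxt, logp + lp))
    return sorted(candidates, key=lambda c: -c[2])[:width]

beams = [([], "s0", 0.0)]
logps = {"a": math.log(0.4), "b": math.log(0.6)}
for step in range(3):
    beams = constrained_beam_step(beams, logps, steps_left=3 - step)
print(beams[0][0])  # guaranteed to satisfy "eventually a" within 3 tokens
```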
https://arxiv.org/abs/2506.09701
Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, these models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of these models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach on controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of the evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding, as shown on the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.
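A compact sketch of steps iii) and iv), assuming a CLIP-style dual encoder exposing `encode_image` and `encode_text` that return L2-normalized vectors; the crop grid, segment extraction, and mean aggregation are our simplifications of the paper's procedure.

```python
import torch

def structured_similarity(model, image_crops, text_segments):
    """image_crops: full image plus smaller crops; text_segments: phrases
    capturing objects, attributes, and relations from the caption."""
    img_emb = torch.stack([model.encode_image(c) for c in image_crops])   # (C, D)
    txt_emb = torch.stack([model.encode_text(t) for t in text_segments])  # (S, D)
    sims = txt_emb @ img_emb.T             # (S, C) cosine similarities
    best_per_segment, _ = sims.max(dim=1)  # match each segment to its best crop
    return best_per_segment.mean()         # aggregate into one image-text score
```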
https://arxiv.org/abs/2506.09691
It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, that is, to have a mechanism for identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine whether the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, reducing inference costs while maintaining performance.
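One plausible instantiation of layer-wise self-evaluation is a logit-lens pass: decode every hidden layer through the output head and average the resulting next-token confidences. This is our hedged sketch of the idea, not the paper's exact estimator (notably, it omits the final layer norm usually applied before the head).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def internal_confidence(model, tok, query: str) -> float:
    inputs = tok(query, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    confs = []
    for h in out.hidden_states[1:]:              # skip the embedding layer
        logits = model.lm_head(h[:, -1, :])      # project to vocabulary space
        confs.append(logits.softmax(-1).max().item())
    return sum(confs) / len(confs)               # average across layers

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(internal_confidence(model, tok, "What is the capital of France?"))
```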
https://arxiv.org/abs/2506.09669
End-to-end deep learning exhibits unmatched performance for detecting malware, but this achievement is reached by exploiting spurious correlations: features with high relevance at inference time that domain knowledge shows to be useless. While previous work highlighted that deep networks mainly focus on metadata, none investigated the phenomenon further or quantified its impact on the decision. In this work, we deepen our understanding of how spurious correlations affect deep learning for malware detection by highlighting how much models rely on the empty spaces left by the compiler, which diminishes the relevance of the compiled code. Through our analysis on a small-scale balanced dataset, we introduce a ranking of two end-to-end models to better understand which is more suitable for production use.
https://arxiv.org/abs/2506.09662
Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of the sizes encountered in applied research. We make two contributions: First, we study how each method's performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
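For intuition, the PPI point estimate of a population mean combines cheap LLM labels on the large unlabeled pool with a bias "rectifier" estimated from the small expert-labeled subset; the simulation below is a minimal worked example of that estimator.

```python
import numpy as np

def ppi_mean(llm_unlabeled, llm_labeled, expert_labeled):
    """llm_unlabeled: LLM labels on N texts; llm_labeled/expert_labeled:
    paired labels on the n expert-annotated texts."""
    rectifier = (expert_labeled - llm_labeled).mean()  # estimated LLM bias
    return llm_unlabeled.mean() + rectifier

rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.3, size=10_000).astype(float)
llm = np.clip(truth + rng.normal(0.1, 0.2, truth.shape), 0, 1)  # biased annotator
n = 200                                                         # expert budget
print(llm.mean())                             # naive estimate, biased upward
print(ppi_mean(llm[n:], llm[:n], truth[:n]))  # debiased, close to 0.3
```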
https://arxiv.org/abs/2506.09627
This article presents a method for verifying RDF triples using LLMs, with an emphasis on providing traceable arguments. Because LLMs cannot currently reliably identify the origin of the information used to construct the response to the user prompt, our approach is to avoid using internal LLM factual knowledge altogether. Instead, verified RDF statements are compared to chunks of external documents retrieved through a web search or Wikipedia. To assess the possible application of this retrieval-augmented generation (RAG) workflow on biosciences content, we evaluated 1,719 positive statements from the BioRED dataset and the same number of newly generated negative statements. The resulting precision is 88%, and recall is 44%. This indicates that the method requires human oversight. We also evaluated the method on the SNLI dataset, which allowed us to compare our approach with models specifically tuned for the natural language inference task. We demonstrate the method on Wikidata, where a SPARQL query is used to automatically retrieve statements needing verification. Overall, the results suggest that LLMs could be used for large-scale verification of statements in knowledge graphs (KGs), a task previously unfeasible due to human annotation costs.
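For the Wikidata demonstration, statements can be pulled directly from the public SPARQL endpoint; the query below is our own illustration, retrieving disease-drug treatment claims of the kind one might verify against retrieved text.

```python
import requests

SPARQL = """
SELECT ?disease ?diseaseLabel ?drug ?drugLabel WHERE {
  ?disease wdt:P2176 ?drug .   # P2176: drug or therapy used for treatment
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "rdf-verification-demo/0.1"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["diseaseLabel"]["value"], "-- treated by -->", row["drugLabel"]["value"])
```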
https://arxiv.org/abs/2409.07507
Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates knowledge graphs (KGs) into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
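Spreading activation itself is easy to picture: activation starts at the entities linked to the query and decays as it propagates through KG edges, surfacing associated concepts for query expansion. A toy sketch, with the decay, threshold, and hop count as illustrative choices:

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, hops=2):
    """graph: {entity: [neighbor, ...]}; seeds: entities linked to the query."""
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for node, act in frontier.items():
            for nb in graph.get(node, []):
                passed = act * decay
                if passed >= threshold:
                    nxt[nb] = max(nxt.get(nb, 0.0), passed)
        for node, act in nxt.items():
            activation[node] = max(activation.get(node, 0.0), act)
        frontier = nxt
    return activation

kg = {"aspirin": ["inflammation", "COX-1"], "COX-1": ["prostaglandin"]}
print(spread_activation(kg, ["aspirin"]))
# {'aspirin': 1.0, 'inflammation': 0.5, 'COX-1': 0.5, 'prostaglandin': 0.25}
```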
https://arxiv.org/abs/2506.09542
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to 2x.
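A hedged sketch of one decoding step as we read the abstract: project the vision tokens' hidden states through the language head, select the vision token whose induced distribution is closest in KL divergence to the current text distribution, and blend it into the output. The shapes, mixing weight, and toy tensors are our assumptions.

```python
import torch
import torch.nn.functional as F

def revisit_step(text_logits, vision_hidden, lm_head, alpha=0.3):
    """text_logits: (V,) next-token logits; vision_hidden: (T_v, D)."""
    p_text = F.softmax(text_logits, dim=-1)
    log_q = F.log_softmax(lm_head(vision_hidden), dim=-1)            # (T_v, V)
    kl = (p_text * (p_text.clamp_min(1e-9).log() - log_q)).sum(-1)   # KL(p || q_v)
    best = kl.argmin()                            # most relevant vision token
    refined = (1 - alpha) * p_text + alpha * log_q[best].exp()
    return refined.log()                          # refined next-token logits

# Toy usage with random tensors standing in for a real LVLM's states:
V, D, T_v = 100, 32, 5
lm_head = torch.nn.Linear(D, V, bias=False)
print(revisit_step(torch.randn(V), torch.randn(T_v, D), lm_head).shape)  # (100,)
```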
https://arxiv.org/abs/2506.09522
Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, our model exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. The model furthermore scales more effectively: the 1.3B-parameter version gains 7.22% in average accuracy over its 320M counterpart (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
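RoPE itself is standard; the unified scheme amounts to applying one consistent rotary map to both the attention queries/keys and the state-space inputs. Below is a minimal half-split RoPE implementation, our sketch of the shared positional map rather than the paper's code.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, dim) with even dim; rotate feature pairs by position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (L, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(4096, 64)
q, k = rope(x), rope(x)   # attention path
ssm_in = rope(x)          # state-space path shares the same positional map
print(q.shape)            # torch.Size([4096, 64])
```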
https://arxiv.org/abs/2506.09507
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at this https URL.
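Two small illustrations of the mechanics: floating-point addition is not associative at low precision, and a LayerCast-style layer keeps 16-bit storage while doing all arithmetic in FP32. The class is our sketch of the idea described above, not the released pipeline.

```python
import torch

# (1) Non-associativity: grouping changes the result in bfloat16.
one = torch.tensor(1.0, dtype=torch.bfloat16)
eps = torch.tensor(2.0 ** -8, dtype=torch.bfloat16)
print((one + eps) + eps)   # tensor(1., dtype=torch.bfloat16)
print(one + (eps + eps))   # tensor(1.0078, dtype=torch.bfloat16)

# (2) LayerCast idea: store weights in 16 bits, compute in FP32.
class LayerCastLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features, dtype=torch.bfloat16))

    def forward(self, x):
        return x.float() @ self.weight.float().t()  # all math in FP32

print(LayerCastLinear(8, 4)(torch.randn(2, 8)).dtype)  # torch.float32
```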
https://arxiv.org/abs/2506.09501
Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
https://arxiv.org/abs/2506.09498
We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference than state-of-the-art methods based on AR Transformers and 112x faster inference than diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
https://arxiv.org/abs/2506.09482