Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
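For readers less familiar with the classical mechanism that VRS verbalizes, the sketch below runs textbook rejection sampling for a Bernoulli target with a Bernoulli proposal; the specific target and proposal probabilities are illustrative assumptions, and in VRS the numeric accept/reject step is instead carried out by the LLM through a natural-language prompt.

```python
import random

def rejection_sample_bernoulli(p_target: float, q_proposal: float) -> int:
    """Draw one sample from Bernoulli(p_target) using Bernoulli(q_proposal) as proposal.

    Classical rejection sampling: propose x ~ q, accept with probability
    p(x) / (M * q(x)), where M bounds p(x) / q(x) over x in {0, 1}.
    VRS verbalizes exactly this accept/reject decision and lets the LLM carry
    it out in natural language; the numbers here are illustrative only.
    """
    target = {1: p_target, 0: 1.0 - p_target}
    proposal = {1: q_proposal, 0: 1.0 - q_proposal}
    M = max(target[x] / proposal[x] for x in (0, 1))   # envelope constant
    while True:
        x = 1 if random.random() < q_proposal else 0   # propose x ~ Bernoulli(q)
        if random.random() <= target[x] / (M * proposal[x]):
            return x                                   # accept; otherwise propose again

# Sanity check: accepted samples should follow Bernoulli(0.3).
samples = [rejection_sample_bernoulli(p_target=0.3, q_proposal=0.5) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 0.3
```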
https://arxiv.org/abs/2506.09998
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy subsequent moderation as an external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained with the full-detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
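As a rough illustration of the partial-detection loop described above (not the authors' implementation), the following sketch streams tokens through a hypothetical token-level harmfulness scorer and stops generation as soon as a threshold is crossed; `score_token` and the threshold are placeholders standing in for the trained SCM.

```python
from typing import Callable, Iterable

def stream_with_moderation(
    tokens: Iterable[str],
    score_token: Callable[[list[str], str], float],  # hypothetical token-level scorer
    threshold: float = 0.5,
) -> tuple[str, bool]:
    """Follow an LLM's output stream and early-stop when harmfulness is detected.

    Sketch of the partial-detection idea only: a real streaming content monitor
    is a trained model with response- and token-level supervision.
    Returns (emitted_text, was_stopped).
    """
    emitted: list[str] = []
    for token in tokens:
        if score_token(emitted, token) >= threshold:
            return "".join(emitted), True   # stop before emitting the flagged token
        emitted.append(token)
    return "".join(emitted), False

# Toy usage with a keyword-based stand-in for the learned scorer.
toy_scorer = lambda ctx, tok: 1.0 if "attack" in tok.lower() else 0.0
text, stopped = stream_with_moderation(["Here", " is", " how", " to", " attack", "..."], toy_scorer)
print(stopped, repr(text))  # True 'Here is how to'
```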
https://arxiv.org/abs/2506.09996
Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy, and false-positive rate. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and produced the fewest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
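For reference, the reported metrics follow their standard confusion-matrix definitions, with TP, FP, TN, and FN denoting true/false positives/negatives for the toxic class:

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad
\mathrm{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP+TN}
```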
https://arxiv.org/abs/2506.09992
Text-guided image editing, fueled by recent advances in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluating text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.
https://arxiv.org/abs/2506.09988
Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair: a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, compared to random performance at 25%.
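The paired scoring rule can be made concrete with a short sketch (the data layout here is assumed for illustration): a model is credited for a pair only when both members of the minimal-change pair are answered correctly.

```python
def paired_accuracy(pairs):
    """Score MVP-style minimal-change pairs.

    `pairs` is a list of ((pred_a, gold_a), (pred_b, gold_b)) tuples, one tuple
    per minimal-change pair. A pair counts as correct only when BOTH answers
    match, so a model that ignores the video and always returns the same option
    scores zero on every pair, since the two members have opposing gold answers.
    """
    correct = sum(
        1
        for (pred_a, gold_a), (pred_b, gold_b) in pairs
        if pred_a == gold_a and pred_b == gold_b
    )
    return correct / len(pairs) if pairs else 0.0

# Toy usage: the model gives the same answer for both videos in the pair.
print(paired_accuracy([(("A", "A"), ("A", "B"))]))  # 0.0
```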
https://arxiv.org/abs/2506.09987
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
https://arxiv.org/abs/2506.09985
Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. With this approach, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
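As an illustration of the kind of simplified, CoNLL-U-like output such a strategy targets (the exact column set used in the paper may differ), a parse of "The cat sleeps" with columns ID, FORM, UPOS, HEAD, and DEPREL could be serialized as:

```text
# text = The cat sleeps
1   The     DET    2   det
2   cat     NOUN   3   nsubj
3   sleeps  VERB   0   root
```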
https://arxiv.org/abs/2506.09983
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
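For context, the sparse autoencoder at the heart of SAE-Tuning can be as simple as an overcomplete encoder-decoder with an L1 sparsity penalty on its code; the PyTorch sketch below shows only that component, with assumed dimensions, and not the SAE-guided fine-tuning procedure itself.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over model hidden states (illustrative; dimensions assumed).

    SAE-Tuning trains such an SAE on a source model's activations and then uses
    it to guide supervised fine-tuning of a target model; only the autoencoder
    itself is sketched here.
    """
    def __init__(self, d_model: int = 1536, d_dict: int = 8 * 1536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))   # sparse feature activations
        return self.decoder(z), z

def sae_loss(x_hat, x, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the code.
    return ((x_hat - x) ** 2).mean() + l1_coeff * z.abs().mean()

sae = SparseAutoencoder()
h = torch.randn(32, 1536)                 # a batch of hidden states
x_hat, z = sae(h)
print(sae_loss(x_hat, h, z).item())
```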
https://arxiv.org/abs/2506.09967
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
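As a concrete picture of the elementary drawing operations mentioned above (the paper's actual tool interface may differ), annotating a bounding box and an auxiliary line on a frame could look like the following PIL sketch.

```python
from PIL import Image, ImageDraw

def annotate(frame: Image.Image, box: tuple[int, int, int, int],
             line: tuple[int, int, int, int]) -> Image.Image:
    """Draw a bounding box and an auxiliary line on a copy of the frame.

    Illustrative only: the paradigm exposes such elementary operations so the
    model can express intermediate spatial reasoning directly in visual space.
    """
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    draw.rectangle(box, outline="red", width=3)   # annotate an object
    draw.line(line, fill="blue", width=2)         # auxiliary line between regions
    return out

img = Image.new("RGB", (320, 240), "white")
annotated = annotate(img, box=(40, 60, 120, 140), line=(80, 100, 260, 200))
annotated.save("annotated_frame.png")
```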
https://arxiv.org/abs/2506.09965
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
https://arxiv.org/abs/2506.09958
Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) to distinguish between instructions and data in their inputs. Despite numerous defense proposals, systematic evaluation against adaptive adversaries remains limited, even though successful attacks can have wide security and privacy implications and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research towards practical structural solutions to prompt injection.
https://arxiv.org/abs/2506.09956
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
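A rough sketch of the scoring rule described above, not the authors' code: given one layer's attention weights, the selected heads' attention mass from query tokens to each context chunk serves as that chunk's retrieval score; tensor shapes, chunking, and the single-layer aggregation are assumptions.

```python
import torch

def qr_retrieval_scores(attn, qr_heads, query_positions, chunk_spans):
    """Score context chunks by accumulated attention mass of selected heads.

    attn:            [n_heads, seq_len, seq_len] attention weights from one layer
                     (rows = attending positions, columns = attended positions).
    qr_heads:        indices of the identified query-focused retrieval heads.
    query_positions: token positions of the input query.
    chunk_spans:     list of (start, end) token spans, one per context chunk.

    Illustrative sketch of "accumulated attention mass as retrieval score";
    the paper's exact aggregation across layers and heads may differ.
    """
    q_attn = attn[qr_heads][:, query_positions, :]   # [|H|, |Q|, seq_len]
    mass_per_token = q_attn.sum(dim=(0, 1))          # attention mass per context token
    return [mass_per_token[s:e].sum().item() for s, e in chunk_spans]

# Toy usage with random "attention" just to show the shapes involved.
attn = torch.rand(16, 128, 128).softmax(dim=-1)
scores = qr_retrieval_scores(attn, qr_heads=[3, 7],
                             query_positions=list(range(120, 128)),
                             chunk_spans=[(0, 40), (40, 80), (80, 120)])
print(scores)
```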
https://arxiv.org/abs/2506.09944
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at this https URL.
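The hybrid verification signal can be sketched as follows; the rule checks, the judge interface, and the equal weighting are placeholders rather than VerIF's actual recipe.

```python
def verify(instruction: str, response: str, rule_checks, llm_judge) -> float:
    """Combine rule-based checks with an LLM judge into one reward signal.

    rule_checks: callables (instruction, response) -> bool for hard constraints
                 (length, format, required keywords, ...).
    llm_judge:   callable returning a 0/1 judgment from a large reasoning model.
    Equal weighting of the two components is an assumption for this sketch.
    """
    rule_score = sum(c(instruction, response) for c in rule_checks) / max(len(rule_checks), 1)
    judge_score = float(llm_judge(instruction, response))
    return 0.5 * rule_score + 0.5 * judge_score

# Toy usage with a length constraint and a stubbed judge.
checks = [lambda ins, r: len(r.split()) <= 50]
judge = lambda ins, r: 1
print(verify("Answer in at most 50 words.", "Short answer.", checks, judge))  # 1.0
```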
https://arxiv.org/abs/2506.09942
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and are not designed to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier to reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
https://arxiv.org/abs/2506.09930
Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.
https://arxiv.org/abs/2506.09902
As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model's ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons, those that are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs' language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.
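One simple way to operationalize the shared/exclusive split described above (the firing criterion and threshold are assumptions, not the paper's exact procedure) is to mark a neuron as related to a language when its activation frequency on that language's text exceeds a threshold, then count how many languages each neuron serves.

```python
import numpy as np

def categorize_neurons(act_freq: np.ndarray, thresh: float = 0.9):
    """Split neurons into shared vs. exclusive language-related sets.

    act_freq: [n_languages, n_neurons] array, where entry (l, j) is the fraction
    of language-l tokens on which neuron j fires. A neuron counts as "related"
    to a language when that fraction exceeds `thresh`. Both the firing criterion
    and the threshold are illustrative assumptions.
    """
    related = act_freq > thresh             # [n_languages, n_neurons] booleans
    n_langs = related.sum(axis=0)           # how many languages each neuron serves
    shared = np.where(n_langs >= 2)[0]      # active across multiple languages
    exclusive = np.where(n_langs == 1)[0]   # specific to a single language
    return shared, exclusive

shared, exclusive = categorize_neurons(np.random.rand(4, 1000))
print(len(shared), len(exclusive))
```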
https://arxiv.org/abs/2506.09890
We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
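One concrete instantiation of a distributional distance between prompt and response hidden states is a kernel MMD; the sketch below uses a fixed RBF kernel rather than the paper's deep learnable kernels, so it illustrates only the scoring idea.

```python
import torch

def rbf_mmd2(x, y, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two sets of hidden states under an RBF kernel.

    x: [n, d] prompt hidden states; y: [m, d] response hidden states.
    Used here as an illustrative hallucination score: per the paper's finding,
    hallucinated responses tend to stay closer to their prompts, so a LOWER
    distance is the suspicious direction. A fixed RBF kernel is an assumption
    of this sketch; the paper learns deep kernels instead.
    """
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

prompt_h = torch.randn(24, 768)     # toy hidden states (shapes assumed)
response_h = torch.randn(30, 768)
print(rbf_mmd2(prompt_h, response_h).item())
```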
https://arxiv.org/abs/2506.09886
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
https://arxiv.org/abs/2506.09883
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
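For readers unfamiliar with the causal quantities involved, the standard probabilities of necessity and sufficiency (in Pearl's notation, for a binary cause X, e.g. including a given reasoning step, and outcome Y, e.g. reaching the correct answer) are shown below; the paper adapts these notions to individual CoT steps.

```latex
\mathrm{PN} = P\!\left(Y_{x'} = y' \mid X = x,\; Y = y\right), \qquad
\mathrm{PS} = P\!\left(Y_{x} = y \mid X = x',\; Y = y'\right)
```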
https://arxiv.org/abs/2506.09853
Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods that attempt to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We find that, while zero-shot performance on LOR is promising, performance on DTOR lags behind, leaving room for specialized architectures and future work.
https://arxiv.org/abs/2506.09847