Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
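For intuition, here is a minimal numeric sketch of the classical rejection-sampling mechanism that VRS verbalizes; in the paper the propose/accept step is carried out by the LLM through a natural-language prompt, whereas everything below (including the Bernoulli(0.5) proposal) is an illustrative assumption, not the paper's prompt or protocol.

```python
import random

def rejection_sample_bernoulli(p_target: float, p_proposal: float = 0.5) -> int:
    """Draw one sample from Bernoulli(p_target) via rejection sampling
    with a Bernoulli(p_proposal) proposal."""
    # Envelope constant M >= max_x target(x) / proposal(x)
    M = max(p_target / p_proposal, (1 - p_target) / (1 - p_proposal))
    while True:
        x = 1 if random.random() < p_proposal else 0          # propose a candidate
        target = p_target if x == 1 else 1 - p_target         # target pmf at x
        proposal = p_proposal if x == 1 else 1 - p_proposal   # proposal pmf at x
        if random.random() < target / (M * proposal):         # accept/reject test
            return x

# The empirical mean should be close to p_target despite the biased proposal.
samples = [rejection_sample_bernoulli(0.2) for _ in range(10_000)]
print(sum(samples) / len(samples))
```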
https://arxiv.org/abs/2506.09998
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation step as the external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained with the full-detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. We then propose the streaming content monitor (SCM), which is trained with dual supervision from response- and token-level labels and can follow the output stream of the LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
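A minimal sketch of the partial-detection loop described above, assuming a trained token-level scorer is available as a callable; the function names, threshold, and toy stand-in scorer below are placeholders, not the paper's SCM implementation.

```python
def stream_with_moderation(token_stream, score_prefix, threshold=0.5):
    """Follow the LLM's output stream token by token and early-stop
    as soon as the monitor judges the partial response harmful."""
    emitted = []
    for token in token_stream:
        emitted.append(token)
        if score_prefix(emitted) >= threshold:   # token-level harmfulness score
            return emitted, "blocked"            # stop generation midway
    return emitted, "allowed"                    # full response judged safe

# Toy usage with a stand-in scorer that never flags anything.
tokens = ["How", "to", "build", "a", "birdhouse"]
print(stream_with_moderation(tokens, lambda prefix: 0.0))
```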
https://arxiv.org/abs/2506.09996
Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
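To make the strategy concrete, here is a sketch of what such a step-by-step instruction and simplified CoNLL-U-like output format could look like; the exact wording, field set, and column order used in the paper may differ.

```python
def build_parsing_prompt(sentence: str) -> str:
    """Step-by-step parsing instruction: UPOS tagging first,
    then heads and dependency labels, in a tab-separated table."""
    return (
        "Step 1: Assign a Universal POS tag (UPOS) to every token.\n"
        "Step 2: Using those tags, predict each token's syntactic head (0 = root).\n"
        "Step 3: Predict the Universal Dependencies relation label for each token.\n"
        "Output one line per token: ID\tFORM\tUPOS\tHEAD\tDEPREL\n\n"
        f"Sentence: {sentence}\n"
    )

print(build_parsing_prompt("The cat sat on the mat ."))
```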
https://arxiv.org/abs/2506.09983
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
https://arxiv.org/abs/2506.09975
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by the specialized perception tools used in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
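As a rough illustration of the two elementary drawing operations named above, the snippet below renders them with PIL on a copy of the input image; the actual action format and renderer used by VILASR are not specified here, so treat this as an assumption-laden sketch.

```python
from PIL import Image, ImageDraw

def annotate_bbox(img: Image.Image, box, label=None, color="red") -> Image.Image:
    """Draw a bounding box (x0, y0, x1, y1) and an optional label on a copy of the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    draw.rectangle(box, outline=color, width=3)
    if label:
        draw.text((box[0], max(box[1] - 12, 0)), label, fill=color)
    return out

def draw_auxiliary_line(img: Image.Image, p1, p2, color="blue") -> Image.Image:
    """Draw an auxiliary line between two points, e.g. to compare positions or alignments."""
    out = img.copy()
    ImageDraw.Draw(out).line([p1, p2], fill=color, width=3)
    return out
```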
https://arxiv.org/abs/2506.09965
Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) in distinguishing between instructions and data in their inputs. Despite numerous defense proposals, systematic evaluation against adaptive adversaries remains limited, even though successful attacks can have wide security and privacy implications, and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research toward practical structural solutions to prompt injection.
https://arxiv.org/abs/2506.09956
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
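A minimal sketch of the two scoring steps, assuming access to the model's attention weights as a [layers, heads, seq, seq] array; the tensor layout, normalization, and head-selection details here are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def head_query_mass(attn, query_pos, ctx_pos):
    """attn: [n_layers, n_heads, seq, seq] attention weights for one example.
    Returns, per (layer, head), the attention mass flowing from query tokens
    to context tokens -- the aggregate used to rank query-focused heads."""
    return attn[:, :, query_pos][:, :, :, ctx_pos].sum(axis=(2, 3))

def chunk_retrieval_score(attn, qr_heads, query_pos, chunk_pos):
    """Retrieval score of one context chunk: accumulated attention mass
    from the selected QRHead (layer, head) pairs onto that chunk's tokens."""
    return sum(attn[l, h][np.ix_(query_pos, chunk_pos)].sum() for l, h in qr_heads)

# Toy usage: rank heads on random weights, then score a chunk with the top 2 heads.
attn = np.random.rand(4, 8, 16, 16)
mass = head_query_mass(attn, query_pos=[14, 15], ctx_pos=list(range(12)))
top2 = [tuple(np.unravel_index(i, mass.shape)) for i in np.argsort(mass, axis=None)[-2:]]
print(chunk_retrieval_score(attn, top2, [14, 15], list(range(6))))
```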
https://arxiv.org/abs/2506.09944
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, code, and models to facilitate future research at this https URL.
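The sketch below shows one way the two verification signals could be combined into an RL reward; the concrete constraint checks, the judge backend, and the hard AND combination are assumptions for illustration, not VerIF's exact recipe.

```python
def rule_check(response: str, constraints: dict) -> bool:
    """Code-verifiable ('hard') constraints, e.g. length or keyword requirements."""
    if "max_words" in constraints and len(response.split()) > constraints["max_words"]:
        return False
    if "must_include" in constraints and constraints["must_include"] not in response:
        return False
    return True

def verif_reward(response: str, constraints: dict, soft_instruction: str, llm_judge) -> float:
    """Combined reward: rule-based verification for checkable constraints plus an
    LLM judge (e.g. backed by a large reasoning model) for soft, semantic ones.
    `llm_judge(instruction, response)` is assumed to return True/False."""
    return float(rule_check(response, constraints) and llm_judge(soft_instruction, response))

# Toy usage with a permissive stand-in judge.
print(verif_reward("Sure, here are three tips ...", {"max_words": 50}, "Be polite.", lambda i, r: True))
```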
https://arxiv.org/abs/2506.09942
Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.
https://arxiv.org/abs/2506.09902
As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model's ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons, those that are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs' language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.
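A simplified sketch of how "language-related", "shared", and "exclusive" neurons could be identified from per-language activation statistics; the activation criterion and thresholds below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def classify_language_neurons(acts, act_thresh=0.0, freq_thresh=0.9):
    """acts: dict mapping language -> array [n_examples, n_neurons] of activations.
    A neuron counts as language-related for a language if it fires (above
    act_thresh) on at least freq_thresh of that language's examples; neurons
    related to several languages are 'shared', to exactly one 'exclusive'."""
    related = {
        lang: set(np.where((a > act_thresh).mean(axis=0) >= freq_thresh)[0])
        for lang, a in acts.items()
    }
    all_related = set().union(*related.values())
    shared = {n for n in all_related if sum(n in s for s in related.values()) > 1}
    return shared, all_related - shared

# Toy usage with random activations for two languages.
acts = {"en": np.random.randn(100, 512), "zh": np.random.randn(100, 512)}
shared, exclusive = classify_language_neurons(acts, freq_thresh=0.45)
print(len(shared), len(exclusive))
```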
https://arxiv.org/abs/2506.09890
We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
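One concrete instance of such a distributional distance is the kernel MMD between prompt and response hidden states; the fixed-bandwidth RBF kernel below is a simplification of the paper's deep learnable kernels, and the sign convention for turning the distance into a hallucination score is an assumption.

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two sets of hidden states x: [n, d] and y: [m, d]
    under a Gaussian (RBF) kernel with bandwidth sigma."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def hallucination_score(prompt_states, response_states):
    """Smaller prompt-response divergence was found to indicate hallucination,
    so the negative distance can serve as a (higher = more suspicious) score."""
    return -rbf_mmd2(prompt_states, response_states)

# Toy usage on random hidden states.
print(hallucination_score(torch.randn(20, 64), torch.randn(30, 64)).item())
```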
https://arxiv.org/abs/2506.09886
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
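For reference, the standard causal quantities the framework builds on are Pearl's probabilities of necessity and sufficiency; writing X = 1 for "the reasoning step is included" and Y = 1 for "the final answer is correct", they read as below (the paper's step-level conditioning may differ in detail).

```latex
\begin{align}
\mathrm{PN} &= P\!\left(Y_{X=0}=0 \,\middle|\, X=1,\; Y=1\right)
  && \text{(would the answer fail without the step?)}\\
\mathrm{PS} &= P\!\left(Y_{X=1}=1 \,\middle|\, X=0,\; Y=0\right)
  && \text{(would adding the step fix a failing answer?)}
\end{align}
```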
https://arxiv.org/abs/2506.09853
Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods attempting to detect this practice often only consider whether the semantics of the imagery correspond to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location-of-origin relevance (LOR) and date-and-time-of-origin relevance (DTOR), and present baseline results on six large language models (LLMs). We find that, while zero-shot performance on LOR is promising, performance on DTOR lags behind, leaving room for specialized architectures and future work.
https://arxiv.org/abs/2506.09847
Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multiple modalities and capabilities. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse and free-form, with arbitrary modalities and capabilities. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it into a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and online RL. Each stage contains specifically designed learning policies and rewards. Importantly, for the TBA-SFT and Nav-GRPO designs, we are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answering. We thus aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
https://arxiv.org/abs/2506.09839
Knowing how test takers answer items in educational assessments is essential for developing tests, evaluating item quality, and improving test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans on reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
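Temperature scaling here refers to the usual recalibration of the model's answer-option distribution; a minimal sketch follows (the option logits and temperature value are made up for illustration).

```python
import numpy as np

def temperature_scale(logits, T: float):
    """Softmax over answer-option logits with temperature T; T > 1 flattens an
    over-confident distribution toward a more human-like response distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

# An over-confident model before and after calibration (four answer options).
print(temperature_scale([5.0, 1.0, 0.5, 0.2], T=1.0))
print(temperature_scale([5.0, 1.0, 0.5, 0.2], T=3.0))
```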
https://arxiv.org/abs/2506.09796
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities, and open new paths and avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of multi-agent collaboration. We propose a conceptual framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss future perspectives to enhance and fully realize ID 4.0's potential, including more complex design scenarios, more practical design implementations, novel agent coordination mechanisms, and autonomous design goal-setting with better human value alignment. In sum, these insights lay a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing increasingly complex design challenges.
https://arxiv.org/abs/2506.09755
In complex engineering systems, the interdependencies among components or development activities are often modeled and analyzed using Design Structure Matrix (DSM). Reorganizing elements within a DSM to minimize feedback loops and enhance modularity or process efficiency constitutes a challenging combinatorial optimization (CO) problem in engineering design and operations. As problem sizes increase and dependency networks become more intricate, traditional optimization methods that solely use mathematical heuristics often fail to capture the contextual nuances and struggle to deliver effective solutions. In this study, we explore the potential of Large Language Models (LLMs) for helping solve such CO problems by leveraging their capabilities for advanced reasoning and contextual understanding. We propose a novel LLM-based framework that integrates network topology with contextual domain knowledge for iterative optimization of DSM element sequencing - a common CO problem. Experiments on various DSM cases show that our method consistently achieves faster convergence and superior solution quality compared to both stochastic and deterministic baselines. Notably, we find that incorporating contextual domain knowledge significantly enhances optimization performance regardless of the chosen LLM backbone. These findings highlight the potential of LLMs to solve complex engineering CO problems by combining semantic and mathematical reasoning. This approach paves the way towards a new paradigm in LLM-based engineering design optimization.
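As a concrete anchor for the optimization objective, the sketch below counts feedback dependencies for a candidate element ordering; the dependency convention (rows depend on columns) and the toy matrix are assumptions, and the LLM-based framework would iteratively propose reorderings that reduce this count while reasoning over domain context.

```python
import numpy as np

def feedback_count(dsm: np.ndarray, order) -> int:
    """Count feedback marks for a given sequencing of DSM elements.
    Convention assumed here: dsm[i, j] = 1 means element i depends on element j;
    the dependency is 'feedback' if provider j is sequenced after consumer i."""
    pos = {e: k for k, e in enumerate(order)}
    n = dsm.shape[0]
    return sum(
        int(dsm[i, j] and pos[j] > pos[i])
        for i in range(n) for j in range(n) if i != j
    )

# Toy 3-element DSM: element 0 depends on 2, and 2 depends on 1.
dsm = np.array([[0, 0, 1],
                [0, 0, 0],
                [0, 1, 0]])
print(feedback_count(dsm, [0, 1, 2]), feedback_count(dsm, [1, 2, 0]))  # 1 0
```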
https://arxiv.org/abs/2506.09749
Monitoring Machine Learning (ML) models in production environments is crucial, yet traditional approaches often yield verbose, low-interpretability outputs that hinder effective decision-making. We propose a cognitive architecture for ML monitoring that applies feature engineering principles to agents based on Large Language Models (LLMs), significantly enhancing the interpretability of monitoring outputs. Central to our approach is a Decision Procedure module that simulates feature engineering through three key steps: Refactor, Break Down, and Compile. The Refactor step improves data representation to better capture feature semantics, allowing the LLM to focus on salient aspects of the monitoring data while reducing noise and irrelevant information. Break Down decomposes complex information for detailed analysis, and Compile integrates sub-insights into clear, interpretable outputs. This process leads to a more deterministic planning approach, reducing dependence on LLM-generated planning, which can sometimes be inconsistent and overly general. The combination of feature engineering-driven planning and selective LLM utilization results in a robust decision support system, capable of providing highly interpretable and actionable insights. Experiments using multiple LLMs demonstrate the efficacy of our approach, achieving significantly higher accuracy compared to various baselines across several domains.
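A schematic sketch of the three-step Decision Procedure on a toy monitoring payload; the feature names, metrics, and prompts are placeholders, and `llm` stands in for whichever model backs the agent.

```python
def decision_procedure(raw_metrics: dict, llm) -> str:
    """Refactor -> Break Down -> Compile over raw monitoring data."""
    # Refactor: reshape raw metrics into a few semantically named features.
    refactored = {
        "accuracy_drop": raw_metrics["baseline_acc"] - raw_metrics["current_acc"],
        "population_drift": raw_metrics["psi"],
    }
    # Break Down: analyze each engineered feature in isolation.
    sub_insights = [
        llm(f"Interpret this monitoring feature for an ML model: {name} = {value:.3f}")
        for name, value in refactored.items()
    ]
    # Compile: merge the sub-insights into one interpretable, actionable report.
    return llm("Compile these findings into a short actionable summary:\n" + "\n".join(sub_insights))

# Toy usage with an echoing stand-in for the LLM.
report = decision_procedure({"baseline_acc": 0.91, "current_acc": 0.84, "psi": 0.31}, llm=lambda p: p)
print(report)
```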
https://arxiv.org/abs/2506.09742
Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation. These can be easily integrated into existing post-training pipelines, including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at this https URL.
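A rough image-level sketch of what the three perturbations could look like with PIL; the blend ratio, rotation range, and layout choices are assumptions rather than the paper's settings.

```python
import random
from PIL import Image

def distractor_concat(img: Image.Image, distractor: Image.Image) -> Image.Image:
    """Concatenate a resized distractor image to the right of the original."""
    d = distractor.resize(img.size).convert("RGB")
    canvas = Image.new("RGB", (img.width + d.width, img.height))
    canvas.paste(img.convert("RGB"), (0, 0))
    canvas.paste(d, (img.width, 0))
    return canvas

def dominance_preserving_mixup(img: Image.Image, other: Image.Image, alpha: float = 0.8) -> Image.Image:
    """Blend in a second image while keeping the original dominant (alpha > 0.5)."""
    return Image.blend(other.resize(img.size).convert("RGB"), img.convert("RGB"), alpha)

def random_rotation(img: Image.Image, max_deg: float = 30.0) -> Image.Image:
    """Rotate by a small random angle, expanding the canvas to avoid cropping."""
    return img.rotate(random.uniform(-max_deg, max_deg), expand=True)
```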
https://arxiv.org/abs/2506.09736
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements -- identifying their start and stop times -- directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases -- therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) -- are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
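For clarity on the prediction target and metric, here is a small sketch of mapping normalized boundary offsets within a 30-second window back to absolute session times and computing MAE; the numbers are invented for illustration.

```python
def to_absolute(window_start: float, window_len: float, norm_start: float, norm_end: float):
    """Map normalized (0-1) boundary offsets within a window to absolute times in seconds."""
    return (window_start + norm_start * window_len,
            window_start + norm_end * window_len)

def mae_seconds(pred_boundaries, ref_boundaries) -> float:
    """Mean absolute error over predicted vs. reference start/stop times (seconds)."""
    errors = [abs(p - r)
              for (ps, pe), (rs, re) in zip(pred_boundaries, ref_boundaries)
              for p, r in ((ps, rs), (pe, re))]
    return sum(errors) / len(errors)

pred = [to_absolute(120.0, 30.0, 0.10, 0.80)]   # -> (123.0, 144.0)
ref = [(124.0, 143.0)]
print(mae_seconds(pred, ref))                    # 1.0
```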
https://arxiv.org/abs/2506.09707