We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer) to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
https://arxiv.org/abs/2601.16206
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
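The core operation behind a feature-space smoothing defense of this kind can be sketched with a toy encoder: average the features of Gaussian-perturbed inputs and compare clean versus attacked representations by cosine similarity. The encoder, noise scale, and perturbation below are illustrative stand-ins, not the paper's actual FS construction or its certified bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Toy stand-in for a frozen feature encoder: fixed linear map + tanh.
    W = np.linspace(-1.0, 1.0, 4 * 8).reshape(4, 8)
    return np.tanh(x @ W)

def smoothed_encoder(x, sigma=0.5, n_samples=256):
    # Monte-Carlo estimate of the smoothed feature: E[f(x + N(0, sigma^2 I))].
    noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[-1]))
    return encoder(x[None, :] + noise).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x_clean = rng.normal(size=4)
delta = rng.normal(size=4)
delta = 0.25 * delta / np.linalg.norm(delta)      # l2-bounded perturbation

sim_raw = cosine(encoder(x_clean), encoder(x_clean + delta))
sim_smooth = cosine(smoothed_encoder(x_clean), smoothed_encoder(x_clean + delta))
print(f"raw cosine: {sim_raw:.3f}  smoothed cosine: {sim_smooth:.3f}")
```

Because the smoothed feature is an expectation over Gaussian noise, it varies slowly with the input, which is what makes a lower bound on clean-vs-adversarial cosine similarity derivable in the first place.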
https://arxiv.org/abs/2601.16200
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
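The search-versus-learning distinction can be illustrated on a toy continuous-reward problem: instead of only sampling from a frozen proposal (pure search), the proposal itself is updated toward the single best solution found so far. This is a hypothetical stand-in for the idea of prioritizing the most promising solution, not TTT-Discover's actual RL algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(x):
    # Toy continuous-reward objective standing in for, e.g., a kernel-latency score.
    return -float(np.sum((x - 3.0) ** 2))

mean, sigma = np.zeros(2), 1.0                # "policy" = a proposal distribution
best_x, best_r = mean.copy(), reward(mean)

for step in range(60):
    candidates = mean + sigma * rng.normal(size=(32, 2))
    rewards = np.array([reward(c) for c in candidates])
    top = candidates[int(np.argmax(rewards))]  # prioritize the most promising
    if rewards.max() > best_r:
        best_x, best_r = top.copy(), float(rewards.max())
    mean = mean + 0.5 * (best_x - mean)        # "train" toward the single best solution
    sigma *= 0.97                              # anneal exploration

print(f"best reward {best_r:.4f} at {np.round(best_x, 2)}")
```

The update toward the incumbent best, rather than a reward-weighted average of all samples, mirrors the stated goal of producing one great solution instead of many good ones on average.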
https://arxiv.org/abs/2601.16175
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data were sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
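A tournament of pairwise judgments can be simulated with a rating update. The snippet below uses a plain Elo update as a simplified stand-in for the Glicko2 system the study employed, with made-up template names and latent win rates.

```python
import random

random.seed(0)

templates = ["strategic_reading", "socratic", "scaffolded"]
# Hypothetical latent strengths; judges prefer the stronger template in
# proportion to relative strength.
strength = {"strategic_reading": 0.9, "socratic": 0.5, "scaffolded": 0.4}
rating = {t: 1500.0 for t in templates}
K = 32.0

def expected(ra, rb):
    # Elo expected score of player a against player b.
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

for _ in range(500):
    a, b = random.sample(templates, 2)
    p_a = strength[a] / (strength[a] + strength[b])
    score_a = 1.0 if random.random() < p_a else 0.0
    ea = expected(rating[a], rating[b])
    rating[a] += K * (score_a - ea)
    rating[b] += K * ((1.0 - score_a) - (1.0 - ea))

ranked = sorted(templates, key=rating.get, reverse=True)
print(ranked)
```

Glicko2 additionally tracks a rating deviation and volatility per player, which matters when, as here, each template plays a limited number of judged matches.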
https://arxiv.org/abs/2601.16134
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
https://arxiv.org/abs/2601.16130
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50\%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60\%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
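The maintenance pattern described here, update one language's checkpoint and re-merge rather than retrain everything, can be sketched with task-vector averaging on toy weight vectors. Task-vector averaging is a common merging recipe used for illustration; the paper's exact merging algorithm may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=16)                       # shared base model weights (toy)

# Each language "expert" = base + a small fine-tuning delta (task vector).
experts = {lang: base + rng.normal(scale=0.1, size=16) for lang in ("en", "de", "ja")}

def merge(base, experts, weights=None):
    langs = list(experts)
    if weights is None:
        weights = {l: 1.0 / len(langs) for l in langs}
    deltas = np.stack([experts[l] - base for l in langs])
    w = np.array([weights[l] for l in langs])
    return base + (w[:, None] * deltas).sum(axis=0)

merged_v1 = merge(base, experts)

# Maintenance: only the German expert is retrained, then cheaply re-merged.
experts["de"] = base + rng.normal(scale=0.1, size=16)
merged_v2 = merge(base, experts)
print(float(np.linalg.norm(merged_v2 - merged_v1)))
```

The cost asymmetry is the point: the re-merge touches one fine-tuned checkpoint plus an averaging pass, while full multilingual retraining would revisit every language's data.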
https://arxiv.org/abs/2601.16127
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
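A two-stage residual vector quantizer, the general mechanism by which an embedding can be compressed into two discrete tokens, can be sketched as follows. The codebooks, dimensions, and "mask embedding" are random toy stand-ins, not SAMTok's trained components.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, codebook_size = 8, 16
# Two random toy codebooks; a trained tokenizer would learn these.
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(2)]

def rvq_encode(v):
    tokens, residual = [], v.copy()
    for cb in codebooks:                 # quantize, then quantize the residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens):
    # Reconstruction = sum of the selected codewords from each stage.
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

mask_embedding = rng.normal(size=dim)    # stand-in for an encoded region mask
tokens = rvq_encode(mask_embedding)
recon = rvq_decode(tokens)
print(tokens, float(np.linalg.norm(mask_embedding - recon)))
```

Because each stage emits one index, the whole region collapses to a short fixed-length token sequence that an MLLM can emit with ordinary next-token prediction.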
https://arxiv.org/abs/2601.16093
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
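The first- and second-order update rules are straightforward to sketch on a 3-d VAD vector. Parameter values and the single-impulse event sequence below are illustrative, not the paper's protocol.

```python
import numpy as np

def first_order(signals, alpha=0.3):
    # Exponential smoothing: the state relaxes toward each instantaneous signal.
    state, traj = np.zeros(3), []
    for s in signals:
        state = (1 - alpha) * state + alpha * s
        traj.append(state.copy())
    return np.array(traj)

def second_order(signals, alpha=0.3, momentum=0.7):
    # Momentum adds affective inertia: the state keeps moving after the signal stops.
    state, velocity, traj = np.zeros(3), np.zeros(3), []
    for s in signals:
        velocity = momentum * velocity + alpha * (s - state)
        state = state + velocity
        traj.append(state.copy())
    return np.array(traj)

# One negative VAD impulse at turn 5, neutral (memoryless) signals otherwise.
signals = [np.zeros(3)] * 5 + [np.array([-1.0, 0.8, -0.5])] + [np.zeros(3)] * 19
traj1, traj2 = first_order(signals), second_order(signals)
print("first-order valence:", np.round(traj1[4:9, 0], 3))
print("second-order valence:", np.round(traj2[4:9, 0], 3))
```

The first-order state begins decaying back to neutral the turn after the impulse, while the second-order state continues to fall for a turn before recovering, which is the inertia/hysteresis trade-off the abstract describes.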
https://arxiv.org/abs/2601.16087
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at this https URL.
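Embedding-based one-shot exemplar selection reduces to a nearest-neighbor lookup over exemplar questions before prompt assembly. The character-trigram "embedder", the exemplars, and the reaction-graph schema below are made up for illustration; a real pipeline would use a sentence encoder and the actual knowledge-graph schema.

```python
import numpy as np

def embed(text, dim=64):
    # Toy embedding: hashed character-trigram counts, L2-normalized.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

exemplars = [
    ("Which reactions produce aspirin?",
     "MATCH (r:Reaction)-[:PRODUCES]->(m:Molecule {name: 'aspirin'}) RETURN r"),
    ("What reagents does reaction R1 use?",
     "MATCH (r:Reaction {id: 'R1'})-[:USES]->(m:Molecule) RETURN m"),
]

def build_prompt(question):
    # Pick the exemplar whose question embeds closest to the user query.
    sims = [embed(question) @ embed(q) for q, _ in exemplars]
    q, cypher = exemplars[int(np.argmax(sims))]
    return (f"Translate the question into Cypher.\n"
            f"Example question: {q}\nExample query: {cypher}\n"
            f"Question: {question}\nQuery:")

prompt = build_prompt("Which reactions produce ibuprofen?")
print(prompt)
```

An aligned exemplar (here, the structurally matching "produce" query) gives the model a template whose graph pattern transfers to the new question, which is the intuition behind the paper's finding that aligned one-shot prompting performs best.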
https://arxiv.org/abs/2601.16038
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared ``recipe'' of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
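The intervention being transferred is directional ablation: removing the component of each hidden state along a (reconstructed) refusal direction. A minimal numpy sketch, with a random unit vector standing in for a real refusal direction:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 32

refusal_dir = rng.normal(size=d_model)       # stand-in for a reconstructed direction
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(hidden, direction):
    # Project out the component of each hidden state along the unit direction:
    # h <- h - (h . d) d
    return hidden - np.outer(hidden @ direction, direction)

hidden_states = rng.normal(size=(5, d_model))    # e.g. one residual-stream vector per token
ablated = ablate(hidden_states, refusal_dir)

print("components along direction after ablation:",
      np.round(ablated @ refusal_dir, 8))
```

The transfer problem the paper addresses is precisely that `refusal_dir` is not known for the target model; it is reconstructed from a shared recipe of concept atoms rather than from target-side refusal supervision.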
https://arxiv.org/abs/2601.16034
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
https://arxiv.org/abs/2601.16027
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent MLLMs and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
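A law-consistency check of the kind the VG track implies can be as simple as testing whether a predicted center-of-mass trajectory keeps constant velocity in the absence of external forces (Newton's first law). The tolerance and the two trajectories below are illustrative, not PhysicsMind's actual scoring.

```python
import numpy as np

def obeys_first_law(com_positions, dt=1.0, tol=1e-3):
    # Finite-difference velocities; under no net force they should all match
    # the initial velocity (constant-velocity center of mass).
    velocities = np.diff(com_positions, axis=0) / dt
    return bool(np.all(np.abs(velocities - velocities[0]) < tol))

t = np.arange(10)[:, None]
good = np.hstack([0.5 * t, 0.2 * t])              # constant-velocity COM track
bad = np.hstack([0.5 * t, 0.5 * 9.81 * t ** 2])   # accelerating COM track

print(obeys_first_law(good), obeys_first_law(bad))
```

Checks of this shape score the physics of the motion directly, rather than the perceptual quality of the rendered frames.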
https://arxiv.org/abs/2601.16007
Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost-sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored challenge in real-world deployments: how to merge generative recommenders specialized to different real-world contexts, arising from temporally evolving user behaviors and heterogeneous application domains. To this end, we propose a unified framework MMGRid, a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine-tuned on context-specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task-aware and context-specific parameter changes via base model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context-dependent interaction characteristics, offering practical guidance for weight selection in real-world deployments.
https://arxiv.org/abs/2601.15930
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at the policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
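The instantiation step can be sketched as a (here untrained) linear hypernetwork that maps one text embedding to the weights of a tiny state-to-action policy; after that, no language model is involved in the control loop. The dimensions and the linear-policy form are assumptions for illustration, not TeNet's architecture.

```python
import numpy as np

rng = np.random.default_rng(5)
emb_dim, state_dim, action_dim = 12, 4, 2
n_weights = state_dim * action_dim + action_dim   # linear policy: W and b

# Hypernetwork: one linear map here (TeNet's is trained; this one is random).
H = rng.normal(scale=0.1, size=(emb_dim, n_weights))

def instantiate_policy(text_embedding):
    flat = text_embedding @ H                     # language is used exactly once
    W = flat[:state_dim * action_dim].reshape(state_dim, action_dim)
    b = flat[state_dim * action_dim:]
    def policy(state):                            # runs on low-dim states alone
        return np.tanh(state @ W + b)
    return policy

task_embedding = rng.normal(size=emb_dim)  # stand-in for an LLM embedding of an instruction
policy = instantiate_policy(task_embedding)
action = policy(rng.normal(size=state_dim))
print(action)
```

At control time the generated policy is just a few matrix multiplies, which is what makes high-frequency, resource-constrained deployment plausible.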
https://arxiv.org/abs/2601.15912
Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Finally, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and, through data augmentation, benefits low-resource coding languages.
https://arxiv.org/abs/2601.15892
Code completion has become a central task, gaining significant attention with the rise of large language model (LLM)-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not advanced equally. Most current benchmarks focus solely on functional correctness of code completions based on given context, overlooking models' ability to follow user instructions during completion, a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench), comprising 2,195 carefully designed completion tasks. Through comprehensive evaluation of over 40 mainstream LLMs across C3-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities, establishing new directions for future research in code LLMs. To facilitate reproducibility and foster further research in code LLMs, we open-source all code, datasets, and models.
https://arxiv.org/abs/2601.15879
This paper introduces the Generative Application Firewall (GAF), a new architectural layer for securing LLM applications. Existing defenses -- prompt filters, guardrails, and data-masking -- remain fragmented; GAF unifies them into a single enforcement point, much like a WAF coordinates defenses for web traffic, while also covering autonomous agents and their tool interactions.
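A single enforcement point can be pictured as one ordered chain of checks that every request passes through, the way a WAF centralizes web-traffic rules. The check names, rules, and event format below are hypothetical, not part of any real GAF implementation.

```python
import re

def prompt_filter(event):
    # Crude injection check; a real filter would be far more sophisticated.
    if "ignore previous instructions" in event["text"].lower():
        event["verdict"] = "block"
    return event

def mask_pii(event):
    # Data masking: redact SSN-shaped strings before they reach the model.
    event["text"] = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", event["text"])
    return event

def tool_policy(event):
    # Agent coverage: only allow-listed tool calls pass through.
    allowed = {"search", "calculator"}
    if event.get("tool") and event["tool"] not in allowed:
        event["verdict"] = "block"
    return event

FIREWALL_CHAIN = [prompt_filter, mask_pii, tool_policy]

def enforce(event):
    event = dict(event, verdict="allow")
    for check in FIREWALL_CHAIN:
        event = check(event)
        if event["verdict"] == "block":
            break
    return event

print(enforce({"text": "My SSN is 123-45-6789", "tool": "search"}))
print(enforce({"text": "please ignore previous instructions"}))
```

The unification argument is architectural: the same chain inspects prompts, responses, and agent tool calls, so policy lives in one place instead of being scattered across ad-hoc filters.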
https://arxiv.org/abs/2601.15824
Adaptive traffic signal control (TSC) has demonstrated strong effectiveness in managing dynamic traffic flows. However, conventional methods often struggle when unforeseen traffic incidents occur (e.g., accidents and road maintenance), which typically require labor-intensive and inefficient manual interventions by traffic police officers. Large Language Models (LLMs) appear to be a promising solution thanks to their remarkable reasoning and generalization capabilities. Nevertheless, existing works often propose to replace existing TSC systems with LLM-based systems, which can be (i) unreliable due to the inherent hallucinations of LLMs and (ii) costly due to the need for system replacement. To address the issues of existing works, we propose a hierarchical framework that augments existing TSC systems with LLMs, whereby a virtual traffic police agent at the upper level dynamically fine-tunes selected parameters of signal controllers at the lower level in response to real-time traffic incidents. To enhance domain-specific reliability in response to unforeseen traffic incidents, we devise a self-refined traffic language retrieval system (TLRS), whereby retrieval-augmented generation is employed to draw knowledge from a tailored traffic language database that encompasses traffic conditions and controller operation principles. Moreover, we devise an LLM-based verifier to update the TLRS continuously over the reasoning process. Our results show that LLMs can serve as trustworthy virtual traffic police officers that can adapt conventional TSC methods to unforeseen traffic incidents with significantly improved operational efficiency and reliability.
https://arxiv.org/abs/2601.15816
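The two-level design above can be sketched as a minimal loop: an upper-level "traffic police" agent maps an incident report to bounded parameter adjustments on a lower-level signal controller, consulting a tiny keyword store that stands in for the paper's TLRS. Everything here is an illustrative assumption (the names `TLRS`, `SignalController`, `police_agent_step`, and the parameter set are hypothetical); the real system uses an LLM with retrieval-augmented generation rather than keyword lookup.

```python
# Hypothetical sketch of the hierarchical TSC framework: an upper-level
# agent tunes lower-level controller parameters in response to incidents,
# guided by a tiny retrieval store (stand-in for the paper's TLRS).

# Toy knowledge base: incident keyword -> recommended parameter deltas.
TLRS = {
    "accident":    {"green_extension_s": +10, "cycle_length_s": +20},
    "maintenance": {"green_extension_s": +5,  "cycle_length_s": +10},
}

class SignalController:
    """Lower-level adaptive controller with tunable parameters."""
    def __init__(self):
        self.params = {"green_extension_s": 5, "cycle_length_s": 90}

    def apply(self, deltas):
        for key, dv in deltas.items():
            # Adjust within a safe operating range instead of replacing
            # the controller outright (the paper's reliability argument).
            self.params[key] = max(0, min(180, self.params[key] + dv))

def retrieve(incident_report):
    """Keyword match as a stand-in for retrieval-augmented generation."""
    for keyword, deltas in TLRS.items():
        if keyword in incident_report.lower():
            return deltas
    return {}  # no match: leave the existing TSC system untouched

def police_agent_step(controller, incident_report):
    """Upper-level agent: retrieve knowledge, then fine-tune parameters."""
    controller.apply(retrieve(incident_report))
    return controller.params

controller = SignalController()
params = police_agent_step(controller, "Accident reported on northbound lane")
print(params)  # {'green_extension_s': 15, 'cycle_length_s': 110}
```

The key design point this illustrates is augmentation rather than replacement: when retrieval finds nothing, the agent makes no change and the conventional controller keeps operating as-is.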
Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap applies the same logic to any model or dataset. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors that reveals recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation: one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available, with plans to periodically update ErrorAtlas as new benchmarks and models emerge.
https://arxiv.org/abs/2601.15812
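The "failure signature" idea can be made concrete with a toy pipeline: label each failed item with an error type, then aggregate the labels into a normalized distribution per model. The category names and the heuristics in `classify_error` are placeholders of my own, not ErrorMap's actual taxonomy (the paper's ErrorAtlas has five major and thirteen sub-categories derived from model behavior, not string rules).

```python
from collections import Counter

def classify_error(item):
    """Rough illustrative heuristics mapping a failed item to an error type."""
    pred, gold = item["pred"], item["gold"]
    if pred.strip().rstrip(".").lower() == gold.strip().rstrip(".").lower():
        return "formatting"        # right content, wrong surface form
    if item.get("gold_is_wrong"):
        return "dataset_noise"     # the benchmark label itself is bad
    if len(pred.split()) < len(gold.split()) // 2:
        return "omitted_detail"    # answer misses required details
    return "reasoning"             # default bucket: genuine failure

def failure_signature(failed_items):
    """Normalized distribution over error types: the model's 'signature'."""
    counts = Counter(classify_error(it) for it in failed_items)
    total = sum(counts.values())
    return {k: round(v / total, 2) for k, v in counts.items()}

failures = [
    {"pred": "42.",  "gold": "42"},                       # formatting slip
    {"pred": "7",    "gold": "9", "gold_is_wrong": True}, # noisy label
    {"pred": "blue", "gold": "the sky is blue because of Rayleigh scattering"},
    {"pred": "17",   "gold": "19"},                       # real reasoning error
]
print(failure_signature(failures))
```

The point the sketch makes is the abstract's central claim: two models with the same accuracy can have very different signatures, and only the distribution over error causes tells you what to fix.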
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
https://arxiv.org/abs/2601.15808
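The verify-then-refine loop at the heart of this test-time scaling can be sketched in a few lines: a rubric-based verifier reports which checks a draft answer fails, and that feedback is fed back for another revision until everything passes or the budget runs out. The rubrics and the stubbed `refine` step are illustrative assumptions, not DeepVerifier's actual components (which use an LLM verifier built from a failure taxonomy).

```python
# Hypothetical sketch of rubric-based test-time self-evolution: verify a
# draft against rubrics, feed the failed rubrics back as feedback, and
# iterate. Rubrics and the stub refiner are illustrative placeholders.

RUBRICS = [
    ("cites_source", lambda a: "source:" in a),
    ("gives_number", lambda a: any(ch.isdigit() for ch in a)),
]

def verify(answer):
    """Return (passed, feedback): feedback lists the failed rubric names."""
    failed = [name for name, check in RUBRICS if not check(answer)]
    return (not failed, failed)

def refine(answer, feedback):
    """Stub for the policy model revising its answer from rubric feedback."""
    if "cites_source" in feedback:
        answer += " source: report-2023"
    if "gives_number" in feedback:
        answer += " value: 8"
    return answer

def self_evolve(draft, max_rounds=3):
    """Iterative bootstrapping: no training, only repeated verification."""
    answer = draft
    for _ in range(max_rounds):
        passed, feedback = verify(answer)
        if passed:
            return answer, True
        answer = refine(answer, feedback)
    return answer, verify(answer)[0]

final, ok = self_evolve("The dataset grew last year.")
print(ok)  # True
```

This shape exploits the asymmetry the abstract names: checking an answer against explicit rubrics is much easier than producing a correct answer outright, so even a cheap verifier can steer the generator.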