We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, existing ZS-CAR models increasingly ignore visual evidence and overfit to co-occurrence statistics. Consequently, they fail to realize the benefit of compositional recognition on unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolution.
https://arxiv.org/abs/2601.16210
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
https://arxiv.org/abs/2601.16206
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
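The objective can be pictured with a minimal, hedged sketch on a 1-d logistic model. This is not the paper's implementation: the function name `counterfactual_training_loss`, the weight `lam`, and the counterfactual pair `(x_cf, y_cf)` are illustrative, standing in for "minimize the divergence between learned representations and plausible, actionable explanations".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def counterfactual_training_loss(w, b, x, y, x_cf, y_cf, lam=1.0):
    """Cross-entropy on the factual point (x, y), plus a weighted term that
    pushes the model to assign the counterfactual input x_cf its desired
    label y_cf, so counterfactuals are honored during training."""
    p = sigmoid(w * x + b)
    task = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    p_cf = sigmoid(w * x_cf + b)
    cf = -(y_cf * math.log(p_cf) + (1 - y_cf) * math.log(1 - p_cf))
    return task + lam * cf
```

Setting `lam=0` recovers ordinary training; increasing it trades task fit for counterfactual alignment.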
https://arxiv.org/abs/2601.16205
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
https://arxiv.org/abs/2601.16175
State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
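The intervention is just a fixed prompt schedule, which can be sketched as below. The skeleton strings here are invented placeholders (the paper's actual 15 skeletons are not listed in the abstract), and the prompt format is an assumption:

```python
# Placeholder skeletons; the real schedule uses 15 fixed tactic skeletons.
TACTIC_SKELETONS = [
    "intro h; simp", "nlinarith", "norm_num", "ring_nf", "omega",
]

def prompt_schedule(theorem_statement, k=16):
    """Build k prompts for the same theorem, cycling through the fixed
    skeleton list in order, so each sample is steered toward one proof shape."""
    return [
        f"{theorem_statement}\n-- attempt a proof shaped like: "
        f"{TACTIC_SKELETONS[i % len(TACTIC_SKELETONS)]}"
        for i in range(k)
    ]
```

Each of the k=16 samples then uses one prompt from the schedule, with the same decoding budget as plain sampling.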
https://arxiv.org/abs/2601.16172
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at this https URL
https://arxiv.org/abs/2601.16163
Modern data systems increasingly operate under conditions of persistent legal, political, and analytic disagreement. In such settings, interoperability cannot rely on shared interpretation, negotiated semantics, or centralized authority. Instead, representations must function as neutral substrates that preserve stable reference across incompatible extensions. This paper investigates the structural constraints imposed on ontological design by this requirement. Building on a neutrality framework that treats interpretive non-commitment and stability under extension as explicit design constraints, we ask what minimal ontological structure is forced if accountability relationships are to remain referable and comparable under disagreement. Minimality here is not mere parsimony: a reduction is admissible only if it does not reintroduce stability-critical distinctions as hidden roles, flags, or contextual predicates. We establish a conditional lower-bound result: any ontology capable of supporting accountability under persistent disagreement must realize at least six distinct identity-and-persistence regimes. We further show that a construction with exactly six such regimes is sufficient to satisfy the stated requirements without embedding causal or normative commitments in the substrate. The result is not a proposal for a universal ontology, but a constraint on what is possible when neutrality and stable reference are treated as non-negotiable design goals.
https://arxiv.org/abs/2601.16152
Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
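The FF curriculum can be sketched as a masking schedule over training steps. The linear decay shape and the parameter names are assumptions; the abstract only states that all harmony tokens stay masked for several steps before the sequence is progressively unmasked:

```python
def ff_mask_ratio(step, warmup, total_steps):
    """Fraction of harmony tokens kept masked at a given training step:
    fully masked during the warmup phase, then linearly unmasked."""
    if step < warmup:
        return 1.0  # full-to-full phase: every harmony token masked
    span = max(total_steps - warmup, 1)
    return max(0.0, 1.0 - (step - warmup) / span)
```

At each step, the training batch would mask `ff_mask_ratio(step, ...)` of the harmony tokens while the melody stays visible, forcing the model to lean on melodic context early on.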
https://arxiv.org/abs/2601.16150
Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
https://arxiv.org/abs/2601.16140
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
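The study scores prompts pairwise with the Glicko2 rating system; as a simplified stand-in (not the paper's code), the sketch below uses the plain Elo update, which shares the same logistic expected-score form. Ratings and the K-factor are illustrative:

```python
def expected_score(r_a, r_b):
    """Probability that prompt A beats prompt B under a logistic rating model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome_a, k=32.0):
    """Update both ratings after one judged pair.
    outcome_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome_a - e_a), r_b - k * (outcome_a - e_a)
```

Running every judged question pair through such updates yields the per-template ratings from which pairwise win probabilities are read off.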
https://arxiv.org/abs/2601.16134
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
https://arxiv.org/abs/2601.16130
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50\%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60\%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
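The maintenance pattern can be sketched as parameter averaging over per-language fine-tunes of a shared base; real merging methods are more elaborate, and `merge_models` plus the dict-of-lists parameter format are assumptions for illustration:

```python
def merge_models(models):
    """Average a list of parameter dicts (name -> list of floats), i.e.
    merge several per-language fine-tunes into one multilingual model."""
    merged = {}
    for name in models[0]:
        cols = zip(*(m[name] for m in models))
        merged[name] = [sum(c) / len(models) for c in cols]
    return merged

# Updating one language then re-merging retrains only that single model:
# merged = merge_models([de_model, fr_model, updated_es_model])
```

This is where the maintenance saving comes from: refreshing one language touches one fine-tune plus a cheap merge, not a full multilingual retraining run.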
https://arxiv.org/abs/2601.16127
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
https://arxiv.org/abs/2601.16108
Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements, which arrive one at a time as points in a finite metric space, must be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point's location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points' locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope for beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.
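The per-arrival decision can be pictured with a hedged, greedy 1-d sketch (not the paper's algorithm): assign a point to a near-enough existing cluster, otherwise postpone, paying delay, until a budget forces a new cluster to open. `radius` and `budget` are invented thresholds:

```python
def assign_or_delay(point, centers, radius, waited, budget):
    """Decide the fate of a pending point in a 1-d metric (abs distance).
    Returns ('assign', cluster_index), ('open', None), or ('delay', None)."""
    if centers:
        d, i = min((abs(point - c), i) for i, c in enumerate(centers))
        if d <= radius:
            return ("assign", i)  # close enough: pay distance cost now
    if waited >= budget:
        return ("open", None)     # delay budget spent: open a new cluster
    return ("delay", None)        # keep waiting, accruing delay cost
```

The trade-off the paper formalizes is exactly this tension: waiting longer may reveal a better cluster for the point, but every postponed step adds delay cost.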
https://arxiv.org/abs/2601.16091
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
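The first- and second-order update rules can be sketched directly on a 3-d VAD vector; the coefficients `alpha` and `beta` are illustrative, matching the abstract's exponential smoothing and momentum descriptions:

```python
def first_order(state, signal, alpha=0.2):
    """Exponential smoothing: state <- (1 - alpha) * state + alpha * signal."""
    return [(1 - alpha) * s + alpha * x for s, x in zip(state, signal)]

def second_order(state, velocity, signal, alpha=0.2, beta=0.8):
    """Momentum update: velocity accumulates the pull toward the signal,
    producing the affective inertia/hysteresis that grows with beta."""
    velocity = [beta * v + alpha * (x - s)
                for v, s, x in zip(velocity, state, signal)]
    state = [s + v for s, v in zip(state, velocity)]
    return state, velocity
```

At each turn, a memoryless estimator would produce `signal` from the latest utterance, and the smoothed `state` is what gets injected back into generation.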
https://arxiv.org/abs/2601.16087
Computing the conditional mode of a distribution, better known as the $\mathit{maximum\ a\ posteriori}$ (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce $\mathit{probably\ approximately\ correct}$ (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.
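The PAC flavor of the guarantee can be sketched with plain best-of-n sampling, a simplification of the paper's circuit-based solvers: since a single draw misses the top-`eps` probability mass with probability `1 - eps`, drawing `n >= ln(1/delta) / eps` samples lands the best one in that mass with probability at least `1 - delta` (because `(1 - eps)^n <= exp(-eps * n) <= delta`):

```python
import math

def pac_sample_size(eps, delta):
    """Samples needed so the best draw hits the top-eps probability mass
    with probability at least 1 - delta."""
    return math.ceil(math.log(1.0 / delta) / eps)

def pac_map(sample, score, eps, delta):
    """Draw a PAC-sized batch from the posterior sampler and keep the best."""
    n = pac_sample_size(eps, delta)
    return max((sample() for _ in range(n)), key=score)
```

The paper's variable-budget setting corresponds to tightening `eps`/`delta`, and hence `n`, as compute allows.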
https://arxiv.org/abs/2601.16083
Designing faster algorithms for solving Mixed-Integer Linear Programming (MILP) problems is highly desired across numerous practical domains, as a vast array of complex real-world challenges can be effectively modeled as MILP formulations. Solving these problems typically employs the branch-and-bound algorithm, the core of which can be conceived as searching for a path of nodes (or sub-problems) that contains the optimal solution to the original MILP problem. Traditional approaches to finding this path rely heavily on hand-crafted, intuition-based heuristic strategies, which often suffer from unstable and unpredictable performance across different MILP problem instances. To address this limitation, we introduce DeepBound, a deep learning-based node selection algorithm that automates the learning of such human intuition from data. The core of DeepBound lies in learning to prioritize nodes containing the optimal solution, thereby improving solving efficiency. DeepBound introduces a multi-level feature fusion network to capture the node representations. To tackle the inherent node imbalance in branch-and-bound trees, DeepBound employs a pairwise training paradigm that enhances the model's ability to discriminate between nodes. Extensive experiments on three NP-hard MILP benchmarks demonstrate that DeepBound achieves superior solving efficiency over conventional heuristic rules and existing learning-based approaches, obtaining optimal feasible solutions with significantly reduced computation time. Moreover, DeepBound demonstrates strong generalization capability on large and complex instances. The analysis of its learned features reveals that the method can automatically discover more flexible and robust feature selection, which may effectively improve and potentially replace human-designed heuristic rules.
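The pairwise training paradigm can be sketched as a logistic ranking loss over node scores; the two scalar inputs stand in for the multi-level feature fusion network's outputs, and the function name is illustrative:

```python
import math

def pairwise_ranking_loss(score_pos, score_neg):
    """Logistic ranking loss for a node pair: small when the node on the
    optimal-solution path (score_pos) outscores the other node."""
    return math.log(1.0 + math.exp(score_neg - score_pos))
```

Training on pairs rather than isolated nodes sidesteps the class imbalance of branch-and-bound trees, where optimal-path nodes are a tiny minority.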
https://arxiv.org/abs/2601.16056
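The pairwise training paradigm described above can be sketched as a margin-based ranking objective: rather than classifying each node independently (where on-path nodes are vastly outnumbered), the model is trained on (optimal, non-optimal) node pairs. The functions below are a hedged illustration with hypothetical names, not DeepBound's actual multi-level feature fusion network or loss.

```python
def pairwise_hinge_loss(score_opt: float, score_other: float,
                        margin: float = 1.0) -> float:
    # Penalize pairs where the node on the optimal path is not ranked
    # above its non-optimal counterpart by at least `margin`.
    return max(0.0, margin - (score_opt - score_other))

def batch_pairwise_loss(pairs, margin: float = 1.0) -> float:
    # Average hinge loss over (optimal, non-optimal) score pairs sampled
    # from the branch-and-bound tree; pairing sidesteps the heavy class
    # imbalance between on-path and off-path nodes.
    return sum(pairwise_hinge_loss(p, n, margin) for p, n in pairs) / len(pairs)
```

During node selection, the learned scorer would then rank open nodes and expand the highest-scoring one first.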
Accurate prediction of crop above-ground biomass (AGB) under water stress is critical for monitoring crop productivity, guiding irrigation, and supporting climate-resilient agriculture. Data-driven models scale well but often lack interpretability and degrade under distribution shift, whereas process-based crop models (e.g. DSSAT, APSIM, LINTUL5) require extensive calibration and are difficult to deploy over large spatial domains. To address these limitations, we propose AgriPINN, a process-informed neural network that integrates a biophysical crop-growth differential equation as a differentiable constraint within a deep learning backbone. This design encourages physiologically consistent biomass dynamics under water-stress conditions while preserving model scalability for spatially distributed AGB prediction. AgriPINN recovers latent physiological variables, including leaf area index (LAI), absorbed photosynthetically active radiation (PAR), radiation use efficiency (RUE), and water-stress factors, without requiring direct supervision. We pretrain AgriPINN on 60 years of historical data across 397 regions in Germany and fine-tune it on three years of field experiments under controlled water treatments. Results show that AgriPINN consistently outperforms state-of-the-art deep-learning baselines (ConvLSTM-ViT, SLTF, CNN-Transformer) and the process-based LINTUL5 model in terms of accuracy (RMSE reductions up to $43\%$) and computational efficiency. By combining the scalability of deep learning with the biophysical rigor of process-based modeling, AgriPINN provides a robust and interpretable framework for spatio-temporal AGB prediction, offering practical value for planning of irrigation infrastructure, yield forecasting, and climate-adaptation planning.
https://arxiv.org/abs/2601.16045
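The process-informed constraint can be illustrated as a physics-informed loss: a data-fitting term plus a penalty on the finite-difference residual of a crop-growth rate law. The abstract does not give the exact equation, so a generic LINTUL-style form dB/dt = RUE · PAR · f_water is assumed here purely for illustration; the coefficients and function names are hypothetical.

```python
def process_residual(biomass, par, rue, f_water, dt=1.0):
    # Mean squared residual of the assumed growth law
    # dB/dt = RUE * PAR * f_water, via forward finite differences.
    res = 0.0
    for t in range(len(biomass) - 1):
        dBdt = (biomass[t + 1] - biomass[t]) / dt
        res += (dBdt - rue * par[t] * f_water[t]) ** 2
    return res / (len(biomass) - 1)

def pinn_loss(pred, obs, par, rue, f_water, lam=0.1):
    # Data loss + lam * process residual: predictions must both fit the
    # AGB observations and obey the biophysical rate law.
    data = sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)
    return data + lam * process_residual(pred, par, rue, f_water)
```

In the paper's setting the residual would be evaluated on the network's differentiable outputs, letting the latent variables (LAI, PAR, RUE, water stress) be recovered without direct supervision.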
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at this https URL.
https://arxiv.org/abs/2601.16038
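The embedding-based exemplar selection for one-shot Text2Cypher prompting can be sketched as: score each (question, Cypher) exemplar by similarity to the incoming question, then prepend the best match to the prompt. A bag-of-words cosine stands in for the paper's actual embedding model, and the exemplar texts below are invented for illustration.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    # Toy stand-in for an embedding similarity: cosine over word counts.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_one_shot_prompt(question, exemplars):
    # Pick the most similar (question, cypher) exemplar and prepend it,
    # giving the LLM an aligned demonstration of the target query shape.
    ex_q, ex_cypher = max(exemplars, key=lambda e: bow_cosine(question, e[0]))
    return (f"Example question: {ex_q}\nExample Cypher: {ex_cypher}\n\n"
            f"Question: {question}\nCypher:")

exemplars = [
    ("List reactions producing aspirin",
     "MATCH (r:Reaction)-[:PRODUCES]->(:Molecule {name: 'aspirin'}) RETURN r"),
    ("Count all catalysts",
     "MATCH (c:Catalyst) RETURN count(c)"),
]
prompt = build_one_shot_prompt("Which reactions produce ibuprofen?", exemplars)
```

The finding that aligned exemplars beat static or random ones corresponds to the `max` over similarity here; swapping it for `random.choice` would give the random-exemplar baseline.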