Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video), conveying appearance and motion consistent with the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
https://arxiv.org/abs/2512.13690
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data points. Project website: this https URL
https://arxiv.org/abs/2512.13678
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
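The embedding-based alignment check that the study benchmarks can be pictured as a cosine-similarity threshold over resource and outcome embeddings. The sketch below is a minimal illustration under assumed inputs: the function names and the threshold value are hypothetical, and the embeddings would in practice come from a model such as the Voyage embedder mentioned above.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_aligned(resource_emb, outcome_emb, threshold=0.5):
    # Declare a resource aligned with a learning outcome when the
    # similarity of their embeddings clears a tuned threshold.
    return cosine_similarity(resource_emb, outcome_emb) >= threshold
```

In a real pipeline the threshold would be tuned on human-labeled alignment judgments, as the paper does with its human-generated benchmark.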
https://arxiv.org/abs/2512.13658
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.
https://arxiv.org/abs/2512.13654
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
https://arxiv.org/abs/2512.13644
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE), and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios, such as Defocus Blur and Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
https://arxiv.org/abs/2512.13641
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
https://arxiv.org/abs/2512.13607
Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts: these cancer types were rarely used for pretraining, or the specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT) specimens, which are essential for diagnosing muscle-invasive bladder cancer (MIBC) but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is available at this https URL.
https://arxiv.org/abs/2512.13600
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
https://arxiv.org/abs/2512.13586
In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)}/(\sqrt{n}J\epsilon) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(\epsilon, \delta\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.
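The per-node mechanism behind such $(\epsilon, \delta)$-DP guarantees is typically per-sample gradient clipping followed by calibrated Gaussian noise. The sketch below shows that standard step under assumed parameter names; it omits DP-CSGP's push-sum aggregation and communication compression, which the paper layers on top.

```python
import numpy as np

def privatize_gradient(grad, clip_norm, noise_multiplier, rng):
    # Clip to bound per-sample sensitivity, then add Gaussian noise whose
    # scale is proportional to that sensitivity bound.
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise
```

The noise multiplier is what ties the update to the $(\epsilon, \delta)$ budget; larger multipliers give stronger privacy at the cost of utility, which is the trade-off the stated utility bound quantifies.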
https://arxiv.org/abs/2512.13583
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
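The entropy-based count of effective features can be illustrated as the perplexity (exponentiated Shannon entropy) of the normalized activation mass: a uniform spread over $k$ features yields $k$ effective features, a one-hot activation yields 1. This is a minimal single-vector sketch under assumed input conventions; the paper aggregates over sparse-autoencoder activations across data, which this omits.

```python
import numpy as np

def effective_features(activations, eps=1e-12):
    # activations: nonnegative SAE feature activations, shape (num_features,)
    p = np.abs(activations)
    p = p / (p.sum() + eps)                  # normalize to a distribution
    entropy = -np.sum(p * np.log(p + eps))   # Shannon entropy in nats
    return float(np.exp(entropy))            # perplexity = effective count
```

Comparing this count to the number of actual neurons indicates how many "virtual neurons" the representation simulates through superposition.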
https://arxiv.org/abs/2512.13568
Memory has emerged as, and will continue to be, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
https://arxiv.org/abs/2512.13564
Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation structure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling approach that encodes each post in a discourse with its stance signal and aggregates reply embeddings by stance category, enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in predicting the truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platform generalization.
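Aggregating reply embeddings by stance category can be sketched as mean-pooling within each stance bucket and concatenating the pooled vectors with the stance distribution covariate. The stance label set below is the conventional rumor-stance taxonomy assumed for illustration; function and variable names are hypothetical, not the paper's.

```python
import numpy as np

STANCES = ("support", "deny", "query", "comment")

def stance_aggregate(reply_embs, reply_stances, dim):
    # Mean-pool reply embeddings within each stance category, then append
    # the stance distribution so the model sees stance imbalance directly.
    pooled, counts = [], []
    for s in STANCES:
        idx = [i for i, st in enumerate(reply_stances) if st == s]
        counts.append(len(idx))
        pooled.append(np.mean([reply_embs[i] for i in idx], axis=0) if idx
                      else np.zeros(dim))
    dist = np.array(counts, dtype=float) / max(sum(counts), 1)
    return np.concatenate(pooled + [dist])
```

Because the thread representation is fixed-size regardless of reply count, it sidesteps the encoder sequence-length constraint noted above.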
https://arxiv.org/abs/2512.13559
Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode mental health with state-of-the-art biomarker precision. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings collected from 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, identifies individuals living with mental illness, and achieves near-ceiling accuracy when predicting group-level mental health. By extracting non-verbal signatures of psychological function that have so far remained untapped, MAILA represents a key step toward foundation models for mental health. The ability to decode mental states at zero marginal cost creates new opportunities in neuroscience, medicine, and public health, while raising urgent questions about privacy, agency, and autonomy online.
https://arxiv.org/abs/2511.20179
Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at this https URL.
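The three-part Clinical Reasoning Procedure Reward can be pictured as a weighted sum over a predicted versus gold evidence graph. The decomposition below (node recall for coverage, edge precision for structural correctness, edge recall for chain completeness) is one plausible reading for illustration; the names, weights, and exact definitions are assumptions, not MedCEG's specification.

```python
def reasoning_reward(pred_nodes, pred_edges, gold_nodes, gold_edges,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    # Node Coverage: fraction of gold evidence nodes the reasoning mentions.
    node_cov = len(set(pred_nodes) & set(gold_nodes)) / max(len(gold_nodes), 1)
    # Structural Correctness: fraction of predicted edges that are valid.
    edge_prec = (len(set(pred_edges) & set(gold_edges)) / len(pred_edges)
                 if pred_edges else 0.0)
    # Chain Completeness: fraction of the gold chain that is reproduced.
    chain_rec = len(set(pred_edges) & set(gold_edges)) / max(len(gold_edges), 1)
    return weights[0] * node_cov + weights[1] * edge_prec + weights[2] * chain_rec
```

A scalar of this form can serve directly as the process reward in reinforcement learning, scoring the reasoning trace rather than only the final answer.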
https://arxiv.org/abs/2512.13510
In recent years, hierarchical case-based-reasoning models of precedential constraint have been proposed. In various papers, Trevor Bench-Capon criticised these models on the grounds that they would give incorrect outcomes in some cases. In particular, the models would not account for the possibility that intermediate factors are established with different strengths by different base-level factors. In this paper we respond to these criticisms for van Woerkom's result-based hierarchical models. We argue that in some examples Bench-Capon seems to interpret intermediate factors as dimensions, and that applying van Woerkom's dimension-based version of the hierarchical result model to these examples avoids Bench-Capon's criticisms.
https://arxiv.org/abs/2512.13505
Machine learning based intrusion detection systems are increasingly targeted by black box adversarial attacks, where attackers craft evasive inputs using indirect feedback such as binary outputs or behavioral signals like response time and resource usage. While several defenses have been proposed, including input transformation, adversarial training, and surrogate detection, they often fall short in practice. Most are tailored to specific attack types, require internal model access, or rely on static mechanisms that fail to generalize across evolving attack strategies. Furthermore, defenses such as input transformation can degrade intrusion detection system performance, making them unsuitable for real time deployment. To address these limitations, we propose Adaptive Feature Poisoning, a lightweight and proactive defense mechanism designed specifically for realistic black box scenarios. Adaptive Feature Poisoning assumes that probing can occur silently and continuously, and introduces dynamic and context aware perturbations to selected traffic features, corrupting the attacker feedback loop without impacting detection capabilities. The method leverages traffic profiling, change point detection, and adaptive scaling to selectively perturb features that an attacker is likely exploiting, based on observed deviations. We evaluate Adaptive Feature Poisoning against multiple realistic adversarial attack strategies, including silent probing, transferability based attacks, and decision boundary based attacks. The results demonstrate its ability to confuse attackers, degrade attack effectiveness, and preserve detection performance. By offering a generalizable, attack agnostic, and undetectable defense, Adaptive Feature Poisoning represents a significant step toward practical and robust adversarial resilience in machine learning based intrusion detection systems.
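The selective perturbation step can be sketched as a z-score test against a learned traffic profile: only features whose deviation suggests active probing get noised, so normal detection is untouched. Parameter names and the 3-sigma threshold are illustrative assumptions; the full method also uses change-point detection and adaptive scaling, which this sketch omits.

```python
import numpy as np

def poison_features(features, baseline_mean, baseline_std, rng,
                    z_threshold=3.0, scale=0.5):
    # Flag features that deviate sharply from the profiled baseline, then
    # perturb only those, corrupting the attacker's feedback loop.
    z = np.abs(features - baseline_mean) / (baseline_std + 1e-12)
    suspicious = z > z_threshold
    noise = rng.normal(0.0, scale * baseline_std, size=features.shape)
    return np.where(suspicious, features + noise, features)
```

Because benign traffic rarely crosses the deviation threshold, the detector's view of normal inputs is preserved while probed features return noisy, unreliable feedback.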
https://arxiv.org/abs/2512.13501
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLMs more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches, achieving a 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
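The intra-layer shared-projection idea can be illustrated with a joint SVD: weight matrices that consume the same input (e.g., the query/key/value projections of one attention layer) are stacked and factorized once, so they all reuse a single down-projection. This is a toy sketch of the shared-basis idea, not SkipCat's actual procedure, and the function name is hypothetical.

```python
import numpy as np

def shared_lowrank_factorize(W_list, rank):
    # Stack matrices that share the same input and take one SVD, so every
    # matrix reuses the same right projection P as its low-rank basis.
    stacked = np.vstack(W_list)                  # (sum_of_outputs, d_in)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    P = Vt[:rank]                                # shared projection, (rank, d_in)
    ups, start = [], 0
    for W in W_list:
        rows = W.shape[0]
        ups.append(U[start:start + rows, :rank] * S[:rank])  # per-matrix up-proj
        start += rows
    return P, ups  # W_list[i] is approximated by ups[i] @ P
```

Storing one `P` instead of a separate down-projection per matrix is what frees budget for a higher retained rank at the same compression rate.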
https://arxiv.org/abs/2512.13494
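The intra-layer shared projection described above can be illustrated with a parameter count. In the sketch below (my own reading of the abstract, not the paper's code), three attention matrices that consume the same input share one down-projection, so at equal rank the shared variant stores and computes less than three independent factorizations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 64  # hidden size, retained rank

# Naive per-matrix low-rank factorization W ≈ U @ V costs 2*d*r params
# per matrix; Q, K, V together need three such pairs.
naive_params = 3 * 2 * d * r

# Shared-projection variant: one common down-projection P (d x r) for all
# three matrices, plus a separate up-projection per matrix.
P = rng.standard_normal((d, r)) / np.sqrt(d)
Uq, Uk, Uv = (rng.standard_normal((r, d)) / np.sqrt(r) for _ in range(3))
shared_params = d * r + 3 * r * d

x = rng.standard_normal((1, d))
h = x @ P                           # shared projection computed once...
q, k, v = h @ Uq, h @ Uk, h @ Uv    # ...then reused by all three matrices

print(naive_params, shared_params)  # shared variant stores 2/3 the params
```

The same budget can therefore fund a higher rank r, which is the trade-off the abstract describes; the block-skipping component would further drop selected sub-blocks of these products.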
Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether and under what conditions they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) A point allocation game that tests whether a model tries to win over its peer. (2) A workplace setting that observes behavior when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
https://arxiv.org/abs/2512.13481
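One way to score the point-allocation setting above is to ask whether a model pays a cost to itself in order to pull its peer down, relative to a self-interested baseline. The metric below is purely illustrative (the function name and formula are my own, not the paper's):

```python
def envy_score(baseline, observed):
    """Toy envy metric for a point-allocation game: positive only when the
    model both sacrifices its own points AND reduces its peer's, relative
    to a self-interested baseline allocation."""
    self_loss = baseline["self"] - observed["self"]
    peer_cut = baseline["peer"] - observed["peer"]
    return min(self_loss, peer_cut) if self_loss > 0 and peer_cut > 0 else 0

# A model that equalizes outcomes by pulling the peer down:
print(envy_score({"self": 10, "peer": 10}, {"self": 8, "peer": 5}))   # 2
# A purely self-interested model:
print(envy_score({"self": 10, "peer": 10}, {"self": 12, "peer": 10}))  # 0
```

Under such a metric, the equalizing behavior attributed to GPT-5-mini and Claude-3.7-Sonnet would score positive, while Mistral-Small-3.2-24B's gain-maximizing play would score zero.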
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
https://arxiv.org/abs/2512.13478
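The separation of representation from resolution in NRR can be sketched numerically: keep several candidate meaning vectors per token, reweight them without letting any collapse to zero, and commit only when the resolution operator ρ is invoked. The sketch below is my own minimal reading of the three components (the update rule and all values are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3  # embedding dim, interpretations kept per token

# Multi-vector embedding: the token carries k candidate meaning vectors
# and a weight over them, rather than one collapsed vector.
token_vecs = rng.standard_normal((k, d))
weights = np.full(k, 1.0 / k)

def soft_update(weights, scores, temperature=2.0):
    """Non-collapsing reweighting: a high temperature keeps probability
    mass spread across interpretations, so none is discarded mid-inference
    (a stand-in for the paper's non-collapsing attention)."""
    p = weights * np.exp(scores / temperature)
    return p / p.sum()

def resolve(weights, vecs):
    """The explicit resolution operator rho: commitment to a single
    meaning happens only here, when a task actually demands it."""
    return vecs[int(np.argmax(weights))]

context_scores = np.array([0.9, 1.1, 0.2])  # evidence accumulated so far
weights = soft_update(weights, context_scores)
meaning = resolve(weights, token_vecs)      # deferred, task-dependent commit
print(np.round(weights, 3))
```

Because every interpretation keeps nonzero weight until `resolve` is called, the same state can serve creative generation (sample from `weights`) or factual answering (argmax), which is the controllability the abstract attributes to ρ.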