The distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces has fundamental implications for computational algorithms. While these concepts coincide in finite dimensions, the Dvoretzky-Rogers theorem establishes their strict separation in general Banach spaces. We present a comprehensive characterization theorem unifying seven equivalent conditions for unconditional convergence, including permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These theoretical results directly inform algorithmic stability analysis, governing permutation invariance in gradient accumulation for Stochastic Gradient Descent and justifying coefficient thresholding in frame-based signal processing. Our work bridges classical functional analysis with contemporary computational practice, providing rigorous foundations for order-independent and numerically robust summation processes.
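To make the order-dependence concrete, here is a scalar sketch (not from the paper) contrasting a conditionally convergent series with one of its rearrangements: the alternating harmonic series sums to ln 2, while the classical two-positives-then-one-negative rearrangement of exactly the same terms converges to (3/2) ln 2, so the sum is not permutation-invariant.

```python
import math

def alternating_harmonic(n_terms):
    # Partial sum of (-1)^(k+1)/k for k = 1..n_terms; converges to ln 2.
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))

def rearranged(n_blocks):
    # Rearrangement: two positive terms, then one negative term, repeated.
    # Classical result: this converges to ln 2 + (1/2) ln 2 = 1.5 * ln 2.
    total, pos, neg = 0.0, 1, 2   # next odd (positive) / even (negative) denominators
    for _ in range(n_blocks):
        total += 1.0 / pos; pos += 2
        total += 1.0 / pos; pos += 2
        total -= 1.0 / neg; neg += 2
    return total

s_orig = alternating_harmonic(200000)   # ~ ln 2  ~ 0.6931
s_rearr = rearranged(100000)            # ~ 1.5 ln 2 ~ 1.0397
```

In finite dimensions this cannot happen for unconditionally convergent sums, which is precisely the property that makes gradient accumulation order-independent.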
https://arxiv.org/abs/2601.08512
Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs, and that these SDHs generalize to out-of-distribution prompts.
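A minimal numpy sketch of the mechanism, under the simplifying assumptions of exactly rank-one queries/keys and a standard full-frequency RoPE (the paper restricts to medium and high frequencies): because the rotated score then depends only on the relative position i - j, every sufficiently long row of the causal score matrix peaks at the same offset, producing a slash sub-diagonal.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive coordinate pairs of x by position-dependent angles.
    half = len(x) // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

rng = np.random.default_rng(0)
d, T = 32, 64
q = rng.standard_normal(d)   # one shared query vector: "almost rank-one" queries
k = rng.standard_normal(d)   # one shared key vector: "almost rank-one" keys

scores = np.full((T, T), -np.inf)
for i in range(T):
    for j in range(i + 1):   # causal attention
        scores[i, j] = rope(q, i) @ rope(k, j)

# RoPE's relative-position property: the score depends only on i - j, so every
# row long enough to contain the peak shares the same offset Delta (the slash).
f = np.array([rope(q, delta) @ rope(k, 0) for delta in range(T)])
delta = int(np.argmax(f))
row_offsets = [i - int(np.argmax(scores[i])) for i in range(delta, T)]
```

The trained heads in the paper additionally require the learned frequency profile; this sketch only shows why position-independent queries/keys force a fixed-offset pattern.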
https://arxiv.org/abs/2601.08297
Low-cost inertial measurement units (IMUs) are widely utilized in mobile robot localization due to their affordability and ease of integration. However, their complex, nonlinear, and time-varying noise characteristics often lead to significant degradation in localization accuracy when applied directly for dead reckoning. To overcome this limitation, we propose a novel brain-inspired state estimation framework that combines a spiking neural network (SNN) with an invariant extended Kalman filter (InEKF). The SNN is designed to extract motion-related features from long sequences of IMU data affected by substantial random noise and is trained via a surrogate gradient descent algorithm to enable dynamic adaptation of the covariance noise parameter within the InEKF. By fusing the SNN output with raw IMU measurements, the proposed method enhances the robustness and accuracy of pose estimation. Extensive experiments conducted on the KITTI dataset and real-world data collected using a mobile robot equipped with a low-cost IMU demonstrate that the proposed approach outperforms state-of-the-art methods in localization accuracy and exhibits strong robustness to sensor noise, highlighting its potential for real-world mobile robot applications.
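The paper's InEKF operates on matrix Lie groups; as a minimal scalar sketch of the adaptive-covariance idea only, the following uses a standard 1-D Kalman measurement update, with `predicted_R` as a hypothetical stand-in for the SNN's covariance output: a larger predicted covariance downweights a noisy measurement.

```python
def kalman_update(x_pred, P_pred, z, R):
    # Scalar Kalman measurement update with an externally supplied noise covariance R.
    K = P_pred / (P_pred + R)          # Kalman gain
    x_post = x_pred + K * (z - x_pred)
    P_post = (1.0 - K) * P_pred
    return x_post, P_post, K

def predicted_R(noise_feature):
    # Hypothetical stand-in for the SNN: maps an IMU noise feature to a
    # positive measurement covariance (the paper learns this adaptation).
    return 0.1 + noise_feature ** 2

x_pred, P_pred, z = 0.0, 1.0, 2.0
x_clean, _, K_clean = kalman_update(x_pred, P_pred, z, predicted_R(0.1))  # low noise
x_noisy, _, K_noisy = kalman_update(x_pred, P_pred, z, predicted_R(3.0))  # high noise
```

With high predicted noise the gain shrinks and the estimate trusts the prediction; with low predicted noise it moves toward the measurement, which is the behavior the SNN-driven covariance enables.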
https://arxiv.org/abs/2601.08248
Machine unlearning enables data holders to remove the contribution of their specified samples from trained models to protect their privacy. However, it is paradoxical that most unlearning methods require the unlearning requesters to first upload their data to the server as a prerequisite for unlearning. These methods are infeasible in many privacy-preserving scenarios where servers are prohibited from accessing users' data, such as federated learning (FL). In this paper, we explore how to implement unlearning without revealing the data to be erased to the server. We propose \textbf{Blind Unlearning (BlindU)}, which carries out unlearning using compressed representations instead of original inputs. BlindU involves only the server and the unlearning user: the user locally generates privacy-preserving representations, and the server performs unlearning solely on these representations and their labels. For the FL model training, we employ the information bottleneck (IB) mechanism. The encoder of the IB-based FL model learns representations that maximally distort task-irrelevant information in the inputs, allowing FL users to generate compressed representations locally. For effective unlearning from compressed representations, BlindU integrates two dedicated unlearning modules tailored explicitly to IB-based models and uses a multiple gradient descent algorithm to balance forgetting and utility retention. While IB compression already protects the task-irrelevant information of inputs, to further enhance privacy protection, we introduce a noise-free differential privacy (DP) masking method applied to the raw data to be erased before compression. Theoretical analysis and extensive experimental results illustrate the superiority of BlindU in privacy protection and unlearning effectiveness compared with the best existing privacy-preserving unlearning benchmarks.
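The "multiple gradient descent algorithm" mentioned above has, in its two-objective form, a classical closed-form MGDA step. A sketch under that assumption (not BlindU's exact implementation): the minimum-norm point in the convex hull of the forgetting gradient and the retention gradient is, when nonzero, a common descent direction for both objectives.

```python
import numpy as np

def min_norm_combination(g1, g2):
    """Two-task MGDA step: minimize ||a*g1 + (1-a)*g2|| over a in [0, 1].
    The optimizer has the closed form a* = clip(((g2 - g1) @ g2) / ||g1 - g2||^2)."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:                  # gradients already agree
        return g1.copy()
    alpha = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return alpha * g1 + (1.0 - alpha) * g2

g_forget = np.array([1.0, 0.0])   # gradient of the forgetting objective (illustrative)
g_retain = np.array([0.5, 1.0])   # gradient of the utility-retention objective
d = min_norm_combination(g_forget, g_retain)
```

Stepping along `-d` (for minimization) decreases both losses simultaneously, which is how forgetting and utility retention can be balanced without a hand-tuned trade-off weight.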
https://arxiv.org/abs/2601.07214
Normalized difference indices have been a staple in remote sensing for decades. They stay reliable under lighting changes, produce bounded values, and connect well to biophysical signals. Even so, they are usually treated as a fixed pre-processing step with coefficients set to one, which limits how well they can adapt to a specific learning task. In this study, we introduce the Normalized Difference Layer, a differentiable neural network module. The proposed method keeps the classical idea but learns the band coefficients from data. We present a complete mathematical framework for integrating this layer into deep learning architectures, using softplus reparameterization to ensure positive coefficients and bounded denominators. We describe forward- and backward-pass algorithms enabling end-to-end training through backpropagation. This approach preserves the key benefits of normalized differences, namely illumination invariance and outputs bounded to $[-1,1]$, while allowing gradient descent to discover task-specific band weightings. We extend the method to work with signed inputs, so the layer can be stacked inside larger architectures. Experiments show that models using this layer reach similar classification accuracy to standard multilayer perceptrons while using about 75\% fewer parameters. They also handle multiplicative noise well: at 10\% noise, accuracy drops only 0.17\% versus 3.03\% for baseline MLPs. The learned coefficient patterns stay consistent across different depths.
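A minimal sketch of the layer's forward pass for the two-band, nonnegative-input case (the `theta` values are arbitrary illustrative parameters, not learned ones): softplus keeps the coefficients positive, the output stays in [-1, 1], and a multiplicative illumination change nearly cancels.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def nd_layer(x1, x2, theta1, theta2, eps=1e-6):
    """Normalized difference with learnable positive coefficients a, b.
    For nonnegative inputs the denominator is bounded away from zero,
    so the output lies in [-1, 1]."""
    a, b = softplus(theta1), softplus(theta2)
    return (a * x1 - b * x2) / (a * x1 + b * x2 + eps)

band1 = np.array([0.8, 0.3, 0.5])   # e.g. near-infrared reflectance
band2 = np.array([0.2, 0.6, 0.5])   # e.g. red reflectance
out = nd_layer(band1, band2, theta1=0.4, theta2=-1.3)
scaled = nd_layer(3.0 * band1, 3.0 * band2, theta1=0.4, theta2=-1.3)  # 3x "illumination"
```

With equal coefficients (a = b) this reduces to the classical normalized difference index for the two bands; gradient descent through `theta1`/`theta2` is what makes the weighting task-specific.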
https://arxiv.org/abs/2601.06777
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
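The constrained-optimization picture can be illustrated on an overparameterized linear toy problem (not the paper's two-layer network): starting from a zero-loss, memorizing solution, gradient descent with a small weight decay drifts along the zero-loss manifold toward the minimum-norm interpolant.

```python
import numpy as np

# Overparameterized problem: one data point, two weights.
# Every w with w1 + w2 = 2 has zero loss; the min-norm interpolant is (1, 1).
x, y = np.array([1.0, 1.0]), 2.0

def loss_grad(w):
    return x * (x @ w - y)

w = np.array([2.0, 0.0])          # an interpolating (zero-loss) point: memorization done
lr, wd = 0.1, 1e-2                # small weight decay, as in the paper's limit
for _ in range(20000):
    w = w - lr * (loss_grad(w) + wd * w)

residual = abs(x @ w - y)         # stays near zero: we remain on the manifold
```

The trajectory effectively minimizes ||w|| subject to zero loss (up to the small ridge shrinkage introduced by finite `wd`), mirroring the paper's claim about post-memorization dynamics in the infinitesimal learning-rate/weight-decay limit.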
https://arxiv.org/abs/2511.01938
Robots deployed in dynamic environments must remain safe even when key physical parameters are uncertain or change over time. We propose Parameter-Robust Model Predictive Path Integral (PRMPPI) control, a framework that integrates online parameter learning with probabilistic safety constraints. PRMPPI maintains a particle-based belief over parameters via Stein Variational Gradient Descent, evaluates safety constraints using Conformal Prediction, and optimizes both a nominal performance-driven and a safety-focused backup trajectory in parallel. This yields a controller that is cautious at first, improves performance as parameters are learned, and ensures safety throughout. Simulation and hardware experiments demonstrate higher success rates, lower tracking error, and more accurate parameter estimates than baselines.
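A sketch of the particle-belief component in isolation, assuming a 1-D standard-normal target in place of the physical-parameter posterior: each Stein Variational Gradient Descent step combines a kernel-weighted score term (attraction toward high density) with a kernel-gradient term (repulsion that keeps particles spread out).

```python
import numpy as np

def svgd_step(particles, score, h=0.5, step=0.05):
    """One SVGD update with an RBF kernel, 1-D particles."""
    n = len(particles)
    diff = particles[:, None] - particles[None, :]   # diff[j, i] = x_j - x_i
    K = np.exp(-diff ** 2 / (2 * h ** 2))            # k(x_j, x_i)
    gradK = -diff / h ** 2 * K                       # d/dx_j of k(x_j, x_i)
    phi = (K * score(particles)[:, None] + gradK).sum(axis=0) / n
    return particles + step * phi

rng = np.random.default_rng(1)
particles = rng.uniform(2.0, 4.0, size=50)   # badly initialized parameter belief
score = lambda x: -x                          # d/dx log p for a standard normal target
for _ in range(500):
    particles = svgd_step(particles, score)
```

The particle set migrates toward the target and spreads to roughly its width, which is the behavior PRMPPI relies on to refine its parameter belief online (the bandwidth, step size, and target here are illustrative choices).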
https://arxiv.org/abs/2601.02948
Robustness to malicious attacks is crucial for practical decentralized signal processing and machine learning systems. A typical example of such attacks is label poisoning, meaning that some agents possess corrupted local labels and share models trained on these poisoned data. To defend against malicious attacks, existing works often focus on designing robust aggregators; meanwhile, the weighted mean aggregator is typically considered a simple, vulnerable baseline. This paper analyzes the robustness of decentralized gradient descent under label poisoning attacks, considering both robust and weighted mean aggregators. Theoretical results reveal that the learning errors of robust aggregators depend on the network topology, whereas the performance of weighted mean aggregator is topology-independent. Remarkably, the weighted mean aggregator, although often considered vulnerable, can outperform robust aggregators under sufficient heterogeneity, particularly when: (i) the global contamination rate (i.e., the fraction of poisoned agents for the entire network) is smaller than the local contamination rate (i.e., the maximal fraction of poisoned neighbors for the regular agents); (ii) the network of regular agents is disconnected; or (iii) the network of regular agents is sparse and the local contamination rate is high. Empirical results support our theoretical findings, highlighting the important role of network topology in the robustness to label poisoning attacks.
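For concreteness, a sketch contrasting the two aggregator families on synthetic gradients (not the paper's decentralized setup): a coordinate-wise trimmed mean, a standard robust aggregator, discards the extreme values contributed by a poisoned agent, while the weighted mean absorbs them.

```python
import numpy as np

def weighted_mean(grads, weights):
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * grads).sum(axis=0) / weights.sum()

def trimmed_mean(grads, b):
    """Coordinate-wise trimmed mean: drop the b largest and b smallest
    entries in each coordinate before averaging."""
    s = np.sort(grads, axis=0)
    return s[b:len(grads) - b].mean(axis=0)

regular = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [1.1, 2.0]])
poisoned = np.array([[10.0, -10.0]])          # a label-poisoned agent's gradient
grads = np.vstack([regular, poisoned])

g_mean = weighted_mean(grads, np.ones(len(grads)))
g_trim = trimmed_mean(grads, b=1)
true = regular.mean(axis=0)                    # what honest aggregation should give
```

The paper's point is subtler: under sufficient heterogeneity and certain topologies the plain (weighted) mean can actually win, so this single-snapshot comparison only illustrates the conventional intuition the paper revisits.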
https://arxiv.org/abs/2601.02682
We present a theory-first framework that interprets inference-time adaptation in large language models (LLMs) as online Bayesian state estimation. Rather than modeling rapid adaptation as implicit optimization or meta-learning, we formulate task- and context-specific learning as the sequential inference of a low-dimensional latent adaptation state governed by a linearized state-space model. Under Gaussian assumptions, adaptation follows a Kalman recursion with closed-form updates for both the posterior mean and covariance. This perspective elevates epistemic uncertainty to an explicit dynamical variable. We show that inference-time learning is driven by covariance collapse, i.e., rapid contraction of posterior uncertainty induced by informative tokens, which typically precedes convergence of the posterior mean. Using observability conditions on token-level Jacobians, we establish stability of the Bayesian filter, prove exponential covariance contraction rates, and derive mean-square error bounds. Gradient descent, natural-gradient methods, and meta-learning updates arise as singular, noise-free limits of the filtering dynamics, positioning optimization-based adaptation as a degenerate approximation of Bayesian inference. The resulting theory provides a unified probabilistic account of in-context learning, parameter-efficient adaptation, and test-time learning without parameter updates. It yields explicit guarantees on stability and sample efficiency, offers a principled interpretation of prompt informativeness via information accumulation, and clarifies the role of uncertainty dynamics absent from existing accounts. Minimal illustrative experiments corroborate the qualitative predictions of the theory.
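The covariance-collapse claim can be sketched with the scalar version of the Kalman recursion, treating each token as one noisy observation of a static latent adaptation state (a deliberate simplification of the paper's linearized state-space model): the posterior variance contracts deterministically and monotonically, independent of the observed values.

```python
import numpy as np

def kalman_filter(zs, mu0, P0, R):
    """Scalar Kalman recursion for a static latent state observed in noise.
    Returns the posterior mean and variance after each observation."""
    mu, P = mu0, P0
    means, variances = [], []
    for z in zs:
        K = P / (P + R)              # gain
        mu = mu + K * (z - mu)       # posterior mean update
        P = (1.0 - K) * P            # posterior covariance contraction
        means.append(mu); variances.append(P)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(0)
s, R = 1.5, 0.25                      # latent adaptation state, observation noise
zs = s + np.sqrt(R) * rng.standard_normal(40)   # 40 informative "tokens"
means, variances = kalman_filter(zs, mu0=0.0, P0=4.0, R=R)
```

In closed form the variance follows P_t = 1 / (1/P_0 + t/R), so uncertainty collapses at a rate set by how informative the observations are, typically well before the mean has settled; this is the dynamical variable the paper elevates to first-class status.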
https://arxiv.org/abs/2601.06100
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
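The "product key" half of FwPKM follows the standard PKM lookup, sketched below (the dynamic fast-weight gradient updates are not shown): each query half is scored against n sub-keys, and combining the per-half top-k candidates recovers the exact top-k over the n*n implicit keys at O(n + k^2) scoring cost instead of O(n^2).

```python
import numpy as np

def product_key_topk(query, subkeys1, subkeys2, k=4):
    """Product-key lookup: the combined score of implicit key (i, j) is
    s1[i] + s2[j], and the global top-k must lie in the top-k x top-k grid."""
    d = len(query) // 2
    q1, q2 = query[:d], query[d:]
    s1, s2 = subkeys1 @ q1, subkeys2 @ q2          # n scores per half
    top1 = np.argsort(s1)[-k:]                     # candidate half-indices
    top2 = np.argsort(s2)[-k:]
    cand = [(s1[i] + s2[j], i * len(subkeys2) + j) for i in top1 for j in top2]
    cand.sort(reverse=True)
    return [i for _, i in cand[:k]], [s for s, _ in cand[:k]]

rng = np.random.default_rng(0)
n, d = 16, 8                                       # 16 * 16 = 256 implicit keys
subkeys1 = rng.standard_normal((n, d))
subkeys2 = rng.standard_normal((n, d))
query = rng.standard_normal(2 * d)
idx, scores = product_key_topk(query, subkeys1, subkeys2)
```

FwPKM's contribution is to make the values behind this sparse lookup fast weights, updated by chunk-level gradient descent at both training and inference time; the lookup itself is the standard PKM mechanism shown here.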
https://arxiv.org/abs/2601.00671
This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, and show that minimising the in-context loss leads to a phase transition in the learned parameters. Above a critical context length, the solution develops a skew-symmetric component. We prove this by reducing the forward pass of the linear transformer under weight tying to preconditioned gradient descent, and then analysing the optimal preconditioner. This preconditioner includes a skew-symmetric component, which induces a rotation of the gradient direction. For model collapse, we use martingale and random walk theory to analyse simplified settings - linear regression and Gaussian fitting - under both replacing and cumulative data regimes. We strengthen existing results by proving almost sure convergence, showing that collapse occurs unless the data grows sufficiently fast or is retained over time. Finally, we introduce the notion of context collapse: a degradation of context during long generations, especially in chain-of-thought reasoning. This concept links the dynamics of ICL with long-term stability challenges in generative models.
https://arxiv.org/abs/2601.00923
Despite recent progress, particularly in developing language models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model as a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy for designing more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam and SGD with Momentum, are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation of the memory system that generalizes the traditional long/short-term memory viewpoint. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.
https://arxiv.org/abs/2512.24695
Causal discovery from observational data remains fundamentally limited by identifiability constraints. Recent work has explored leveraging Large Language Models (LLMs) as sources of prior causal knowledge, but existing approaches rely on heuristic integration that lacks theoretical grounding. We introduce HOLOGRAPH, a framework that formalizes LLM-guided causal discovery through sheaf theory--representing local causal beliefs as sections of a presheaf over variable subsets. Our key insight is that coherent global causal structure corresponds to the existence of a global section, while topological obstructions manifest as non-vanishing sheaf cohomology. We propose the Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations while achieving competitive performance on causal discovery tasks with 50-100 variables. Our sheaf-theoretic analysis reveals that while Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections. Code is available at [this https URL](this https URL).
https://arxiv.org/abs/2512.24478
Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
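A minimal sketch of a forget-gated online gradient-descent memory write, in the spirit of (but much simpler than) Trellis's trained two-pass mechanism: each write is one gradient step on a key-value reconstruction loss, decayed by a forget gate, and repeated writes make the key retrieve its value.

```python
import numpy as np

def memory_update(M, k, v, lr=0.5, forget=0.95):
    """One online gradient-descent write with a forget gate.
    Loss: 0.5 * ||M @ k - v||^2, so the gradient wrt M is (M k - v) k^T."""
    grad = np.outer(M @ k - v, k)
    return forget * (M - lr * grad)

rng = np.random.default_rng(0)
d = 16
M = np.zeros((d, d))                       # fixed-size memory replacing the KV cache
k = rng.standard_normal(d); k /= np.linalg.norm(k)   # key (unit norm for clarity)
v = rng.standard_normal(d)                 # value to store

for _ in range(20):                        # repeated writes of one association
    M = memory_update(M, k, v)

recalled = M @ k                           # query the compressed memory
```

The forget gate bounds the memory's magnitude and lets stale associations decay, while the gradient step writes new ones; in Trellis the learning rate and gating are learned rather than fixed constants as here.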
https://arxiv.org/abs/2512.23852
Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain's energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint, Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures, and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated, highlighting the need for more dependable adversarial training methods.
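A toy sketch of an $L_\infty$-constrained PGD loop with an adaptive step size; the halve-on-stall rule and the quadratic objective below are illustrative stand-ins, not SA-PGD's actual update or an SNN loss.

```python
import numpy as np

def pgd_linf(x0, loss_fn, grad_fn, eps=0.3, steps=40, step0=0.1, shrink=0.5):
    """Untargeted PGD under an L-infinity budget eps, with a simple
    adaptive step size: halve the step whenever the loss stops increasing."""
    x, step = x0.copy(), step0
    best, best_loss = x0.copy(), loss_fn(x0)
    for _ in range(steps):
        # Ascent on the loss, then projection back into the L-inf ball.
        x_new = np.clip(x + step * np.sign(grad_fn(x)), x0 - eps, x0 + eps)
        if loss_fn(x_new) <= loss_fn(x):
            step *= shrink              # adapt on stalled progress
        x = x_new
        if loss_fn(x) > best_loss:
            best, best_loss = x.copy(), loss_fn(x)
    return best

# Toy attack objective standing in for a model's loss surface.
loss_fn = lambda x: float((x ** 2).sum())
grad_fn = lambda x: 2.0 * x

x0 = np.array([0.05, -0.02, 0.01])
x_adv = pgd_linf(x0, loss_fn, grad_fn)
```

The projection guarantees the perturbation never exceeds the budget, while the step-size adaptation is what SA-PGD refines to cope with the imprecise surrogate gradients of spiking networks.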
https://arxiv.org/abs/2512.22522
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process with kernel dynamics. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of the generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on the excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially in the computational cost ${\sf C}$. However, once a specific resource-allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.
https://arxiv.org/abs/2512.22088
Contemporary AI systems achieve extraordinary performance yet remain opaque and non-verifiable, creating a crisis of trust for safety-critical deployment. We introduce MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop. The system implements Reflexive Formal Learning (RFL), a symbolic analogue of gradient descent where updates are driven by verifier outcomes rather than statistical loss. Phase I experiments validate the measurement and governance substrate under controlled conditions. CAL-EXP-3 validates measurement infrastructure (Delta p computation, variance tracking); separate stress tests confirm fail-closed governance triggers correctly under out-of-bounds conditions. No convergence or capability claims are made. The contribution is infrastructural: a working prototype of ledger-attested learning that enables auditability at scale. Keywords: verifiable learning, formal verification, cryptographic attestation, reflexive feedback, fail-closed governance
https://arxiv.org/abs/2601.00816
We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.
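The Plücker-coordinate step can be sketched in isolation (the linear reduction and gated mixing are omitted): the coordinates p_ij = u_i v_j - u_j v_i identify the two-dimensional subspace span{u, v} independently of the chosen basis, up to scale and sign, which normalization removes.

```python
import numpy as np
from itertools import combinations

def plucker(u, v):
    """Unit-normalized Plucker coordinates of span{u, v} in R^n:
    p_ij = u_i * v_j - u_j * v_i for all i < j. A change of basis for the
    same plane rescales p by the determinant of the basis change, so after
    normalization the coordinates agree up to sign."""
    p = np.array([u[i] * v[j] - u[j] * v[i]
                  for i, j in combinations(range(len(u)), 2)])
    return p / np.linalg.norm(p)

u = np.array([1.0, 0.0, 2.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 3.0])
p1 = plucker(u, v)
p2 = plucker(2.0 * u + 0.5 * v, -v)   # a different basis for the same plane
```

This basis-invariance is what lets the Causal Grassmann layer treat a token pair as a point on a Grassmann manifold rather than as an ordered pair of vectors.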
https://arxiv.org/abs/2512.19428
As deep learning models grow more complex, it becomes increasingly difficult to understand how AI systems identify objects. An adversary can exploit this opacity by adding imperceptible elements to an image, confusing the AI's recognition of an entity. This paper therefore investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. Both models are tested against untargeted PGD (Projected Gradient Descent) attacks on the visual input modality and empirically evaluated on a subset of the Visual Question Answering (VQA) v2 dataset. The results of these adversarial attacks are quantified using the standard VQA accuracy metric, and the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision is compared. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack than LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
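Untargeted PGD follows a standard recipe: repeat sign-gradient ascent steps on the loss, projecting the perturbation back into an L-infinity ball around the clean input after each step. The sketch below is a toy scalar illustration under our own assumptions (an analytic quadratic loss stands in for the VLM; the real attack backpropagates through the vision encoder).

```python
def pgd_untargeted(x, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """Untargeted L-infinity PGD: take sign-gradient ascent steps on the
    loss, projecting the perturbation back into the eps-ball around x."""
    sign = lambda g: (g > 0) - (g < 0)
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]  # ascend
        x_adv = [min(max(xa, xc - eps), xc + eps)                    # project
                 for xa, xc in zip(x_adv, x)]
    return x_adv

# Toy surrogate: maximize the quadratic loss sum((x_i - t_i)^2).
target = [1.0, -1.0]
grad = lambda x: [2 * (xi - ti) for xi, ti in zip(x, target)]
adv = pgd_untargeted([0.0, 0.0], grad)
# No coordinate of adv ever strays more than eps from the clean input.
```

The "higher perturbation levels" in the evaluation correspond to larger values of the budget eps, which loosens the projection and permits stronger attacks.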
https://arxiv.org/abs/2512.17902
We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task-type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings:

1. Double dissociation: Task Schema transfers at 100% via late MLP patching, while Binding transfers at 62% via residual stream patching, proving separable mechanisms.
2. Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N = 28 task-model pairs).
3. Architecture generality: The mechanism operates across all tested architectures, including the non-Transformer Mamba.

These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient-descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable, not merely behavioral modes, we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail, and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering. Reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
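Activation patching, the core method used above, runs the model on one prompt while splicing in a hidden state cached from a run on another prompt, then checks whether the output changes causally. The toy scalar sketch below is our own simplification (real experiments patch activation tensors at specific layers of an LLM, e.g. late MLPs or the residual stream):

```python
def run_with_patch(layers, x, patch_at=None, patch_value=None):
    """Forward pass through a stack of layer functions, optionally
    overwriting the hidden state right after layer `patch_at`
    (activation patching), while caching every hidden state."""
    h, cache = x, []
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == patch_at:
            h = patch_value  # splice in an activation cached from another run
        cache.append(h)
    return h, cache

# Tiny stand-in "model": two scalar layers.
layers = [lambda h: h + 1, lambda h: h * 2]
clean_out, clean_cache = run_with_patch(layers, 3)   # "clean" prompt
corrupt_out, _ = run_with_patch(layers, 0)           # "corrupted" prompt
# Patch the clean layer-0 activation into the corrupted run:
patched_out, _ = run_with_patch(layers, 0, patch_at=0,
                                patch_value=clean_cache[0])
```

If the patched run recovers the clean output, the patched activation causally carries the relevant information; transfer rates like the 100% (Schema) and 62% (Binding) figures above summarize how often such recovery succeeds.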
https://arxiv.org/abs/2512.17325