A convolutional neural network (CNN) is a deep learning algorithm designed specifically for computer vision applications. CNNs have proved successful in handling the growing volume of data in many computer vision problems where classical machine learning algorithms fall short. Flowers have many uses in our daily lives, from decoration to medicine to detoxifying the environment. Identifying flower types requires expert knowledge, but access to experts at any time and place is not always feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers, giving non-specialists quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet-121, and Xception, to determine the most suitable model for the mobile application. The classification performance of each model was evaluated by training it with seven different optimization algorithms. The DenseNet-121 architecture trained with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
https://arxiv.org/abs/2601.15810
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
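The structured algorithm the model is proven to converge to can be sketched as a plain traversal loop that emits intermediate vertices one at a time, like a chain-of-thought trace. A toy illustration (the graph encoding and function names here are hypothetical, not the paper's exact construction):

```python
# Toy illustration of iterative vertex-by-vertex graph traversal with an
# emitted chain-of-thought trace. Each vertex has one outgoing edge, so a
# simple iterative solution exists, mirroring the synthetic task described.

def traverse(edges, start, goal):
    """Follow edges from start to goal, recording every intermediate vertex."""
    nxt = dict(edges)          # successor map: vertex -> next vertex
    chain = [start]
    while chain[-1] != goal:
        chain.append(nxt[chain[-1]])
    return chain

# A path graph 0 -> 1 -> 2 -> 3: the "reasoning trace" is the whole path.
cot = traverse([(0, 1), (1, 2), (2, 3)], start=0, goal=3)
```

A "simple example" in the paper's sense is one where this chain is short (few reasoning steps); longer chains test extrapolation.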
https://arxiv.org/abs/2601.15158
Traditional data masking techniques such as anonymization cannot achieve the expected privacy protection while preserving data utility for privacy-preserving machine learning. Synthetic data plays an increasingly important role, as it supplies a large number of training samples while preventing leakage of information in the real data. Existing methods suffer from repeated trade-off tuning between privacy and utility. We propose a novel framework for differentially private data generation, which employs an Error Feedback Stochastic Gradient Descent (EFSGD) method and introduces a reconstruction loss and a noise injection mechanism into the training process. We generate images of higher quality and usability under the same privacy budget as related work. Extensive experiments demonstrate the effectiveness and generalization of the proposed framework on both grayscale and RGB images. We achieve state-of-the-art results on almost all metrics across three benchmarks: MNIST, Fashion-MNIST, and CelebA.
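A minimal sketch of the error-feedback SGD idea named above, assuming a top-k sparsifier as the compression step (an assumption for illustration; the paper's DP training adds a reconstruction loss and noise injection on top, both omitted here):

```python
import numpy as np

# Error-feedback SGD sketch: the part of the gradient discarded by the
# compressor is stored as a residual and folded back into the next step,
# so no gradient information is permanently lost. Top-k sparsification is
# used here as an illustrative compressor.

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # keep the k largest-magnitude entries
    out[idx] = x[idx]
    return out

def efsgd_step(w, grad, err, lr=0.1, k=2):
    corrected = grad + err                # fold in last step's residual
    update = top_k(corrected, k)          # compressed update actually applied
    return w - lr * update, corrected - update

w = np.array([1.0, 1.0, 1.0, 1.0])
err = np.zeros(4)
grad = np.array([0.5, 0.1, -0.3, 0.05])
w, err = efsgd_step(w, grad, err)         # err now holds the dropped entries
```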
https://arxiv.org/abs/2601.15061
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
https://arxiv.org/abs/2601.13704
The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the underlying mechanisms of Muon -- particularly the role of gradient orthogonalization -- remain poorly understood, with very few works providing end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization, and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with iteration complexities independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon's effectiveness in these matrix optimization problems and potentially beyond.
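The spectral orthogonalization at Muon's core replaces the gradient by the orthogonal factor of its SVD, equalizing all singular values. A minimal numpy sketch (practical Muon approximates this with a Newton-Schulz iteration and adds momentum and scaling, all omitted here):

```python
import numpy as np

# Spectral orthogonalization of a gradient matrix G: return U V^T from the
# reduced SVD of G. All singular values of the result are 1, which is the
# preconditioning effect the theory formalizes.

def orthogonalize(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))                    # a generic full-rank gradient
O = orthogonalize(G)
sv = np.linalg.svd(O, compute_uv=False)        # all singular values equal 1
```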
https://arxiv.org/abs/2601.13474
Online continual learning (OCL) methods adapt to changing environments without forgetting past knowledge. Similarly, online time series forecasting (OTSF) is a real-world problem where data evolve in time and success depends on both rapid adaptation and long-term memory. Indeed, time-varying and regime-switching forecasting models have been extensively studied, offering a strong justification for the use of OCL in these settings. Building on recent work that applies OCL to OTSF, this paper aims to strengthen the theoretical and practical connections between time series methods and OCL. First, we reframe neural network optimization as a parameter filtering problem, showing that natural gradient descent is a score-driven method and proving its information-theoretic optimality. Then, we show that using a Student's t likelihood in addition to natural gradient induces a bounded update, which improves robustness to outliers. Finally, we introduce Natural Score-driven Replay (NatSR), which combines our robust optimizer with a replay buffer and a dynamic scale heuristic that improves fast adaptation at regime drifts. Empirical results demonstrate that NatSR achieves stronger forecasting performance than more complex state-of-the-art methods.
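The bounded update induced by a Student's t likelihood can be seen directly from the location score; a toy comparison against the Gaussian score, assuming unit scale and ν = 5 (illustrative values, not the paper's configuration):

```python
# The Gaussian location score grows linearly in the innovation, so one
# outlier can move the filter arbitrarily far. The Student's t location
# score is bounded in the innovation, which is the robustness mechanism
# described above.

def gaussian_score(err):
    return err                                # unbounded in the innovation

def student_t_score(err, nu=5.0):
    return (nu + 1.0) * err / (nu + err ** 2) # bounded: peaks near |err| = sqrt(nu)

small = student_t_score(1.0)                  # ordinary observation: full update
huge = student_t_score(100.0)                 # outlier: update stays small
```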
https://arxiv.org/abs/2601.12931
Dengue, a mosquito-borne disease, continues to pose a persistent public health challenge in urban areas, particularly in tropical regions such as Singapore. Effective and affordable control requires anticipating where transmission risks are likely to emerge so that interventions can be deployed proactively rather than reactively. This study introduces a novel framework that uncovers and exploits latent transmission links between urban regions, mined directly from publicly available dengue case data. Instead of treating cases as isolated reports, we model how hotspot formation in one area is influenced by epidemic dynamics in neighboring regions. While mosquito movement is highly localized, long-distance transmission is often driven by human mobility, and in our case study, the learned network aligns closely with commuting flows, providing an interpretable explanation for citywide spread. These hidden links are optimized through gradient descent and used not only to forecast hotspot status but also to verify the consistency of spreading patterns, by examining the stability of the inferred network across consecutive weeks. Case studies on Singapore during 2013-2018 and 2020 show that four weeks of hotspot history are sufficient to achieve an average F-score of 0.79. Importantly, the learned transmission links align with commuting flows, highlighting the interpretable interplay between hidden epidemic spread and human mobility. By shifting from simply reporting dengue cases to mining and validating hidden spreading dynamics, this work transforms open web-based case data into a predictive and explanatory resource. The proposed framework advances epidemic modeling while providing a scalable, low-cost tool for public health planning, early intervention, and urban resilience.
https://arxiv.org/abs/2601.12856
Continual learning aims to enable neural networks to acquire new knowledge on sequential tasks. However, the key challenge in such settings is to learn new tasks without catastrophically forgetting previously learned tasks. We propose the Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG) optimizer, which enforces Fisher-orthogonal constraints on parameter updates to preserve old task performance while learning new tasks. Unlike existing methods that operate in Euclidean parameter space, FOPNG projects gradients onto the Fisher-orthogonal complement of previous task gradients. This approach unifies natural gradient descent with orthogonal gradient methods within an information-geometric framework. The resulting update direction is invariant under reparameterization, guarantees descent in the Fisher metric, and helps preserve prior task outputs. We provide theoretical analysis establishing the properties of the projected update, describe efficient and practical implementations using the diagonal Fisher, and demonstrate strong results on standard continual learning benchmarks such as Permuted-MNIST, Split-MNIST, Rotated-MNIST, Split-CIFAR10, and Split-CIFAR100.
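The projection step can be sketched with a diagonal Fisher and a single stored previous-task gradient (a simplification of FOPNG for illustration; variable names are ours, and damping and multi-task buffers are omitted):

```python
import numpy as np

# Fisher-orthogonal projected natural gradient with a diagonal Fisher:
# take the natural gradient of the new task, then remove its component
# along the old task's natural gradient under the Fisher inner product
# <a, b>_F = sum(a * F * b), so the update is Fisher-orthogonal to it.

def fopng_direction(g_new, g_prev, fisher_diag):
    u = g_new / fisher_diag            # natural gradient for the new task
    p = g_prev / fisher_diag           # natural gradient of the old task
    coef = (u * fisher_diag * p).sum() / (p * fisher_diag * p).sum()
    return u - coef * p

F = np.array([1.0, 2.0, 4.0])
g_new = np.array([1.0, -1.0, 0.5])
g_prev = np.array([0.5, 0.3, -0.2])
d = fopng_direction(g_new, g_prev, F)
fisher_inner = (d * F * (g_prev / F)).sum()   # ≈ 0 by construction
```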
https://arxiv.org/abs/2601.12816
In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an $\varepsilon$-stationary point at a rate of $O(\varepsilon^{-4})$, matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
https://arxiv.org/abs/2601.11016
The distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces has fundamental implications for computational algorithms. While these concepts coincide in finite dimensions, the Dvoretzky-Rogers theorem establishes their strict separation in general Banach spaces. We present a comprehensive characterization theorem unifying seven equivalent conditions for unconditional convergence: permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These theoretical results directly inform algorithmic stability analysis, governing permutation invariance in gradient accumulation for Stochastic Gradient Descent and justifying coefficient thresholding in frame-based signal processing. Our work bridges classical functional analysis with contemporary computational practice, providing rigorous foundations for order-independent and numerically robust summation processes.
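The algorithmic stake of conditional versus unconditional convergence shows up already in the alternating harmonic series: an accumulation procedure that consumed its terms in arbitrary order would be order-dependent. A short numerical illustration (a toy example, not from the paper):

```python
import math

# The alternating harmonic series sums to ln 2, but the Riemann
# rearrangement "two positive terms, then one negative" rearranges the
# same multiset of terms toward (3/2) ln 2 — convergence here is
# conditional, not unconditional, so summation order matters.

def alternating(n_terms):
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))

def rearranged(n_blocks):
    s, pos, neg = 0.0, 1, 2
    for _ in range(n_blocks):
        s += 1.0 / pos; pos += 2       # two odd-denominator (+) terms
        s += 1.0 / pos; pos += 2
        s -= 1.0 / neg; neg += 2       # one even-denominator (-) term
    return s

a = alternating(300_000)               # approaches ln 2
b = rearranged(100_000)                # approaches (3/2) ln 2
```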
https://arxiv.org/abs/2601.08512
Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain this intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient for the emergence of SDHs by formalizing them as modeling assumptions. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs that generalize to out-of-distribution prompts.
https://arxiv.org/abs/2601.08297
Low-cost inertial measurement units (IMUs) are widely utilized in mobile robot localization due to their affordability and ease of integration. However, their complex, nonlinear, and time-varying noise characteristics often lead to significant degradation in localization accuracy when applied directly for dead reckoning. To overcome this limitation, we propose a novel brain-inspired state estimation framework that combines a spiking neural network (SNN) with an invariant extended Kalman filter (InEKF). The SNN is designed to extract motion-related features from long sequences of IMU data affected by substantial random noise and is trained via a surrogate gradient descent algorithm to enable dynamic adaptation of the covariance noise parameter within the InEKF. By fusing the SNN output with raw IMU measurements, the proposed method enhances the robustness and accuracy of pose estimation. Extensive experiments conducted on the KITTI dataset and real-world data collected using a mobile robot equipped with a low-cost IMU demonstrate that the proposed approach outperforms state-of-the-art methods in localization accuracy and exhibits strong robustness to sensor noise, highlighting its potential for real-world mobile robot applications.
https://arxiv.org/abs/2601.08248
Machine unlearning enables data holders to remove the contribution of specified samples from trained models to protect their privacy. Paradoxically, however, most unlearning methods require the unlearning requesters to first upload their data to the server as a prerequisite for unlearning. These methods are infeasible in many privacy-preserving scenarios where servers are prohibited from accessing users' data, such as federated learning (FL). In this paper, we explore how to implement unlearning without revealing the data to be erased to the server. We propose \textbf{Blind Unlearning (BlindU)}, which carries out unlearning using compressed representations instead of the original inputs. BlindU involves only the server and the unlearning user: the user locally generates privacy-preserving representations, and the server performs unlearning solely on these representations and their labels. For FL model training, we employ the information bottleneck (IB) mechanism. The encoder of the IB-based FL model learns representations that discard as much task-irrelevant information from the inputs as possible, allowing FL users to generate compressed representations locally. For effective unlearning on compressed representations, BlindU integrates two dedicated unlearning modules tailored explicitly to IB-based models and uses a multiple gradient descent algorithm to balance forgetting and utility retention. While IB compression already protects the task-irrelevant information of the inputs, to further enhance privacy protection we introduce a noise-free differential privacy (DP) masking method applied to the raw data to be erased before compression. Theoretical analysis and extensive experimental results demonstrate the superiority of BlindU in privacy protection and unlearning effectiveness over the best existing privacy-preserving unlearning benchmarks.
https://arxiv.org/abs/2601.07214
Normalized difference indices have been a staple of remote sensing for decades: they stay reliable under lighting changes, produce bounded values, and connect well to biophysical signals. Even so, they are usually treated as a fixed pre-processing step with coefficients set to one, which limits how well they can adapt to a specific learning task. In this study, we introduce the Normalized Difference Layer, a differentiable neural network module. The proposed method keeps the classical idea but learns the band coefficients from data. We present a complete mathematical framework for integrating this layer into deep learning architectures that uses a softplus reparameterization to ensure positive coefficients and bounded denominators. We describe forward- and backward-pass algorithms enabling end-to-end training through backpropagation. This approach preserves the key benefits of normalized differences, namely illumination invariance and outputs bounded to $[-1,1]$, while allowing gradient descent to discover task-specific band weightings. We extend the method to work with signed inputs, so the layer can be stacked inside larger architectures. Experiments show that models using this layer reach classification accuracy similar to standard multilayer perceptrons while using about 75\% fewer parameters. They also handle multiplicative noise well: at 10\% noise, accuracy drops only 0.17\%, versus 3.03\% for baseline MLPs. The learned coefficient patterns stay consistent across different depths.
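The forward pass under the softplus reparameterization can be sketched in a few lines; a hedged numpy version (parameter names are ours, not the paper's), which recovers the classical index when both learned coefficients equal one:

```python
import numpy as np

# Normalized Difference Layer sketch: band coefficients are kept positive
# via softplus(theta), so for positive band inputs the output stays in
# [-1, 1] and the denominator stays bounded away from zero (eps guard).

def softplus(x):
    return np.log1p(np.exp(x))

def nd_layer(b1, b2, theta1, theta2, eps=1e-6):
    a1, a2 = softplus(theta1), softplus(theta2)   # positive coefficients
    return (a1 * b1 - a2 * b2) / (a1 * b1 + a2 * b2 + eps)

# Classical NDVI corresponds to both coefficients equal to 1:
# softplus(log(e - 1)) == log(1 + (e - 1)) == 1.
nir = np.array([0.5, 0.6])
red = np.array([0.1, 0.2])
theta_one = np.log(np.e - 1.0)
ndvi = nd_layer(nir, red, theta_one, theta_one)
```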
https://arxiv.org/abs/2601.06777
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
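The constrained-optimization picture can be reproduced on a two-parameter toy problem: start at a zero-loss (memorizing) point and watch gradient descent with weight decay drift along the zero-loss manifold toward the minimum-norm solution (purely illustrative, not the paper's network):

```python
# Model f(w1, w2) = w1 * w2 with target 1. The point (2, 0.5) already has
# zero training loss (it "memorizes"), but gradient descent with weight
# decay drifts along the zero-loss manifold w1 * w2 ≈ 1 toward the
# minimum-norm solution w1 = w2 ≈ 1, shrinking ||w||^2 from 4.25 toward 2.

lr, wd = 0.01, 1e-3
w1, w2 = 2.0, 0.5
for _ in range(300_000):
    r = w1 * w2 - 1.0                 # residual of the memorized fit
    g1 = 2.0 * r * w2 + 2.0 * wd * w1
    g2 = 2.0 * r * w1 + 2.0 * wd * w2
    w1, w2 = w1 - lr * g1, w2 - lr * g2

norm_sq = w1 ** 2 + w2 ** 2           # post-memorization norm minimization
```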
https://arxiv.org/abs/2511.01938
Robots deployed in dynamic environments must remain safe even when key physical parameters are uncertain or change over time. We propose Parameter-Robust Model Predictive Path Integral (PRMPPI) control, a framework that integrates online parameter learning with probabilistic safety constraints. PRMPPI maintains a particle-based belief over parameters via Stein Variational Gradient Descent, evaluates safety constraints using Conformal Prediction, and optimizes both a nominal performance-driven and a safety-focused backup trajectory in parallel. This yields a controller that is cautious at first, improves performance as parameters are learned, and ensures safety throughout. Simulation and hardware experiments demonstrate higher success rates, lower tracking error, and more accurate parameter estimates than baselines.
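The particle-based belief update rests on Stein Variational Gradient Descent; a minimal 1-D sketch with an RBF kernel and a standard normal target, as a toy stand-in for the parameter posterior (kernel bandwidth and step size are illustrative choices):

```python
import numpy as np

# 1-D SVGD: each particle moves along a kernel-weighted average of the
# target score (attraction) plus a kernel-gradient term (repulsion that
# keeps particles spread out).

def svgd_step(x, eps=0.05, h=0.5):
    diff = x[:, None] - x[None, :]               # diff[i, j] = x_i - x_j
    k = np.exp(-diff ** 2 / (2 * h ** 2))        # RBF kernel (symmetric)
    score = -x                                   # score of N(0, 1) at x_j
    phi = (k * score[None, :] + k * diff / h ** 2).mean(axis=1)
    return x + eps * phi

rng = np.random.default_rng(1)
x = rng.normal(loc=4.0, scale=0.1, size=50)      # particles start far off
for _ in range(2000):
    x = svgd_step(x)                             # transported toward N(0, 1)
```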
https://arxiv.org/abs/2601.02948
Robustness to malicious attacks is crucial for practical decentralized signal processing and machine learning systems. A typical example of such attacks is label poisoning, meaning that some agents possess corrupted local labels and share models trained on these poisoned data. To defend against malicious attacks, existing works often focus on designing robust aggregators; meanwhile, the weighted mean aggregator is typically considered a simple, vulnerable baseline. This paper analyzes the robustness of decentralized gradient descent under label poisoning attacks, considering both robust and weighted mean aggregators. Theoretical results reveal that the learning errors of robust aggregators depend on the network topology, whereas the performance of weighted mean aggregator is topology-independent. Remarkably, the weighted mean aggregator, although often considered vulnerable, can outperform robust aggregators under sufficient heterogeneity, particularly when: (i) the global contamination rate (i.e., the fraction of poisoned agents for the entire network) is smaller than the local contamination rate (i.e., the maximal fraction of poisoned neighbors for the regular agents); (ii) the network of regular agents is disconnected; or (iii) the network of regular agents is sparse and the local contamination rate is high. Empirical results support our theoretical findings, highlighting the important role of network topology in the robustness to label poisoning attacks.
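The contrast between the two aggregator families can be sketched with sign-flipped updates from poisoned agents (an illustrative centralized toy with a coordinate-wise trimmed mean as the robust aggregator, not the paper's decentralized setting):

```python
import numpy as np

# Weighted mean vs. a robust aggregator under label poisoning. Honest
# agents agree on one update; poisoned agents submit its sign-flip. The
# trimmed mean discards extremes per coordinate; the weighted mean is
# pulled toward the poisoned direction.

def weighted_mean(updates, weights):
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * updates).sum(axis=0) / w.sum()

def trimmed_mean(updates, trim=1):
    s = np.sort(updates, axis=0)           # sort each coordinate separately
    return s[trim:-trim].mean(axis=0)

honest = np.tile([1.0, -2.0], (8, 1))      # 8 regular agents
poisoned = np.tile([-1.0, 2.0], (2, 1))    # 2 poisoned agents flip signs
updates = np.vstack([honest, poisoned])

wm = weighted_mean(updates, np.ones(10))   # biased toward the attackers
tm = trimmed_mean(updates, trim=2)         # recovers the honest update
```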
https://arxiv.org/abs/2601.02682
We present a theory-first framework that interprets inference-time adaptation in large language models (LLMs) as online Bayesian state estimation. Rather than modeling rapid adaptation as implicit optimization or meta-learning, we formulate task- and context-specific learning as the sequential inference of a low-dimensional latent adaptation state governed by a linearized state-space model. Under Gaussian assumptions, adaptation follows a Kalman recursion with closed-form updates for both the posterior mean and covariance. This perspective elevates epistemic uncertainty to an explicit dynamical variable. We show that inference-time learning is driven by covariance collapse, i.e., rapid contraction of posterior uncertainty induced by informative tokens, which typically precedes convergence of the posterior mean. Using observability conditions on token-level Jacobians, we establish stability of the Bayesian filter, prove exponential covariance contraction rates, and derive mean-square error bounds. Gradient descent, natural-gradient methods, and meta-learning updates arise as singular, noise-free limits of the filtering dynamics, positioning optimization-based adaptation as a degenerate approximation of Bayesian inference. The resulting theory provides a unified probabilistic account of in-context learning, parameter-efficient adaptation, and test-time learning without parameter updates. It yields explicit guarantees on stability and sample efficiency, offers a principled interpretation of prompt informativeness via information accumulation, and clarifies the role of uncertainty dynamics absent from existing accounts. Minimal illustrative experiments corroborate the qualitative predictions of the theory.
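The covariance-collapse mechanism is already visible in a scalar Kalman recursion, with each informative token treated as a noisy observation of a static latent state (a toy instance of the filtering picture, not the LLM-scale model):

```python
# Scalar Kalman update for a static latent state with observation model
# y = mu + noise (H = 1, noise variance R). Each informative "token"
# contracts the posterior variance P; the mean settles as P collapses.

def kalman_update(mu, P, y, R=0.25):
    K = P / (P + R)                          # Kalman gain
    return mu + K * (y - mu), (1 - K) * P    # posterior mean, variance

mu, P = 0.0, 10.0                            # vague prior over the state
history = []
for y in [2.1, 1.9, 2.0, 2.05, 1.95]:       # informative observations
    mu, P = kalman_update(mu, P, y)
    history.append(P)                        # monotone covariance collapse
```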
https://arxiv.org/abs/2601.06100
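The core recursion can be sketched for a scalar state-space model. This is a hedged toy illustration, not the paper's construction: a static latent adaptation state is observed through noisy token-level measurements, and the standard Kalman updates show posterior covariance contracting ("covariance collapse") at every step, typically ahead of mean convergence.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar linearized state-space model: static latent adaptation state x,
# token-level observations y_t = h * x + noise (h plays the role of a
# token-level Jacobian; all values are illustrative).
x_true = 2.0
h = 1.0          # observation Jacobian, assumed constant for the sketch
r = 0.25         # observation noise variance
mean, cov = 0.0, 10.0   # diffuse Gaussian prior

covs = [cov]
for _ in range(20):
    y = h * x_true + rng.normal(scale=np.sqrt(r))
    K = cov * h / (h**2 * cov + r)        # Kalman gain
    mean = mean + K * (y - h * mean)      # posterior mean update
    cov = (1.0 - K * h) * cov             # posterior covariance update
    covs.append(cov)
```

Each observation strictly shrinks the posterior variance (cov' = cov * r / (h^2 * cov + r) < cov), which is the deterministic contraction the paper generalizes to exponential rates under observability conditions.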
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
https://arxiv.org/abs/2601.00671
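The two ingredients of FwPKM, product-key lookup and fast-weight value updates, can be sketched in a few lines. This is a simplified illustration under assumed shapes, not the paper's architecture: hard top-1 retrieval replaces soft scoring, and the "local gradient step" is a plain squared-error update on the selected value slot.

```python
import numpy as np

rng = np.random.default_rng(2)

n, half, dv = 8, 4, 5           # sub-key codebook size, half-query dim, value dim
subkeys1 = rng.normal(size=(n, half))
subkeys2 = rng.normal(size=(n, half))
values = np.zeros((n * n, dv))  # fast weights, updated online

def retrieve(q):
    """Product-key lookup: score each query half against its sub-key
    codebook; the pair of argmax indices selects one of n*n value slots."""
    i = int(np.argmax(subkeys1 @ q[:half]))
    j = int(np.argmax(subkeys2 @ q[half:]))
    return i * n + j

def write(q, target, lr=0.5):
    """Fast-weight update: one local gradient step on the squared
    retrieval error 0.5 * ||values[idx] - target||^2."""
    idx = retrieve(q)
    values[idx] += lr * (target - values[idx])

q = rng.normal(size=2 * half)
target = rng.normal(size=dv)
for _ in range(10):             # a few chunk-level write steps
    write(q, target)
```

After a handful of local steps the selected slot converges to the stored value, giving the rapid memorize-then-retrieve behavior of an episodic memory; the product-key factorization keeps lookup cost at O(n) scores instead of O(n^2) slots.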
This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, and show that minimising the in-context loss leads to a phase transition in the learned parameters. Above a critical context length, the solution develops a skew-symmetric component. We prove this by reducing the forward pass of the linear transformer under weight tying to preconditioned gradient descent, and then analysing the optimal preconditioner. This preconditioner includes a skew-symmetric component, which induces a rotation of the gradient direction. For model collapse, we use martingale and random walk theory to analyse simplified settings - linear regression and Gaussian fitting - under both replacing and cumulative data regimes. We strengthen existing results by proving almost sure convergence, showing that collapse occurs unless the data grows sufficiently fast or is retained over time. Finally, we introduce the notion of context collapse: a degradation of context during long generations, especially in chain-of-thought reasoning. This concept links the dynamics of ICL with long-term stability challenges in generative models.
https://arxiv.org/abs/2601.00923
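The reduction of a tied-weight linear transformer's forward pass to preconditioned gradient descent can be sketched numerically. This is an illustrative toy, not the thesis's derivation: an in-context linear regression is solved by gradient steps preconditioned by a matrix with a symmetric part S and a skew-symmetric part A, where the skew part only rotates the gradient (g^T A g = 0 for any g).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 32

# In-context linear regression task: context pairs (X, y) with y = X @ w_star.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star

# Preconditioner = symmetric part + skew-symmetric part, mirroring the
# structure of the optimal tied-weight solution (scales are illustrative).
S = 0.1 * np.eye(d)
M = rng.normal(size=(d, d))
A = 0.02 * (M - M.T)                  # skew-symmetric: A.T == -A

w = np.zeros(d)
for _ in range(200):
    g = X.T @ (X @ w - y) / n         # least-squares gradient
    w = w - (S + A) @ g               # preconditioned gradient step
```

The symmetric part rescales the gradient while the skew part induces a pure rotation of the update direction, which is the geometric signature of the phase transition the thesis identifies above the critical context length.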