This paper presents a performance benchmarking study of a Gradient-Optimized Fuzzy Inference System (GF) classifier against several state-of-the-art machine learning models, including Random Forest, XGBoost, Logistic Regression, Support Vector Machines, and Neural Networks. The evaluation was conducted across five datasets from the UCI Machine Learning Repository, chosen for their diversity in input types, class distributions, and classification complexity. Unlike traditional Fuzzy Inference Systems that rely on derivative-free optimization methods, the GF leverages gradient descent to significantly improve training efficiency and predictive performance. Results demonstrate that the GF model achieved competitive, and in several cases superior, classification accuracy while maintaining high precision and exceptionally low training times. In particular, the GF exhibited strong consistency across folds and datasets, underscoring its robustness in handling noisy data and variable feature sets. These findings support the potential of gradient-optimized fuzzy systems as interpretable, efficient, and adaptable alternatives to more complex deep learning models in supervised learning tasks.
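To make the setup concrete, here is a minimal sketch, assuming Gaussian membership functions, a product t-norm, and a sigmoid readout, of a fuzzy rule base trained by gradient descent rather than derivative-free search. It illustrates the general technique, not the paper's GF implementation; for brevity only the rule consequents are updated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)           # toy binary labels

n_rules = 5
centers = rng.normal(size=(n_rules, X.shape[1]))    # rule centers (learnable)
widths = np.ones((n_rules, X.shape[1]))             # rule widths (learnable)
weights = rng.normal(size=n_rules)                  # rule consequents (learnable)

def forward(X):
    d = (X[:, None, :] - centers) / widths
    firing = np.exp(-0.5 * d ** 2).prod(axis=2)     # product t-norm (fuzzy AND)
    logits = firing @ weights
    return firing, 1.0 / (1.0 + np.exp(-logits))    # sigmoid readout

lr = 0.1
for _ in range(500):
    firing, p = forward(X)
    grad_w = firing.T @ (p - y) / len(X)            # dBCE/dweights
    weights -= lr * grad_w                          # plain gradient descent
    # (a full version would also backprop into centers and widths)

_, p = forward(X)
print("train accuracy:", ((p > 0.5) == y).mean())
```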
https://arxiv.org/abs/2504.16263
Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective in preserving continuity across frames at low training cost. In this work, we provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss. First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient. Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum if the learning rate is within an appropriate range. Finally, we analyze the stability of modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical findings reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insight into video generation tasks.
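As an illustration of the kind of objective the analysis covers, the sketch below pairs a residual adapter with an L2 temporal consistency loss over per-frame features. The `Adapter` module, feature shapes, and loss form are assumptions for exposition, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small learnable residual bottleneck inserted into a frozen backbone."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def temporal_consistency_loss(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, D) per-frame adapter outputs; penalize frame-to-frame drift
    return ((feats[1:] - feats[:-1]) ** 2).sum(dim=-1).mean()

adapter = Adapter(dim=128)
frames = torch.randn(8, 128)      # stand-in backbone features for 8 frames
loss = temporal_consistency_loss(adapter(frames))
loss.backward()                   # differentiable, as the analysis requires
```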
https://arxiv.org/abs/2504.16016
This paper presents a novel cascade nonlinear observer framework for inertial state estimation. It tackles the problem of intermediate state estimation when external localization is unavailable or in the event of a sensor outage. The proposed observer comprises two nonlinear observers based on a recently developed iteratively preconditioned gradient descent (IPG) algorithm. It takes its inputs from an IMU preintegration model; the first observer is a quaternion-based IPG, whose output is the input to the second observer, which estimates the velocity and, consequently, the position. The proposed observer is validated on a public underwater dataset and in a real-world experiment using our robot platform. The estimation is compared with an extended Kalman filter (EKF) and an invariant extended Kalman filter (InEKF). Results demonstrate that our method outperforms both, achieving better positional accuracy and lower variance.
https://arxiv.org/abs/2504.15235
While test-time fine-tuning is beneficial in few-shot learning, the need for multiple backpropagation steps can be prohibitively expensive in real-time or low-resource scenarios. To address this limitation, we propose an approach that emulates gradient descent without computing gradients, enabling efficient test-time adaptation. Specifically, we formulate gradient descent as an Euler discretization of an ordinary differential equation (ODE) and train an auxiliary network to predict the task-conditional drift using only the few-shot support set. The adaptation then reduces to a simple numerical integration (e.g., via the Euler method), which requires only a few forward passes of the auxiliary network -- no gradients or forward passes of the target model are needed. In experiments on cross-domain few-shot classification using the Meta-Dataset and CDFSL benchmarks, our method significantly improves out-of-domain performance over the non-fine-tuned baseline while incurring only 6\% of the memory cost and 0.02\% of the computation time of standard fine-tuning, thus establishing a practical middle ground between direct transfer and fully fine-tuned approaches.
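The core recipe lends itself to a short sketch: treat adaptation as integrating a learned drift field with the Euler method, so test-time adaptation costs only forward passes of an auxiliary network. The `DriftNet` below and its inputs (current parameters plus a support-set embedding) are hypothetical stand-ins for the paper's auxiliary network.

```python
import torch
import torch.nn as nn

class DriftNet(nn.Module):
    """Hypothetical auxiliary net: predicts d(theta)/dt from the current
    parameters and an embedding of the few-shot support set."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, theta, task_embedding):
        return self.net(torch.cat([theta, task_embedding], dim=-1))

@torch.no_grad()
def euler_adapt(theta, task_embedding, drift, steps=5, dt=0.1):
    # theta_{k+1} = theta_k + dt * f(theta_k, task): only forward passes of
    # the drift net, never a gradient or forward pass of the target model
    for _ in range(steps):
        theta = theta + dt * drift(theta, task_embedding)
    return theta

dim = 64
theta0 = torch.zeros(dim)            # e.g., a classifier head to adapt
task = torch.randn(dim)              # support-set embedding (stand-in)
theta_adapted = euler_adapt(theta0, task, DriftNet(dim))
```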
https://arxiv.org/abs/2504.15323
One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made -- using any image editing tool -- on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome this, our method eliminates the need for DDIM inversion by approaching OCVE from a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce temporal-content consistency sampling (TCS), based on Stein Variational Gradient Descent, to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.
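For readers unfamiliar with the sampler that TCS builds on, the sketch below runs plain Stein Variational Gradient Descent on a toy Gaussian target; the kernelized update combines an attractive score term with a repulsive term that keeps particles diverse. The target and kernel bandwidth are illustrative, not the paper's TCS objective.

```python
import numpy as np

def svgd_step(x, score, h=1.0, lr=0.1):
    diff = x[:, None, :] - x[None, :, :]               # diff[i, j] = x_i - x_j
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))    # RBF kernel matrix
    # repulsive term: sum_j grad_{x_j} k(x_j, x_i) = sum_j (x_i - x_j)/h^2 k_ij
    grad_k = (diff / h ** 2 * k[:, :, None]).sum(axis=1)
    phi = (k @ score(x) + grad_k) / len(x)             # Stein update direction
    return x + lr * phi

score = lambda x: -x                                   # standard Gaussian score
particles = np.random.default_rng(0).normal(3.0, 0.5, size=(50, 2))
for _ in range(200):
    particles = svgd_step(particles, score)
print("particle mean ~ 0:", particles.mean(axis=0))
```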
https://arxiv.org/abs/2504.14335
One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from the continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate the trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of the mean teacher and NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
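The mean teacher update itself is a one-liner: the teacher tracks an exponential moving average of the student, which is what makes it a slow, proximal reference point. A hedged reading of the NLUL loss is included as a comment; the exact formula is defined in the paper, and the version here is inferred from the name only.

```python
import torch

@torch.no_grad()
def mean_teacher_update(teacher, student, ema_decay=0.999):
    # teacher <- ema_decay * teacher + (1 - ema_decay) * student
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)

# Hedged reading of "negative log-unlikelihood" on a token probability p,
# inferred from the name only (the paper defines the exact loss):
#   nlul = -torch.log(1.0 - p)

# usage: after each student gradient step on the unlearning objective,
#   mean_teacher_update(teacher_model, student_model)
```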
https://arxiv.org/abs/2504.13388
Inferring an adversary's goals from exhibited behavior is crucial for counterplanning and non-cooperative multi-agent systems in domains like cybersecurity, military, and strategy games. Deep Inverse Reinforcement Learning (IRL) methods based on maximum entropy principles show promise in recovering adversaries' goals but are typically offline, require large batch sizes with gradient descent, and rely on first-order updates, limiting their applicability in real-time scenarios. We propose an online Recursive Deep Inverse Reinforcement Learning (RDIRL) approach to recover the cost function governing the adversary's actions and goals. Specifically, we minimize an upper bound on the standard Guided Cost Learning (GCL) objective using sequential second-order Newton updates, akin to the Extended Kalman Filter (EKF), yielding a fast-converging learning algorithm. We demonstrate that RDIRL is able to recover the cost and reward functions of expert agents in standard and adversarial benchmark tasks. Experiments on benchmark tasks show that our proposed approach outperforms several leading IRL algorithms.
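A rough sketch of the recursive second-order flavor of update, shown on a toy quadratic objective: a running inverse-Hessian estimate plays the role of the EKF covariance, and each observation triggers one Newton-style correction. This illustrates the update pattern only, not the paper's GCL upper bound.

```python
import numpy as np

def recursive_newton_step(theta, P, grad, hess, damping=1e-3):
    # P approximates the inverse Hessian (the EKF covariance analogue)
    P = np.linalg.inv(np.linalg.inv(P) + hess + damping * np.eye(len(theta)))
    return theta - P @ grad, P

# toy stream of quadratic losses l(theta) = 0.5 * (a . theta - b)^2
rng = np.random.default_rng(0)
theta, P = np.zeros(3), np.eye(3)
theta_true = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    a = rng.normal(size=3)
    b = a @ theta_true
    grad = (a @ theta - b) * a                  # per-observation gradient
    theta, P = recursive_newton_step(theta, P, grad, np.outer(a, a))
print(theta)                                    # approaches theta_true
```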
https://arxiv.org/abs/2504.13241
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across the various modules of the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves 1.4\%-4.8\% mIoU improvements and reduces memory consumption by 27\%-72\%.
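The reinterpretation at the heart of the fusion strategy is easy to state in code: a vanilla RNN's convex state blend is exactly one gradient descent step on a feature-space quadratic, which is what lets heterogeneous temporal cues be folded in as additional gradient terms. The sketch below shows the base identity with illustrative shapes.

```python
import torch

def rnn_update_as_gradient_step(h, x, lr=0.3):
    # one gradient step on L(h) = 0.5 * ||h - x||^2, since dL/dh = h - x
    return h - lr * (h - x)          # == (1 - lr) * h + lr * x, a vanilla RNN blend

h = torch.zeros(16)                  # running fused feature (history state)
for x in torch.randn(5, 16):         # stream of per-frame temporal cues
    h = rnn_update_as_gradient_step(h, x)
```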
https://arxiv.org/abs/2504.12959
Deep learning (DL)-based image classification models are essential for autonomous vehicle (AV) perception modules, since incorrect categorization might have severe repercussions. Adversarial attacks are widely studied cyberattacks that can lead DL models to predict inaccurate output, such as incorrectly classified traffic signs by the perception module of an autonomous vehicle. In this study, we create and compare hybrid classical-quantum deep learning (HCQ-DL) models with classical deep learning (C-DL) models to demonstrate robustness against adversarial attacks for perception modules. We use the transfer learning models AlexNet and VGG-16 as feature extractors before feeding the features into the quantum system. We tested over 1000 quantum circuits in our HCQ-DL models against projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA), three well-known untargeted adversarial approaches. We evaluated the performance of all models under adversarial-attack and no-attack scenarios. Our HCQ-DL models maintain accuracy above 95\% in the no-attack scenario and above 91\% under GA and FGSA attacks, which is higher than the C-DL models. During the PGD attack, our AlexNet-based HCQ-DL model maintained an accuracy of 85\%, compared to C-DL models that achieved accuracies below 21\%. Our results highlight that HCQ-DL models provide improved accuracy for traffic sign classification under adversarial settings compared to their classical counterparts.
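Of the three attacks, PGD is the strongest and the most standard; a textbook sketch follows, in which the model, epsilon budget, and step schedule are placeholders. The attack repeatedly ascends the loss and projects back into an L-infinity ball around the clean image.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascent on the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project to L-inf ball
            x_adv = x_adv.clamp(0, 1)                  # keep a valid image
    return x_adv.detach()

# toy usage with a placeholder classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
```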
https://arxiv.org/abs/2504.12644
Poisson-Gaussian noise describes the noise of various imaging systems, hence the need for efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which learn only a regularization through a denoiser, allowing images from several sources to be restored with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR to the specificities of Poisson-Gaussian noise and, in particular, propose an efficient initialization of the gradient descent required for the proximal step that accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.
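The bottleneck being addressed can be made concrete: each PnP iteration needs the proximal operator of the Poisson-Gaussian data-fidelity term, which has no closed form and is therefore solved by an inner gradient descent whose warm start matters. The sketch below uses an assumed heteroscedastic-Gaussian approximation of the likelihood and the Gaussian-case closed form as the initialization; both are illustrations, not the paper's exact scheme.

```python
import numpy as np

def prox_pg_fidelity(z, y, a, sigma2, rho, iters=50, lr=0.1):
    # minimize_x  NLL(x; y) + (rho / 2) * ||x - z||^2, where the
    # Poisson-Gaussian NLL is approximated (an assumption for illustration)
    # by a heteroscedastic Gaussian: NLL(x) ~ (x - y)^2 / (2 * (a * x + sigma2))
    x = (rho * z + y / sigma2) / (rho + 1.0 / sigma2)   # Gaussian-case warm start
    for _ in range(iters):
        var = np.maximum(a * x + sigma2, 1e-8)
        grad = (x - y) / var - a * (x - y) ** 2 / (2 * var ** 2) + rho * (x - z)
        x = x - lr * grad
    return x

y = np.array([5.0, 2.0, 9.0])     # noisy observation
z = np.array([4.0, 2.5, 8.0])     # denoiser output from the PnP outer loop
print(prox_pg_fidelity(z, y, a=0.1, sigma2=1.0, rho=1.0))
```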
https://arxiv.org/abs/2504.10375
Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) method, obtained by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients onto a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer, for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances the feasibility of privacy-preserving deployment on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate a significant performance enhancement, with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ($\epsilon=0.2$), compared to 63.85% for standard DP-SGD. This work lays the foundation for privacy-preserving human pose estimation in real-world, sensitive applications.
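A condensed sketch of the privatized update helps fix ideas: DP-SGD clips per-sample gradients and adds calibrated noise, and PDP-SGD additionally projects the noisy mean gradient onto a low-dimensional subspace before the step. The subspace `U` and the clipping and noise constants below are illustrative assumptions.

```python
import torch

def pdp_sgd_step(params, per_sample_grads, U, clip=1.0, noise_mult=1.0, lr=0.1):
    # per_sample_grads: (B, D) flattened per-example gradients
    norms = per_sample_grads.norm(dim=1, keepdim=True).clamp(min=1e-12)
    clipped = per_sample_grads * (clip / norms).clamp(max=1.0)  # per-sample clip
    g = clipped.mean(dim=0)
    g = g + torch.randn_like(g) * (noise_mult * clip / len(per_sample_grads))
    g = U @ (U.T @ g)                 # PDP-SGD: project onto span(U)
    return params - lr * g

D, k = 1000, 16
U, _ = torch.linalg.qr(torch.randn(D, k))   # assumed low-dim gradient subspace
params = torch.zeros(D)
params = pdp_sgd_step(params, torch.randn(32, D), U)
```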
https://arxiv.org/abs/2504.10190
We present a design called \emph{Proof of Gradient Optimization} (PoGO) for blockchain consensus, where miners produce verifiable evidence of training large-scale machine-learning models. Building on previous work, we incorporate \emph{quantized gradients} (4-bit precision) to reduce storage and computation requirements, while still preserving the ability of verifiers to check that real progress has been made on lowering the model's loss. Additionally, we employ Merkle proofs over the full 32-bit model to handle large parameter sets and to enable random leaf checks with minimal on-chain data. We illustrate these ideas using GPT-3 (175B parameters) as a reference example and also refer to smaller but high-performance models (e.g., \emph{Gemma~3} with 27B parameters). We provide an empirical cost analysis showing that verification is significantly cheaper than training, thanks in part to quantization and sampling. We also discuss the necessity of longer block times (potentially hours) when incorporating meaningful training steps, the trade-offs when using specialized GPU hardware, and how binary diffs may incrementally optimize updates. Finally, we note that fine-tuning can be handled in a similar manner, merely changing the dataset and the manner of sampling but preserving the overall verification flow. Our protocol allows verifiers to issue either \emph{positive} or \emph{negative} attestations; these are aggregated at finalization to either confirm the update or slash the miner.
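The verification primitive is standard enough to sketch: parameters are committed by a Merkle root, and a verifier audits randomly sampled leaves with logarithmic-size proofs, keeping on-chain data minimal. Chunking and the hash choice below are assumptions.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                       # duplicate odd tail
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    level, proof = [h(x) for x in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = idx ^ 1
        proof.append((level[sib], sib < idx))             # (hash, sibling-is-left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(root, leaf, proof):
    node = h(leaf)
    for sib, sib_is_left in proof:
        node = h(sib + node) if sib_is_left else h(node + sib)
    return node == root

chunks = [bytes([i]) * 32 for i in range(8)]              # stand-in parameter chunks
root = merkle_root(chunks)
assert verify(root, chunks[3], merkle_proof(chunks, 3))   # random leaf audit
```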
https://arxiv.org/abs/2504.07540
We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove that the generalization error is independent of the network architecture.
https://arxiv.org/abs/2504.05695
Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state, hence incorporating only novel, non-redundant data and minimizing interference with previously stored information. The experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths, with the performance improvement becoming more pronounced as the context length increases.
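The orthogonal update can be written in a few lines: each slot absorbs only the component of the incoming signal orthogonal to its current state, so redundant information is discarded before it can interfere. Slot count, dimensions, and the per-slot normalization below are illustrative, not Lattice's exact parameterization.

```python
import torch

def orthogonal_slot_update(M, x, lr=0.5):
    # M: (slots, dim) memory; x: (dim,) incoming token feature
    Mn = M / (M.norm(dim=1, keepdim=True) + 1e-8)   # unit slot directions
    parallel = (Mn @ x)[:, None] * Mn               # component along each slot
    x_orth = x[None, :] - parallel                  # novel, non-redundant part
    return M + lr * x_orth

M = torch.randn(4, 32)               # fixed number of memory slots
for x in torch.randn(10, 32):        # stream of token features
    M = orthogonal_slot_update(M, x)
```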
https://arxiv.org/abs/2504.05646
Non-convex constrained optimizations are ubiquitous in robotic applications such as multi-agent navigation, UAV trajectory optimization, and soft robot simulation. For this problem class, conventional optimizers suffer from small step sizes and slow convergence. We propose BC-ADMM, a variant of the Alternating Direction Method of Multipliers (ADMM), that can solve a class of non-convex constrained optimizations with biconvex constraint relaxation. Our algorithm allows larger step sizes by breaking the problem into small-scale sub-problems that can be easily solved in parallel. We show that our method has both theoretical convergence speed guarantees and practical convergence guarantees in the asymptotic sense. Through numerical experiments in four robotic applications, we show that BC-ADMM converges faster than conventional gradient descent and Newton's method in terms of wall clock time.
https://arxiv.org/abs/2504.05465
There exist many methods to explain how an image classification model generates its decision, but very little work has explored methods to explain why a classifier might lack confidence in its prediction. As there are various reasons the classifier might lose confidence, it would be valuable for this model to not only indicate its level of uncertainty but also explain why it is uncertain. Counterfactual images have been used to visualize changes that could be made to an image to generate a different classification decision. In this work, we explore the use of counterfactuals to offer an explanation for low model competency, a generalized form of predictive uncertainty that measures confidence. Toward this end, we develop five novel methods to generate high-competency counterfactual images, namely Image Gradient Descent (IGD), Feature Gradient Descent (FGD), Autoencoder Reconstruction (Reco), Latent Gradient Descent (LGD), and Latent Nearest Neighbors (LNN). We evaluate these methods across two unique datasets containing images with six known causes of low model competency and find Reco, LGD, and LNN to be the most promising methods for counterfactual generation. We further evaluate how these three methods can be utilized by pre-trained Multimodal Large Language Models (MLLMs) to generate language explanations for low model competency. We find that including a counterfactual image in the language model query greatly increases the model's ability to generate an accurate explanation for the cause of low model competency, demonstrating the utility of counterfactual images in explaining low perception model competency.
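The simplest of the five generators, IGD, is easy to sketch: ascend a differentiable competency score directly in pixel space while keeping the image valid. The competency estimator below is a toy placeholder; the method assumes only that such a score is differentiable.

```python
import torch

def igd_counterfactual(image, competency_fn, steps=100, lr=0.05):
    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-competency_fn(x)).backward()   # ascend the competency score
        opt.step()
        with torch.no_grad():
            x.clamp_(0, 1)               # stay a valid image
    return x.detach()

# toy placeholder: "competency" grows with image brightness
competency_fn = lambda x: x.mean()
cf = igd_counterfactual(torch.rand(3, 64, 64), competency_fn)
```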
https://arxiv.org/abs/2504.05254
Understanding the spatial and temporal patterns of environmental exposure to radio-frequency electromagnetic fields (RF-EMF) is essential for conducting risk assessments. These assessments aim to explore potential connections between RF-EMF exposure and its effects on human health, as well as on wildlife and plant life. Existing research has used different machine learning tools for EMF exposure estimation; however, a comparative analysis of these techniques is required to better understand their performance on real-world datasets. In this work, we present both finite- and infinite-width convolutional network-based methods to estimate and assess EMF exposure levels from 70 real-world sensors in Lille, France. A comparative analysis was conducted to assess the methods' execution time and estimation accuracy. To improve estimation accuracy for higher-resolution grids, we utilized a preconditioned gradient descent method for kernel estimation. Root Mean Square Error (RMSE) is used as the evaluation criterion for comparing the performance of these deep learning models.
https://arxiv.org/abs/2504.07990
Gaussian Process Motion Planning (GPMP) is a widely used framework for generating smooth trajectories within a limited compute time--an essential requirement in many robotic applications. However, traditional GPMP approaches often struggle with enforcing hard nonlinear constraints and rely on Maximum a Posteriori (MAP) solutions that disregard the full Bayesian posterior. This limits planning diversity and ultimately hampers decision-making. Recent efforts to integrate Stein Variational Gradient Descent (SVGD) into motion planning have shown promise in handling complex constraints. Nonetheless, these methods still face persistent challenges, such as difficulties in strictly enforcing constraints and inefficiencies when the probabilistic inference problem is poorly conditioned. To address these issues, we propose a novel constrained Stein Variational Gaussian Process Motion Planning (cSGPMP) framework, incorporating a GPMP prior specifically designed for trajectory optimization under hard constraints. Our approach improves the efficiency of particle-based inference while explicitly handling nonlinear constraints. This advancement significantly broadens the applicability of GPMP to motion planning scenarios demanding robust Bayesian inference, strict constraint adherence, and computational efficiency within a limited time. We validate our method on standard benchmarks, achieving an average success rate of 98.57% across 350 planning tasks, significantly outperforming competitive baselines. This demonstrates the ability of our method to discover and use diverse trajectory modes, enhancing flexibility and adaptability in complex environments, and delivering significant improvements over standard baselines without incurring major computational costs.
https://arxiv.org/abs/2504.04936
Recent advancements in Transformer-based architectures have led to impressive breakthroughs in natural language processing tasks, with models such as GPT-4, Claude, and Gemini demonstrating human-level reasoning abilities. However, despite their high performance, concerns remain about the inherent limitations of these models, especially when it comes to learning basic logical functions. While complexity-theoretic analyses indicate that Transformers can represent simple logic functions (e.g., $\mathsf{AND}$, $\mathsf{OR}$, and majority gates) by virtue of belonging to the $\mathsf{TC}^0$ class, these results assume ideal parameter settings and do not account for the constraints imposed by gradient descent-based training methods. In this work, we investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. We focus on a simplified variant of the Transformer architecture and consider both $n=\mathrm{poly}(d)$ and $n=\exp(\Omega(d))$ training samples, where each sample is a $d$-size binary string paired with the output of a basic majority function. Our analysis demonstrates that even after $\mathrm{poly}(d)$ gradient queries, the generalization error of the Transformer model remains substantially large, growing exponentially with $d$. This work highlights fundamental optimization challenges in training Transformers for the simplest logical reasoning tasks and provides new insights into their theoretical limitations.
https://arxiv.org/abs/2504.04702
Linear attention methods offer a compelling alternative to softmax attention due to their efficiency in recurrent decoding. Recent research has focused on enhancing standard linear attention by incorporating gating while retaining its computational benefits. Such Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. In this work, we investigate the in-context learning capabilities of the GLA model and make the following contributions. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms with data-dependent weights. These weights are induced by the gating mechanism and the input, enabling the model to control the contribution of individual tokens to prediction. To further understand the mechanics of this weighting, we introduce a novel data model with multitask prompts and characterize the optimization landscape of learning a WPGD algorithm. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution. Finally, we translate these findings to explore the optimization landscape of GLA and shed light on how gating facilitates context-aware learning and when it is provably better than vanilla linear attention.
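The WPGD predictor that GLA is shown to implement admits a compact sketch: one preconditioned gradient step on weighted in-context least squares, evaluated at a query point. The preconditioner and uniform weights below are illustrative; in GLA the weights are induced by the gating and the input.

```python
import torch

def wpgd_predict(X, y, x_query, P, w):
    # one WPGD step from theta = 0 on L(theta) = sum_i w_i (x_i . theta - y_i)^2
    grad = -2 * X.T @ (w * y)         # gradient of L at theta = 0
    theta = -P @ grad                 # preconditioned descent step
    return x_query @ theta

d, n = 8, 32
X, x_query = torch.randn(n, d), torch.randn(d)
y = X @ torch.randn(d)                # in-context linear regression data
P = torch.eye(d) / (2 * n)            # illustrative preconditioner
w = torch.ones(n)                     # uniform weights (vanilla linear attention)
print(wpgd_predict(X, y, x_query, P, w))
```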
https://arxiv.org/abs/2504.04308