In recent studies, line search methods have been shown to significantly improve the performance of traditional stochastic gradient descent techniques, eliminating the need for a specific learning rate schedule. In this paper, we identify existing issues in state-of-the-art line search methods, propose enhancements, and rigorously evaluate their effectiveness. We test these methods on larger datasets and more complex data domains than before. Specifically, we improve the Armijo line search by integrating the momentum term from Adam into its search direction, enabling efficient large-scale training, a task that was previously prone to failure with Armijo line search methods. Our optimization approach outperforms both the previous Armijo implementation and tuned learning rate schedules for Adam. Our evaluation focuses on Transformers and CNNs in the domains of NLP and image data. Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer.
https://arxiv.org/abs/2403.18519
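A minimal sketch of the kind of step described above, assuming a PyTorch setting: an Adam-style search direction combined with Armijo backtracking over the step size. The function name `armijo_adam_step` and all hyperparameters are our own illustration, not the published package's API.

```python
import torch

def armijo_adam_step(params, closure, state, lr_max=1.0, c=0.1,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One step: Adam-style direction + Armijo backtracking on the step size.

    `closure()` re-evaluates the mini-batch loss at the current parameters.
    """
    loss = closure()
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]
    dirs = []
    for i, g in enumerate(grads):
        m = state.setdefault(('m', i), torch.zeros_like(g))
        v = state.setdefault(('v', i), torch.zeros_like(g))
        m.mul_(beta1).add_(g, alpha=1 - beta1)          # first moment (momentum)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second moment
        dirs.append(m / (v.sqrt() + eps))               # Adam-style direction
    g_dot_d = sum((g * d).sum() for g, d in zip(grads, dirs))
    eta, new_loss = lr_max, loss
    with torch.no_grad():
        for _ in range(10):                             # backtracking loop
            for p, d in zip(params, dirs):
                p.sub_(eta * d)                         # try the step
            new_loss = closure()
            if new_loss <= loss - c * eta * g_dot_d:    # Armijo condition
                break
            for p, d in zip(params, dirs):
                p.add_(eta * d)                         # undo, halve the step
            eta *= 0.5
    for p in params:
        p.grad = None
    return new_loss
```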
Recent works have shown that line search methods greatly increase the performance of traditional stochastic gradient descent methods on a variety of datasets and architectures [1], [2]. In this work we succeed in extending line search methods to the novel and highly popular Transformer architecture and to dataset domains in natural language processing. More specifically, we combine the Armijo line search with the Adam optimizer and extend it by subdividing the network architecture into sensible units, performing the line search separately on these local units. Our optimization method outperforms the traditional Adam optimizer and achieves significant performance improvements for small datasets or small training budgets, while performing equal to or better than it in the other tested cases. Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer that is compatible with arbitrary network architectures.
https://arxiv.org/abs/2403.18506
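A hedged sketch of the "local units" idea described above: the network is partitioned into units (here, simply its top-level child modules, our guess at a "sensible" split) and the line search step size is chosen separately per unit. `armijo_adam_step` is the sketch from the previous entry.

```python
import torch.nn as nn

def local_line_search_step(model: nn.Module, closure, states: dict):
    """Run a separate Armijo/Adam line search for each top-level module."""
    losses = []
    for name, module in model.named_children():        # one "unit" per child
        params = [p for p in module.parameters() if p.requires_grad]
        if not params:
            continue
        state = states.setdefault(name, {})            # per-unit Adam moments
        losses.append(armijo_adam_step(params, closure, state))
    return losses
```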
Past decades have witnessed great interest in the distinction and connection between neural network learning and kernel learning. Recent advancements have made theoretical progress in connecting infinitely wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descent, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks under gradient descent and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, behaving like the NTK for a finite learning step and converging to the NNGP as the learning step approaches infinity. We also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.
https://arxiv.org/abs/2403.17467
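As a concrete point of reference for the zero- and first-order kernels discussed above, the following sketch computes the empirical NTK of a small scalar-output MLP, $\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$. The UNK kernel itself is not reproduced here; the network and names are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def param_grad(x):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

def empirical_ntk(x1, x2):
    return torch.dot(param_grad(x1), param_grad(x2)).item()

x, y = torch.randn(4), torch.randn(4)
print(empirical_ntk(x, y))   # kernel value at a single pair of inputs
```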
Aiming to enhance the utilization of metric space by the parametric softmax classifier, recent studies suggest replacing it with a non-parametric alternative. Although a non-parametric classifier may provide better metric space utilization, it introduces the challenge of capturing inter-class relationships. A shared characteristic among prior non-parametric classifiers is the static assignment of labels to prototypes during training, i.e., each prototype consistently represents a class throughout the training course. Orthogonal to previous works, we present a simple yet effective method to optimize the category assigned to each prototype (the label-to-prototype assignment) during training. To this aim, we formalize the problem as a two-step optimization objective over the network parameters and the label-to-prototype assignment mapping. We solve this optimization using a sequential combination of gradient descent and bipartite matching. We demonstrate the benefits of the proposed approach by conducting experiments on balanced and long-tail classification problems using different backbone network architectures. In particular, our method outperforms its competitors by 1.22% accuracy on CIFAR-100 and 2.15% on ImageNet-200, using a metric space dimension half the size of its competitors'. Code: this https URL
https://arxiv.org/abs/2403.16937
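A hedged sketch of the assignment step described above: given per-class embedding statistics and a fixed set of prototypes, the label-to-prototype mapping is recomputed by bipartite matching (the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`). The negative-cosine cost is our assumption, not necessarily the paper's exact objective.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reassign_labels(class_means: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """class_means, prototypes: (C, dim) arrays. Returns perm[c] = prototype for class c."""
    cm = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    pr = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = -cm @ pr.T                            # maximize cosine similarity
    rows, cols = linear_sum_assignment(cost)     # optimal bipartite matching
    return cols[np.argsort(rows)]
```

In a two-step loop, this call would alternate with ordinary gradient-descent epochs on the network parameters.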
Effective learning in neuronal networks requires the adaptation of individual synapses given their relative contribution to solving a task. However, physical neuronal systems, whether biological or artificial, are constrained by spatio-temporal locality. How such networks can perform efficient credit assignment remains, to a large extent, an open question. In Machine Learning, the answer is almost universally given by the error backpropagation algorithm, through both space (BP) and time (BPTT). However, BP(TT) is well known to rely on biologically implausible assumptions, in particular with respect to spatio-temporal (non-)locality, while forward-propagation models such as real-time recurrent learning (RTRL) suffer from prohibitive memory constraints. We introduce Generalized Latent Equilibrium (GLE), a computational framework for fully local spatio-temporal credit assignment in physical, dynamical networks of neurons. We start by defining an energy based on neuron-local mismatches, from which we derive both neuronal dynamics via stationarity and parameter dynamics via gradient descent. The resulting dynamics can be interpreted as a real-time, biologically plausible approximation of BPTT in deep cortical networks with continuous-time neuronal dynamics and continuously active, local synaptic plasticity. In particular, GLE exploits the ability of biological neurons to phase-shift their output rate with respect to their membrane potential, which is essential in both directions of information propagation. For the forward computation, it enables the mapping of time-continuous inputs to neuronal space, performing an effective spatio-temporal convolution. For the backward computation, it permits the temporal inversion of feedback signals, which consequently approximate the adjoint states necessary for useful parameter updates.
https://arxiv.org/abs/2403.16933
Given a dataset of expert demonstrations, inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal. This work proposes a model-free algorithm to solve the entropy-regularized IRL problem. In particular, we employ a stochastic gradient descent update for the reward and a stochastic soft policy iteration update for the policy. Assuming access to a generative model, we prove that our algorithm is guaranteed to recover a reward for which the expert is $\varepsilon$-optimal using $\mathcal{O}(1/\varepsilon^{2})$ samples of the Markov decision process (MDP). Furthermore, with $\mathcal{O}(1/\varepsilon^{4})$ samples we prove that the optimal policy corresponding to the recovered reward is $\varepsilon$-close to the expert policy in total variation distance.
https://arxiv.org/abs/2403.16829
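A hedged tabular sketch of the alternating scheme described above, with soft value iteration standing in for stochastic soft policy iteration and an exact occupancy computation standing in for sampling. The gradient of the entropy-regularized log-likelihood with respect to a tabular reward is the gap between the expert and current occupancy measures.

```python
import numpy as np

def soft_policy(P, r, gamma=0.9, tau=1.0, iters=200):
    """P: (S, A, S) transitions, r: (S, A) reward. Returns the soft-optimal policy."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V                        # soft Bellman backup
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / tau)
    return pi / pi.sum(axis=1, keepdims=True)

def occupancy(P, pi, gamma=0.9, horizon=200):
    d = np.ones(pi.shape[0]) / pi.shape[0]           # uniform initial distribution
    occ = np.zeros_like(pi)
    for t in range(horizon):
        occ += (gamma ** t) * d[:, None] * pi
        d = np.einsum('s,sa,sau->u', d, pi, P)       # propagate the state distribution
    return (1 - gamma) * occ

def irl_reward_step(P, r, expert_occ, lr=0.1):
    grad = expert_occ - occupancy(P, soft_policy(P, r))
    return r + lr * grad                             # ascent on the reward likelihood
```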
Multi-task learning solves multiple correlated tasks. However, conflicts may exist between them. In such circumstances, a single solution can rarely optimize all the tasks, leading to performance trade-offs. To arrive at a set of optimized yet well-distributed models that collectively embody different trade-offs in one algorithmic pass, this paper proposes to view Pareto multi-task learning through the lens of multi-task optimization. Multi-task learning is first cast as a multi-objective optimization problem, which is then decomposed into a diverse set of unconstrained scalar-valued subproblems. These subproblems are solved jointly using a novel multi-task gradient descent method, whose uniqueness lies in the iterative transfer of model parameters among the subproblems during the course of optimization. A theorem proving faster convergence through the inclusion of such transfers is presented. We investigate the proposed multi-task learning with multi-task optimization for solving various problem settings including image classification, scene understanding, and multi-target regression. Comprehensive experiments confirm that the proposed method significantly advances the state-of-the-art in discovering sets of Pareto-optimized models. Notably, on the large image dataset we tested on, namely NYUv2, the hypervolume convergence achieved by our method was found to be nearly two times faster than the next-best among the state-of-the-art.
https://arxiv.org/abs/2403.16162
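A hedged sketch of the decomposition described above: a set of weight vectors defines scalar-valued subproblems, each solved by gradient descent on its own model copy, with periodic parameter transfer between neighbouring subproblems. The simple blending transfer below is our illustration; the paper's transfer rule may differ.

```python
import torch

def pareto_mtl(make_model, task_losses, weights, steps=1000, lr=1e-2, transfer_every=50):
    """task_losses: list of callables L(model) -> scalar; weights: list of weight tuples."""
    models = [make_model() for _ in weights]               # one model per subproblem
    opts = [torch.optim.SGD(m.parameters(), lr=lr) for m in models]
    for step in range(steps):
        for m, opt, w in zip(models, opts, weights):
            opt.zero_grad()
            sum(wi * L(m) for wi, L in zip(w, task_losses)).backward()
            opt.step()
        if step % transfer_every == 0:                     # inter-subproblem transfer
            with torch.no_grad():
                for i in range(1, len(models)):
                    for p, q in zip(models[i].parameters(), models[i - 1].parameters()):
                        p.lerp_(q, 0.1)                    # pull 10% toward the neighbour
    return models                                          # one model per trade-off
```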
QR codes, prevalent in daily applications, lack visual appeal due to their conventional black-and-white design. Integrating aesthetics while maintaining scannability poses a challenge. In this paper, we introduce a novel diffusion-model-based aesthetic QR code generation pipeline, utilizing a pre-trained ControlNet and guided iterative refinement via a novel classifier guidance (SRG) based on the proposed Scanning-Robust Loss (SRL), tailored to QR code mechanisms, which ensures both aesthetics and scannability. To further improve scannability while preserving aesthetics, we propose a two-stage pipeline with Scanning-Robust Perceptual Guidance (SRPG). Moreover, the scannability of the generated QR code can be further enhanced with the proposed Scanning-Robust Projected Gradient Descent (SRPGD) post-processing technique, which is based on SRL and has proven convergence. Extensive quantitative, qualitative, and subjective experiments demonstrate that the proposed approach can generate diverse aesthetic QR codes with flexibility in detail. In addition, our pipelines outperform existing models in Scanning Success Rate (SSR), achieving 86.67% (+40%) with comparable aesthetic scores, and the pipeline combined with SRPGD further achieves 96.67% (+50%). Our code will be available at this https URL.
https://arxiv.org/abs/2403.15878
This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We desire the trained policy to ensure that the agent satisfies specific task objectives, expressed in discrete-time Signal Temporal Logic (DT-STL). One advantage of reformulating a task via formal frameworks like DT-STL is that they permit quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback controllers, and we assume a feedforward neural network for learning these feedback controllers. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent's task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and naïve gradient-descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout, or gradient sampling. We show that the existing smooth semantics for robustness are inefficient with respect to gradient computation when the specification becomes complex. To address this challenge, we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation over a complex specification. We show that our control synthesis methodology can help stochastic gradient descent converge with fewer numerical issues, enabling scalable backpropagation over long time horizons and trajectories over high-dimensional state spaces.
https://arxiv.org/abs/2403.15826
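For context on the smoothness issue raised above: STL robustness composes hard min/max operators over time, which pass gradient through only one timestep. A common smooth relaxation, illustrated below with log-sum-exp (the paper's proposed under-approximating semantics is not reproduced here), spreads gradient across the whole horizon; note that this smooth min is itself a lower bound on the hard min.

```python
import torch

def smooth_max(x, temp=10.0):
    return torch.logsumexp(temp * x, dim=-1) / temp   # -> hard max as temp -> inf

def smooth_min(x, temp=10.0):
    return -smooth_max(-x, temp)                      # lower-bounds the hard min

def robustness_always(signal, threshold, temp=10.0):
    """Smoothed robustness of G(signal > threshold): min_t (signal_t - threshold)."""
    return smooth_min(signal - threshold, temp)

traj = torch.linspace(0.5, 2.0, 50, requires_grad=True)
rho = robustness_always(traj, threshold=0.3)
rho.backward()            # every timestep now receives a nonzero gradient
```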
Vision tasks are characterized by the properties of locality and translation invariance. The superior performance of convolutional neural networks (CNNs) on these tasks is widely attributed to the inductive bias of locality and weight sharing baked into their architecture. Existing attempts to quantify the statistical benefits of these biases in CNNs over locally connected convolutional neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of the following categories: either they disregard the optimizer and only provide uniform convergence upper bounds with no separating lower bounds, or they consider simplistic tasks that do not truly mirror the locality and translation invariance found in real-world vision tasks. To address these deficiencies, we introduce the Dynamic Signal Distribution (DSD) classification task, which models an image as consisting of $k$ patches, each of dimension $d$, where the label is determined by a $d$-sparse signal vector that can freely appear in any one of the $k$ patches. On this task, for any orthogonally equivariant algorithm like gradient descent, we prove that CNNs require $\tilde{O}(k+d)$ samples, whereas LCNs require $\Omega(kd)$ samples, establishing the statistical advantages of weight sharing in translation-invariant tasks. Furthermore, LCNs need $\tilde{O}(k(k+d))$ samples, compared to $\Omega(k^2d)$ samples for FCNs, showcasing the benefits of locality in local tasks. Additionally, we develop information-theoretic tools for analyzing randomized algorithms, which may be of interest for statistical research.
https://arxiv.org/abs/2403.15707
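A hedged sketch of how a DSD-style sample might be generated, as we read the task definition above: the image consists of $k$ patches of dimension $d$; a fixed signal vector $v \in \mathbb{R}^d$ is planted in one uniformly chosen patch, so the full $kd$-dimensional image carries a $d$-sparse signal; and the label is the sign of the planted coefficient. The noise model is our assumption.

```python
import numpy as np

def dsd_sample(v, k, rng, noise=0.1):
    """v: (d,) signal vector. Returns a (k*d,) image and a +/-1 label."""
    d = v.shape[0]
    x = noise * rng.standard_normal((k, d))     # background noise in every patch
    label = rng.choice([-1, 1])
    patch = rng.integers(k)                     # the signal may appear in any patch
    x[patch] += label * v
    return x.reshape(k * d), label

rng = np.random.default_rng(0)
v = rng.standard_normal(16)   # d = 16; planted in one patch, v is d-sparse in R^{k*d}
x, y = dsd_sample(v, k=8, rng=rng)
```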
Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
https://arxiv.org/abs/2403.14339
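A hedged sketch of the core update described above, assuming standard PyTorch training loops: gradient ascent on the forget batch and gradient descent on the retain batch in a single step, with the ascent scaled adaptively (our stand-in for the paper's adaptive rule) so the forget term cannot dominate.

```python
import torch

def nabla_tau_epoch(model, forget_loader, retain_loader, loss_fn, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for (xf, yf), (xr, yr) in zip(forget_loader, retain_loader):
        opt.zero_grad()
        loss_forget = loss_fn(model(xf), yf)
        loss_retain = loss_fn(model(xr), yr)
        # Adaptive ascent coefficient: shrink it as the forget loss grows.
        scale = (loss_retain / (loss_forget + 1e-8)).detach().clamp(max=1.0)
        (loss_retain - scale * loss_forget).backward()  # descent on retain, ascent on forget
        opt.step()
```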
Large language models (LLMs) have marked a pivotal moment in the fields of machine learning and deep learning. Recently, their capability for query planning has been investigated, for both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLMs. As a critical (perhaps the most important) step that significantly impacts the execution performance of a query plan, such analyses and attempts should not be missed. From another perspective, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they depend on manually created rules to complete the query plan rewrite/transformation. Given that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer in a similar way would be significantly time-consuming, since we would have to enumerate as many multi-modal optimization rules as possible, a problem that has not been well addressed to date. In this paper, we investigate the query optimization ability of LLMs and use an LLM to design LaPuda, a novel LLM- and policy-based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda needs only a few abstract policies to guide the LLM in the optimization, saving much time and human effort. Furthermore, to prevent the LLM from making mistakes or performing negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, so that the optimization is kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods achieve 1-3x higher execution speed than those generated by the baselines.
https://arxiv.org/abs/2403.13597
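A hedged sketch of guided cost descent as we read the abstract: the LLM proposes a rewritten plan under a few abstract policies, and a rewrite is accepted only if the estimated cost decreases, so the search can never move uphill. `llm_rewrite` and `estimate_cost` are placeholder callables, not a real API.

```python
def guided_cost_descent(plan, policies, llm_rewrite, estimate_cost, max_steps=10):
    cost = estimate_cost(plan)
    for _ in range(max_steps):
        candidate = llm_rewrite(plan, policies)   # LLM-proposed transformation
        candidate_cost = estimate_cost(candidate)
        if candidate_cost >= cost:
            continue                              # reject negative optimization
        plan, cost = candidate, candidate_cost    # keep only cost-decreasing rewrites
    return plan, cost
```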
Since their introduction, fuzzy sets and systems have become an important area of research known for its versatility in modelling, knowledge representation and reasoning, and, increasingly, for its potential within the context of explainable AI. While the applications of fuzzy systems are diverse, there has been comparatively little advancement in their design from a machine learning perspective. In other words, while representations such as neural networks have benefited from a boom in learning capability driven by increases in computational performance combined with advances in their training mechanisms and available tools, in particular gradient descent, the impact on fuzzy system design has been limited. In this paper, we discuss gradient-descent-based optimisation of fuzzy systems, focussing in particular on automatic differentiation, which is crucial to neural network learning, with a view to freeing fuzzy system designers from intricate derivative computations and allowing them to focus more on the functional and explainability aspects of their design. As a starting point, we present a use case in FuzzyR which demonstrates how current fuzzy inference system implementations can be adjusted to leverage the powerful features of automatic differentiation toolsets, and we discuss its potential for the future of fuzzy system design.
https://arxiv.org/abs/2403.12308
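To make the point above concrete, here is a minimal sketch (not the FuzzyR API) of a Takagi-Sugeno-style fuzzy inference system written entirely with differentiable operations: Gaussian memberships, a product t-norm, and weighted-average defuzzification. Autograd then supplies all derivatives, so rule parameters can be tuned by gradient descent without hand-derived formulas.

```python
import torch

class DiffFIS(torch.nn.Module):
    def __init__(self, n_rules, n_inputs):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(n_rules, n_inputs))
        self.widths = torch.nn.Parameter(torch.ones(n_rules, n_inputs))
        self.consequents = torch.nn.Parameter(torch.randn(n_rules))

    def forward(self, x):                          # x: (batch, n_inputs)
        mu = torch.exp(-((x[:, None, :] - self.centers) / self.widths) ** 2)
        firing = mu.prod(dim=-1)                   # product t-norm per rule
        w = firing / firing.sum(dim=-1, keepdim=True)
        return (w * self.consequents).sum(dim=-1)  # weighted-average defuzzification

fis = DiffFIS(n_rules=5, n_inputs=2)
x = torch.randn(8, 2)
loss = ((fis(x) - torch.sin(x).sum(dim=1)) ** 2).mean()
loss.backward()                                    # all derivatives via autograd
```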
It is challenging for autonomous control systems to perform complex tasks in the presence of latent risks. Motivated by this challenge, this paper proposes an integrated framework that involves Large Language Models (LLMs), stochastic gradient descent (SGD), and optimization-based control. In the first phase, the proposed framework breaks down complex tasks into a sequence of smaller subtasks, whose specifications account for contextual information and latent risks. In the second phase, these subtasks and their parameters are refined through a dual process involving LLMs and SGD: LLMs are used to generate rough guesses and failure explanations, and SGD is used to fine-tune parameters. The proposed framework is tested using simulated case studies of robots and vehicles. The experiments demonstrate that the proposed framework can mediate actions based on the context and latent risks and can learn complex behaviors efficiently.
https://arxiv.org/abs/2403.11863
Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at this https URL.
https://arxiv.org/abs/2403.11162
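A hedged sketch of the recovery step described above: projected gradient ascent on the removed part of an image to maximize a divergence between the latents of the pretrained and fine-tuned models. A squared distance between latent outputs stands in for the Monte Carlo KL estimate; both models are placeholders.

```python
import torch

def cgi_recover(x_partial, mask, model_pre, model_ft, steps=100, step_size=0.01):
    """mask: 1 where information was removed and may be optimized, 0 elsewhere."""
    x = x_partial.clone().requires_grad_(True)
    for _ in range(steps):
        divergence = ((model_pre(x) - model_ft(x)) ** 2).sum()  # KL surrogate
        grad, = torch.autograd.grad(divergence, x)
        with torch.no_grad():
            x += step_size * mask * grad.sign()   # PGD: only masked pixels move
            x.clamp_(0.0, 1.0)                    # project onto the valid image range
    return x.detach()
```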
We present DPPE, a dense pose estimation algorithm that operates over a Plenoxels environment. Recent advances in neural radiance field techniques have shown that they are powerful tools for environment representation, and more recent neural rendering algorithms have significantly improved both training duration and rendering speed. Plenoxels introduced a fully differentiable radiance field technique that uses Plenoptic volume elements contained in voxels for rendering, offering reduced training times and better rendering accuracy while also eliminating the neural network component. In this work, we introduce a 6-DoF, monocular, RGB-only pose estimation procedure for Plenoxels, which seeks to recover the ground-truth camera pose after a perturbation. We employ a variation on classical template matching techniques, using stochastic gradient descent to optimize the pose by minimizing errors in re-rendering. In particular, we examine an approach that takes advantage of the rapid rendering speed of Plenoxels to numerically approximate part of the pose gradient, using a central differencing technique. We show that such methods are effective for pose estimation. Finally, we perform ablations over key components of the problem space, with a particular focus on image subsampling and Plenoxel grid resolution. Project website: this https URL
https://arxiv.org/abs/2403.10773
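A hedged sketch of the central-differencing idea described above: the photometric error is re-rendered at slightly perturbed poses, giving a numerical approximation of the pose gradient. `render` is a placeholder for the fast Plenoxels renderer, and the 6-vector pose parameterization is our assumption.

```python
import numpy as np

def pose_gradient(pose, target, render, eps=1e-3):
    """pose: (6,) parameters. Returns d(photometric error)/d(pose) by central differences."""
    def error(p):
        return float(((render(p) - target) ** 2).mean())
    grad = np.zeros_like(pose)
    for i in range(len(pose)):
        dp = np.zeros_like(pose)
        dp[i] = eps
        grad[i] = (error(pose + dp) - error(pose - dp)) / (2 * eps)  # central difference
    return grad
```

Each estimate costs two renders per pose coordinate, which is why the rapid rendering speed of Plenoxels matters here.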
Optimization-based approaches are widely employed to generate optimal robot motions while considering various constraints, such as robot dynamics, collision avoidance, and physical limitations. It is crucial to solve such optimization problems efficiently in practice, yet achieving rapid computation remains a great challenge for optimization-based approaches with nonlinear constraints. In this paper, we propose a geometric projector for dynamic obstacle avoidance based on the velocity obstacle (GeoPro-VO), leveraging the projection feature of the velocity cone set represented by the VO. Furthermore, with the proposed GeoPro-VO and the augmented Lagrangian spectral projected gradient descent (ALSPG) algorithm, we transform an initial mixed-integer nonlinear programming problem (MINLP), in the form of constrained model predictive control (MPC), into a sub-optimization problem and solve it efficiently. Numerical simulations are conducted to validate the fast computing speed of our approach and its capability for reliable dynamic obstacle avoidance.
https://arxiv.org/abs/2403.10043
Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while being respectful of privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is \emph{Local Training}, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative \emph{Scaffnew} algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into \emph{Scaffnew} to further enhance communication efficiency. Extensive experiments, using the popular TopK compressor and quantization, demonstrate its prowess in substantially reducing communication overheads in heterogeneous settings.
https://arxiv.org/abs/2403.09904
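For reference, a minimal sketch of the TopK compressor mentioned above: only the k largest-magnitude coordinates of a model update are kept (and would be communicated), with the rest zeroed out.

```python
import torch

def topk_compress(delta: torch.Tensor, k: int) -> torch.Tensor:
    flat = delta.flatten()
    _, idx = flat.abs().topk(k)       # indices of the k largest-magnitude entries
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]              # keep those k coordinates, zero the rest
    return out.view_as(delta)
```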
Although Graph Neural Networks (GNNs) have exhibited a powerful ability to gather graph-structured information from neighboring nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention. Among prior GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques for improving the robustness of GNNs. However, randomly dropping edges often bypasses critical edges, consequently weakening the effectiveness of message passing. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor to guide the removal of edges and can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor uses the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on six graph benchmark datasets demonstrate that ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, showing improved generalization and robustness.
https://arxiv.org/abs/2403.09171
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that gradient descent induces an implicit regularization on the Frobenius norm of the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds upon the reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weight configurations, given proper alignment of the weight matrices' singular spaces with the data features at initialization.
https://arxiv.org/abs/2403.08699
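A hedged sketch of an empirical check suggested by the result above: train a one-layer softmax attention classifier with separately parameterized key and query matrices using small-step gradient descent (a proxy for gradient flow) on the exponential loss, and track the nuclear norm of $W_K^\top W_Q$. The data, the readout, and the use of the first token as the query are all our illustrative choices.

```python
import torch

d, n = 8, 32
WK = torch.randn(d, d, requires_grad=True)     # key weights, trained separately
WQ = torch.randn(d, d, requires_grad=True)     # query weights, trained separately
v = torch.randn(d)                             # fixed value/readout direction
X = torch.randn(n, 5, d)                       # n sequences of 5 tokens
y = torch.sign(torch.randn(n))                 # binary labels

for step in range(2001):
    # Attention scores of all tokens against the first token's query.
    scores = torch.softmax(X @ WK.T @ WQ @ X[:, 0].unsqueeze(-1) / d ** 0.5, dim=1)
    out = (scores.transpose(1, 2) @ X).squeeze(1) @ v   # attended readout
    loss = torch.exp(-y * out).mean()                   # exponential loss
    loss.backward()
    with torch.no_grad():
        for W in (WK, WQ):
            W -= 1e-2 * W.grad
            W.grad = None
    if step % 500 == 0:
        print(step, torch.linalg.matrix_norm(WK.T @ WQ, ord='nuc').item())
```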