Optimization-based approaches are widely employed to generate optimal robot motions while considering various constraints, such as robot dynamics, collision avoidance, and physical limitations. It is crucial to efficiently solve the optimization problems in practice, yet achieving rapid computations remains a great challenge for optimization-based approaches with nonlinear constraints. In this paper, we propose a geometric projector for dynamic obstacle avoidance based on velocity obstacle (GeoPro-VO) by leveraging the projection feature of the velocity cone set represented by VO. Furthermore, with the proposed GeoPro-VO and the augmented Lagrangian spectral projected gradient descent (ALSPG) algorithm, we transform an initial mixed integer nonlinear programming problem (MINLP) in the form of constrained model predictive control (MPC) into a sub-optimization problem and solve it efficiently. Numerical simulations are conducted to validate the fast computing speed of our approach and its capability for reliable dynamic obstacle avoidance.
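As a rough illustration of the projection feature exploited here (not the authors' GeoPro-VO implementation; the 2D cone construction and function name below are my own simplification), a candidate velocity that falls inside the velocity-obstacle cone can be projected onto the nearer boundary ray of the cone:

```python
import numpy as np

def project_out_of_vo(v, p_rel, r):
    """Project a candidate 2D velocity v out of a velocity-obstacle cone.

    The cone contains all velocities whose ray from the origin hits the disc of
    radius r centred at the relative position p_rel (assumes r < |p_rel|).
    If v is already outside the cone it is returned unchanged; otherwise it is
    projected onto the nearer boundary ray of the cone.
    """
    d = np.linalg.norm(p_rel)
    axis = p_rel / d
    half_angle = np.arcsin(r / d)                # cone half-angle
    v_norm = np.linalg.norm(v)
    if v_norm < 1e-12:
        return v
    cos_to_axis = np.dot(v, axis) / v_norm
    if cos_to_axis <= np.cos(half_angle):        # already outside the cone
        return v
    # Boundary rays: the cone axis rotated by +/- half_angle.
    c, s = np.cos(half_angle), np.sin(half_angle)
    left = np.array([c * axis[0] - s * axis[1], s * axis[0] + c * axis[1]])
    right = np.array([c * axis[0] + s * axis[1], -s * axis[0] + c * axis[1]])
    # Euclidean projection onto the nearer boundary ray.
    cand = [max(np.dot(v, b), 0.0) * b for b in (left, right)]
    return min(cand, key=lambda u: np.linalg.norm(u - v))

# Example: obstacle straight ahead, candidate velocity pointing at it.
print(project_out_of_vo(np.array([1.0, 0.0]), np.array([3.0, 0.0]), 1.0))
```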
https://arxiv.org/abs/2403.10043
Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while being respectful of privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is \emph{Local Training}, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative \emph{Scaffnew} algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into \emph{Scaffnew} to further enhance communication efficiency. Extensive experiments, using the popular TopK compressor and quantization, demonstrate its prowess in substantially reducing communication overheads in heterogeneous settings.
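For reference, a minimal sketch of the TopK compressor mentioned above (a generic version, not the FedComLoc implementation): it keeps only the k largest-magnitude entries of a local update and zeroes the rest, so only a sparse message needs to be communicated.

```python
import numpy as np

def topk_compress(x, k):
    """Keep the k entries of x with largest magnitude, zero out the rest."""
    if k >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# A client would compress its local model update before sending it to the server:
update = np.array([0.5, -0.02, 1.3, 0.0, -0.9, 0.1])
print(topk_compress(update, k=2))   # only the two largest-magnitude entries survive
```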
https://arxiv.org/abs/2403.09904
Although Graph Neural Networks (GNNs) have exhibited the powerful ability to gather graph-structured information from neighborhood nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention. Among prior GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques to improve the robustness of GNNs. However, randomly dropping edges often results in bypassing critical edges, consequently weakening the effectiveness of message passing. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor guiding the removal of edges, which can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor utilizes the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on six graph benchmark datasets demonstrate that the proposed ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, showing improved generalization and robustness.
https://arxiv.org/abs/2403.09171
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that gradient descent induces an implicit regularization on the Frobenius norm of the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds upon the reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weight configurations given proper alignment of the weight matrices' singular spaces with the data features at initialization.
https://arxiv.org/abs/2403.08699
Existing generative adversarial network (GAN) based conditional image generative models typically produce fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive, task-specific, or struggles to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is attacking the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
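A schematic sketch of the plug-in PGD-style idea, under assumed names (`G` for the frozen generator, `score_fn` for the objective encoding the attack direction); this is a simplified reading of the abstract, not the paper's exact procedure:

```python
import torch

def diversify(G, cond, score_fn, eps=0.03, alpha=0.01, steps=10):
    """PGD-like perturbation of the condition input of a frozen generator G.

    cond:     conditional input tensor (e.g., a masked image or style reference)
    score_fn: scalar objective steering the output (e.g., similarity to a
              reference text/image embedding); we ascend it by perturbing cond.
    The perturbation stays inside an L_inf ball of radius eps, so the condition
    is only changed "microscopically" and G itself is never updated.
    """
    delta = torch.zeros_like(cond, requires_grad=True)
    for _ in range(steps):
        out = G(cond + delta)
        loss = score_fn(out)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient-sign ascent step
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            delta.grad.zero_()
    return G(cond + delta).detach()
```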
https://arxiv.org/abs/2403.08294
Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.
https://arxiv.org/abs/2403.08081
Entezari et al. (2022) conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances. This means that two independent solutions can be connected by a linear path with low loss, given one of them is appropriately permuted. However, current methods to test this theory often fail to eliminate loss barriers between two independent solutions (Ainsworth et al., 2022; Benzing et al., 2022). In this work, we conjecture that a more relaxed claim holds: the SGD solution set is a star domain that contains a star model that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm that finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates on Bayesian Model Averaging over the obtained star domain. Code is available at this https URL.
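The star-domain claim is about linear connectivity in weight space: interpolating between the star model and any other solution should keep the loss low along the whole segment. A generic sketch of such a barrier probe (not the Starlight algorithm itself, which finds the star model):

```python
import copy
import torch

def loss_along_segment(model_a, model_b, loss_on_data, n_points=11):
    """Evaluate the loss at evenly spaced points on the line segment between the
    weights of model_a and model_b (a standard linear-connectivity probe).
    loss_on_data(model) -> scalar loss on a fixed evaluation set."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for t in torch.linspace(0.0, 1.0, n_points):
        mixed = {k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a}
        probe.load_state_dict(mixed)
        losses.append(float(loss_on_data(probe)))
    # A large bump above max(losses[0], losses[-1]) indicates a loss barrier.
    return losses
```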
https://arxiv.org/abs/2403.07968
Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in many scientific and business-intelligence applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As modern datasets continue to increase in size and complexity, there is a growing need for new statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle large prediction problems. This work presents the Bayesian Neural Field (BayesNF), a domain-general statistical model for inferring rich probability distributions over a spatiotemporal domain, which can be used for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a novel deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust uncertainty quantification. By defining the prior through a sequence of smooth differentiable transforms, posterior inference is conducted on large-scale data using variationally learned surrogates trained via stochastic gradient descent. We evaluate BayesNF against prominent statistical and machine-learning baselines, showing considerable improvements on diverse prediction problems from climate and public health datasets that contain tens to hundreds of thousands of measurements. The paper is accompanied with an open-source software package (this https URL) that is easy-to-use and compatible with modern GPU and TPU accelerators on the JAX machine learning platform.
https://arxiv.org/abs/2403.07657
Authorship Attribution is the task of creating an appropriate characterization of text that captures the authors' writing style to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly crucial in various security and plagiarism detection fields. Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic features and sentence structure. Moreover, existing systems are not scalable when the number of authors increases, and performance drops when there are few samples per author. In this paper, we propose the use of Average-Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problem of complex linguistic features extraction and scalability for authorship attribution in Bangla Literature (AABL). We analyze the effect of different tokenization, such as word, sub-word, and character level tokenization, and demonstrate the effectiveness of these tokenizations in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16) containing 17,966 sample texts and 13.4+ million words to solve the standard dataset scarcity problem and release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our developed BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy in the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better even with an increasing number of authors, and performance remains steady despite few training samples.
https://arxiv.org/abs/2403.05519
Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.
https://arxiv.org/abs/2403.03867
Second-order methods can converge much faster than first-order methods by incorporating second-order derivatives or statistics, but they are far less prevalent in deep learning due to their computational inefficiency. To handle this, many of the existing solutions focus on reducing the size of the matrix to be inverted. However, the inverse operation still needs to be performed in each iteration. In this paper, we present a fast natural gradient descent (FNGD) method, which only requires computing the inverse during the first epoch. Firstly, we reformulate the gradient preconditioning formula in the natural gradient descent (NGD) as a weighted sum of per-sample gradients using the Sherman-Morrison-Woodbury formula. Building upon this, to avoid the iterative inverse operation involved in computing coefficients, the weighted coefficients are shared across epochs without affecting the empirical performance. FNGD approximates the NGD as a fixed-coefficient weighted sum, akin to the average sum in first-order methods. Consequently, the computational complexity of FNGD can approach that of first-order methods. To demonstrate the efficiency of the proposed FNGD, we perform empirical evaluations on image classification and machine translation tasks. For training ResNet-18 on the CIFAR-100 dataset, FNGD can achieve a speedup of 2.05$\times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU points while requiring almost the same training time.
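The reformulation can be illustrated on a damped NGD-style preconditioner with per-sample gradients stacked in a matrix G: by the Sherman-Morrison-Woodbury identity, inverting a d x d curvature matrix is equivalent to solving a small n x n system, so the preconditioned update becomes a weighted sum of per-sample gradients. A numpy sketch of that algebra (my own illustration, not the FNGD code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 8, 100, 1e-2                   # n samples, d parameters, damping
G = rng.normal(size=(n, d))                # row i = gradient of sample i

# Damped natural-gradient style preconditioning in parameter space:
F = G.T @ G / n + lam * np.eye(d)          # d x d curvature estimate
g_bar = G.mean(axis=0)                     # average gradient
direct = np.linalg.solve(F, g_bar)         # O(d^3) solve in parameter space

# Sherman-Morrison-Woodbury: the same update is a weighted sum of per-sample
# gradients, with weights obtained from a small n x n system.
w = np.linalg.solve(n * lam * np.eye(n) + G @ G.T, np.ones(n))
weighted = G.T @ w

print(np.allclose(direct, weighted))       # True: identical update direction
```

FNGD's further step, per the abstract, is to keep these weighted coefficients fixed after the first epoch instead of recomputing them, which is what brings the per-iteration cost close to that of a first-order method.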
https://arxiv.org/abs/2403.03473
Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second-order ones for the case of linear regression. In this work, we study whether Transformers can perform higher order optimization methods, beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with a number of additional layers that is only logarithmic in the error. As a by-product we demonstrate the ability of even linear attention-only Transformers to implement a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest the ability of the Transformer architecture to implement complex algorithms, beyond gradient descent.
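For context, Newton's iteration for matrix inversion is presumably the classical Newton-Schulz update $X_{k+1} = X_k(2I - AX_k)$; a quick numerical illustration of that iteration (independent of the Transformer construction in the paper):

```python
import numpy as np

def newton_inverse_step(A, X):
    """One Newton-Schulz iteration for approximating A^{-1}."""
    return X @ (2 * np.eye(A.shape[0]) - A @ X)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)      # well-conditioned test matrix
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # standard initialization
for _ in range(20):
    X = newton_inverse_step(A, X)
print(np.allclose(X, np.linalg.inv(A)))          # converges quadratically to A^{-1}
```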
https://arxiv.org/abs/2403.03183
This paper proposes a scalable distributed policy gradient method and proves its convergence to a near-optimal solution in multi-agent linear quadratic networked systems. The agents engage within a specified network under local communication constraints, implying that each agent can only exchange information with a limited number of neighboring agents. On the underlying graph of the network, each agent implements its control input depending on its nearby neighbors' states in the linear quadratic control setting. We show that it is possible to approximate the exact gradient using only local information. Compared with the centralized optimal controller, the performance gap decreases to zero exponentially as the communication and control ranges increase. We also demonstrate how increasing the communication range enhances system stability in the gradient descent process, thereby elucidating a critical trade-off. The simulation results verify our theoretical findings.
https://arxiv.org/abs/2403.03055
With the great popularity of Graph Neural Networks (GNNs), their robustness to adversarial topology attacks has received significant attention. Although many attack methods have been proposed, they mainly focus on fixed-budget attacks, aiming at finding the most adversarial perturbations within a fixed budget for the target node. However, considering the varied robustness of each node, there is an inevitable dilemma caused by the fixed budget, i.e., no successful perturbation is found when the budget is relatively small, while if it is too large, the resulting redundant perturbations will hurt invisibility. To break this dilemma, we propose a new type of topology attack, named minimum-budget topology attack, aiming to adaptively find the minimum perturbation sufficient for a successful attack on each node. To this end, we propose an attack model, named MiBTack, based on a dynamic projected gradient descent algorithm, which can effectively solve the resulting non-convex constrained optimization over discrete topology. Extensive results on three GNNs and four real-world datasets show that MiBTack can successfully cause all target nodes to be misclassified with the minimum perturbation edges. Moreover, the obtained minimum budget can be used to measure node robustness, so we can explore the relationships of robustness, topology, and uncertainty for nodes, which is beyond what the current fixed-budget topology attacks can offer.
https://arxiv.org/abs/2403.02723
Recent developments have underscored the critical role of \textit{differential privacy} (DP) in safeguarding individual data for training machine learning models. However, integrating DP oftentimes incurs significant model performance degradation due to the perturbation introduced into the training process, presenting a formidable challenge in the {differentially private machine learning} (DPML) field. To this end, several mitigative efforts have been proposed, typically revolving around formulating new DPML algorithms or relaxing DP definitions to harmonize with distinct contexts. In spite of these initiatives, the diminishment induced by DP on models, particularly large-scale models, remains substantial and thus, necessitates an innovative solution that adeptly circumnavigates the consequential impairment of model utility. In response, we introduce DPAdapter, a pioneering technique designed to amplify the model performance of DPML algorithms by enhancing parameter robustness. The fundamental intuition behind this strategy is that models with robust parameters are inherently more resistant to the noise introduced by DP, thereby retaining better performance despite the perturbations. DPAdapter modifies and enhances the sharpness-aware minimization (SAM) technique, utilizing a two-batch strategy to provide a more accurate perturbation estimate and an efficient gradient descent, thereby improving parameter robustness against noise. Notably, DPAdapter can act as a plug-and-play component and be combined with existing DPML algorithms to further improve their performance. Our experiments show that DPAdapter vastly enhances state-of-the-art DPML algorithms, increasing average accuracy from 72.92\% to 77.09\% with a privacy budget of $\epsilon=4$.
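A generic sketch of a two-batch SAM-style step, in the spirit described above (my own simplification under assumed names, not the DPAdapter code): one mini-batch estimates the adversarial weight perturbation, the other supplies the descent gradient at the perturbed weights.

```python
import torch

def two_batch_sam_step(model, loss_fn, batch_a, batch_b, opt, rho=0.05):
    """One SAM-style step using two mini-batches.

    batch_a estimates the adversarial weight perturbation (sharpness direction),
    batch_b supplies the gradient evaluated at the perturbed weights.
    """
    opt.zero_grad()
    # 1) Perturbation estimate from batch_a.
    loss_fn(model, batch_a).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)           # climb to the nearby "sharp" point
    opt.zero_grad()

    # 2) Descent gradient from batch_b at the perturbed weights.
    loss_fn(model, batch_b).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)           # undo the perturbation
    opt.step()                               # apply the update computed on batch_b
    opt.zero_grad()
```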
https://arxiv.org/abs/2403.02571
The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.
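The constrained objective can be viewed as a Lagrangian, maximize helpfulness subject to an expected-harm limit, optimized by alternating primal updates with dual (multiplier) updates. A toy numpy illustration of that dual-gradient-descent pattern (the actual C-DPO combines this with the DPO loss over preference data; `reward`, `cost`, and `limit` below are stand-ins):

```python
import numpy as np

# Toy primal-dual optimization: maximize reward(theta) subject to cost(theta) <= limit,
# via the Lagrangian L(theta, lam) = reward(theta) - lam * (cost(theta) - limit).
reward = lambda th: -(th - 2.0) ** 2          # helpfulness proxy, peaks at th = 2
cost = lambda th: th ** 2                     # harmfulness proxy, grows with th
limit = 1.0                                   # safety constraint: cost <= 1

theta, lam, lr, lr_dual = 0.0, 0.0, 0.05, 0.1
for _ in range(2000):
    # Primal ascent on the Lagrangian (numerical gradient for brevity).
    grad = ((reward(theta + 1e-4) - lam * cost(theta + 1e-4))
            - (reward(theta - 1e-4) - lam * cost(theta - 1e-4))) / 2e-4
    theta += lr * grad
    # Dual ascent: raise lambda when the constraint is violated, lower it otherwise.
    lam = max(0.0, lam + lr_dual * (cost(theta) - limit))

print(round(theta, 3), round(lam, 3))   # theta -> 1.0 (cost sits at the limit), lam > 0
```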
https://arxiv.org/abs/2403.02475
Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs. Findings. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. Implications. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models.
https://arxiv.org/abs/2403.02241
Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why this happens. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics. When training with gradient descent, the loss associated with infrequent words decreases slower than the loss associated with frequent ones. As most samples come from relatively infrequent words, the average loss decreases slowly with gradient descent. On the other hand, Adam and sign-based methods do not suffer from this problem and improve predictions on all classes. To establish that this behavior is indeed caused by class imbalance, we show empirically that it persists through different architectures and data types, on language transformers, vision CNNs, and linear models. We further study this phenomenon on a linear classification with cross-entropy loss, showing that heavy-tailed class imbalance leads to ill-conditioning, and that the normalization used by Adam can counteract it.
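A tiny numpy experiment in the spirit of the mechanism described above (my own construction, not the paper's setup): a linear softmax model on one-hot inputs with heavy-tailed class frequencies, trained with plain gradient descent versus a sign-based update.

```python
import numpy as np

k, steps = 50, 300
freq = 1.0 / np.arange(1, k + 1)
freq /= freq.sum()                      # heavy-tailed class frequencies (class 0 common, class k-1 rare)

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

def per_class_loss(W):
    p = softmax(W)                      # column c = predicted distribution for input e_c
    return -np.log(np.diag(p))

def train(use_sign, lr):
    W = np.zeros((k, k))                # logits for class-c input are column W[:, c]
    for _ in range(steps):
        grad = (softmax(W) - np.eye(k)) * freq   # full-batch cross-entropy gradient
        W -= lr * (np.sign(grad) if use_sign else grad)
    return per_class_loss(W)

gd, sgn = train(False, 1.0), train(True, 0.03)
print("GD   loss: frequent %.3f  rare %.3f" % (gd[0], gd[-1]))
print("sign loss: frequent %.3f  rare %.3f" % (sgn[0], sgn[-1]))
# GD reduces the rare-class loss far more slowly, while the sign-based update
# improves all classes at a similar rate.
```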
https://arxiv.org/abs/2402.19449
This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the ``classification margins.'' Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. In underparameterized regimes, we illustrate how the cross-entropy loss can lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.
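For context, an associative memory module of this kind stores input-output pairs as a sum of outer products of their embeddings and retrieves by a matrix-vector product followed by a nearest-embedding readout. A minimal generic sketch (not the paper's training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 128, 5
E_in = rng.normal(size=(n_pairs, d)) / np.sqrt(d)    # input-token embeddings
E_out = rng.normal(size=(n_pairs, d)) / np.sqrt(d)   # output-token embeddings

# Store associations as a sum of outer products, retrieve with a matrix product.
W = sum(np.outer(E_out[i], E_in[i]) for i in range(n_pairs))
scores = E_out @ (W @ E_in[2])            # score every output embedding for input 2
print(np.argmax(scores))                  # retrieves index 2, the stored partner of input 2
```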
https://arxiv.org/abs/2402.18724
Next-token prediction (NTP), the go-to training paradigm in training large language models, involves predicting the next token in a sequence. Departing from traditional one-hot classification, in NTP, multiple tokens with varying frequencies follow each given context. This work frames NTP training as cross-entropy minimization over distinct contexts, each associated with a sparse empirical probability vector across a finite vocabulary. It then addresses the following question: do gradient-based optimizers exhibit a bias towards solutions with specific structure as the NTP training loss reaches its lower bound (entropy)? Specifically, for linear NTP models trained using gradient descent (GD), we make the following contributions: Firstly, we determine NTP-separability conditions on the data, under which GD can attain its lower bound. We also demonstrate that these conditions hold under overparameterization. Secondly, we establish that the parameters of GD projected onto an appropriate data subspace converge to the unique solution of a system of linear equations, which requires the logits' difference of in-support tokens to be equal to the log-ratio of their respective probabilities. Meanwhile, on the orthogonal subspace, the parameters diverge and converge in the direction of the solution of a max-margin quadratic program, minimizing the Euclidean norm of parameters satisfying the NTP-separability conditions. Akin to prior research on implicit bias of one-hot classification, our work opens exciting avenues for future research that can lead to better understanding optimization, generalization and robustness principles of models trained with NTP.
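Written out, the condition on the data subspace reads: for a context $x$ and any two in-support tokens $y, z$ with empirical probabilities $\hat p(y\mid x)$ and $\hat p(z\mid x)$, the learned logits $\ell$ must satisfy

$$\ell_y(x) - \ell_z(x) \;=\; \log\frac{\hat p(y\mid x)}{\hat p(z\mid x)},$$

while the component of the parameters on the orthogonal subspace grows in the max-margin direction satisfying the NTP-separability conditions.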
https://arxiv.org/abs/2402.18551