With the growing computational capabilities of microcontroller units (MCUs), edge devices can now support machine learning models. However, deploying decentralised federated learning (DFL) on such devices presents key challenges, including intermittent connectivity, limited communication range, and dynamic network topologies. This paper proposes a novel framework, bilayer Gossip Decentralised Parallel Stochastic Gradient Descent (GD PSGD), designed to address these issues in resource-constrained environments. The framework incorporates a hierarchical communication structure using Distributed Kmeans (DKmeans) clustering for geographic grouping and a gossip protocol for efficient model aggregation across two layers: intra-cluster and inter-cluster. We evaluate the framework's performance against the Centralised Federated Learning (CFL) baseline using the MCUNet model on the CIFAR-10 dataset under IID and Non-IID conditions. Results demonstrate that the proposed method achieves comparable accuracy to CFL on IID datasets, requiring only 1.8 additional rounds for convergence. On Non-IID datasets, the accuracy loss remains under 8\% for moderate data imbalance. These findings highlight the framework's potential to support scalable and privacy-preserving learning on edge devices with minimal performance trade-offs.
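As a rough illustration of the aggregation layers, the sketch below shows one synchronous gossip-averaging step over flat parameter vectors, assuming a fully connected intra-cluster topology; the function and variable names are illustrative, not the paper's implementation, and inter-cluster gossip would run the same step over cluster representatives.

```python
import numpy as np

def gossip_round(params, neighbors):
    """One synchronous gossip step: every node averages its parameter
    vector with those of its currently reachable neighbors."""
    new_params = {}
    for node, w in params.items():
        peer_ws = [params[p] for p in neighbors[node]]
        new_params[node] = np.mean([w] + peer_ws, axis=0)
    return new_params

# Toy example: three MCUs in one geographic cluster, fully connected.
rng = np.random.default_rng(0)
params = {i: rng.normal(size=4) for i in range(3)}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
params = gossip_round(params, neighbors)
```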
https://arxiv.org/abs/2501.04817
Robust tensor principal component analysis (RTPCA) aims to separate the low-rank and sparse components from multi-dimensional data, making it an essential technique in the signal processing and computer vision fields. The recently emerging tensor singular value decomposition (t-SVD) has gained considerable attention for its ability to better capture the low-rank structure of tensors compared to traditional matrix SVD. However, existing methods often rely on the computationally expensive tensor nuclear norm (TNN), which limits their scalability for real-world tensors. To address this issue, we explore an efficient scaled gradient descent (SGD) approach within the t-SVD framework for the first time, and propose the RTPCA-SGD method. Theoretically, we rigorously establish the recovery guarantees of RTPCA-SGD under mild assumptions, demonstrating that with appropriate parameter selection, it achieves linear convergence to the true low-rank tensor at a constant rate, independent of the condition number. To enhance its practical applicability, we further propose a learnable self-supervised deep unfolding model, which enables effective parameter learning. Numerical experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed methods while maintaining competitive computational efficiency; in particular, they require less time than RTPCA-TNN.
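The update is easiest to picture on the matrix analogue of the problem. Below is a minimal sketch of one scaled gradient step on a factored low-rank component L = X Y^T; the paper applies this idea slice-wise in the t-SVD (Fourier) domain, and the sparse component is typically handled by a separate thresholding step omitted here. Names and step size are assumptions.

```python
import numpy as np

def scaled_gd_step(X, Y, S, M, eta=0.5):
    """One scaled gradient step on f = 0.5 * ||X @ Y.T + S - M||_F^2.
    The preconditioners (Y.T @ Y)^-1 and (X.T @ X)^-1 are what make the
    convergence rate independent of the condition number."""
    R = X @ Y.T + S - M                              # residual
    X_new = X - eta * (R @ Y) @ np.linalg.inv(Y.T @ Y)
    Y_new = Y - eta * (R.T @ X) @ np.linalg.inv(X.T @ X)
    return X_new, Y_new
```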
https://arxiv.org/abs/2501.04565
We provide a general and malleable heuristic for the air conflict resolution problem. This heuristic is based on a new neighborhood structure for searching the solution space of trajectories and flight levels. The core idea of our heuristic is to use unsupervised learning to cluster the conflict points and disperse them across various flight levels. Our first algorithm is called Cluster & Disperse; in each iteration it assigns the most problematic flights in each cluster to another flight level. In effect, we shuffle them between the flight levels until we achieve a well-balanced configuration. The Cluster & Disperse algorithm then uses any horizontal-plane conflict resolution algorithm as a subroutine to solve these well-balanced instances. Nevertheless, we develop a novel algorithm for the horizontal plane based on a similar idea: we cluster and disperse the conflict points spatially within the same flight level using gradient descent and a social force. We use a novel maneuver, based on the aviation routine of Radius to Fix legs, that makes flights travel on an arc instead of a straight path. Our algorithms can handle a high density of flights within a reasonable computation time. We put their performance in context with some notable algorithms from the literature. Being a general framework, a particular strength of Cluster & Disperse is its malleability, allowing various constraints regarding the aircraft or the environment to be integrated with ease. This is in contrast to models based, for instance, on mixed integer programming.
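For concreteness, here is a toy sketch of one Cluster & Disperse iteration, assuming k-means over conflict-point coordinates and a simple most-conflicts-in-cluster rule for choosing which flight to move; the paper's neighborhood structure and selection criteria are richer than this.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_disperse(conflict_pts, flight_of_pt, level_of, n_levels, k=3):
    """One iteration: cluster the conflict points, then move the flight
    with the most conflicts in each cluster to the least-loaded level."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(conflict_pts)
    load = np.bincount(list(level_of.values()), minlength=n_levels)
    for c in range(k):
        flights = [flight_of_pt[i] for i in np.where(labels == c)[0]]
        if not flights:
            continue
        worst = max(set(flights), key=flights.count)  # most conflicted flight
        target = int(np.argmin(load))                 # least-loaded level
        load[level_of[worst]] -= 1
        load[target] += 1
        level_of[worst] = target
    return level_of
```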
https://arxiv.org/abs/2501.04281
Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments. This study proposes a method to improve evaluation functions for real-time responsiveness to battlefield situation changes, utilizing an online reinforcement learning-based dynamic weight adjustment mechanism within the real-time strategy game. Building on traditional static evaluation functions, the method employs gradient descent in online reinforcement learning to update weights dynamically, incorporating weight decay techniques to ensure stability. Additionally, the AdamW optimizer is integrated to adjust the learning rate and decay rate of online reinforcement learning in real time, further reducing the dependency on manual parameter tuning. Round-robin competition experiments demonstrate that this method significantly enhances the application effectiveness of the Lanchester combat model evaluation function, Simple evaluation function, and Simple Sqrt evaluation function in planning algorithms including IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable improvement in scores, with the enhancement becoming more pronounced as the map size increases. Furthermore, the increase in evaluation function computation time induced by this method is kept below 6% for all evaluation functions and planning algorithms. The proposed dynamic adaptive evaluation function demonstrates a promising approach for real-time strategy task evaluation.
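A minimal sketch of the weight-update mechanism, assuming the evaluation function is a linear combination of features whose weights are adjusted online with AdamW-style decoupled weight decay; the class name, loss gradient, and hyperparameters are illustrative.

```python
import numpy as np

class AdamWWeights:
    """Online AdamW update for evaluation-function weights."""
    def __init__(self, dim, lr=1e-2, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=1e-4):
        self.w = np.ones(dim)                 # evaluation-function weights
        self.m = np.zeros(dim)
        self.v = np.zeros(dim)
        self.lr, self.b1, self.b2 = lr, *betas
        self.eps, self.wd, self.t = eps, weight_decay, 0

    def step(self, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        # decoupled weight decay keeps the weights from drifting mid-game
        self.w -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps)
                             + self.wd * self.w)
```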
https://arxiv.org/abs/2501.03824
Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in real-world applications. Adversarial attacks are an effective method for evaluating the robustness of DRL agents. However, existing attack methods targeting individual sampled actions have limited impact on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, leveraging the entire policy distribution rather than relying on individual samples. We utilize the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experimental results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks, achieving an average 22.03% higher reward drop compared to the best baseline.
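To make the loss concrete: for diagonal Gaussian policies the Bhattacharyya distance has a closed form, and the attack ascends it under an L_inf projection. The sketch below assumes a `policy` callable returning means and variances; it is an illustration, not the released code.

```python
import torch

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between diagonal Gaussian policies."""
    var = 0.5 * (var1 + var2)
    term1 = 0.125 * ((mu1 - mu2) ** 2 / var).sum(-1)
    term2 = 0.5 * (torch.log(var)
                   - 0.5 * (torch.log(var1) + torch.log(var2))).sum(-1)
    return term1 + term2

def dapgd_step(policy, obs, obs_adv, eps=0.05, alpha=0.01):
    obs_adv = obs_adv.clone().requires_grad_(True)
    mu_c, var_c = policy(obs)                      # clean policy output
    mu_a, var_a = policy(obs_adv)                  # perturbed policy output
    loss = bhattacharyya_gaussian(mu_c.detach(), var_c.detach(),
                                  mu_a, var_a).mean()
    loss.backward()
    with torch.no_grad():
        obs_adv = obs_adv + alpha * obs_adv.grad.sign()   # ascend the distance
        obs_adv = obs + (obs_adv - obs).clamp(-eps, eps)  # project to the ball
    return obs_adv.detach()
```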
https://arxiv.org/abs/2501.03562
Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC may not fully leverage the multi-view information, owing to imbalanced and under-optimized view-specific features caused by the uniform learning objective applied to all views. For instance, particular views with more discriminative information could dominate the learning process in the joint training paradigm, leaving other views under-optimized. To alleviate this issue, we first analyze the imbalance phenomenon in the joint-training paradigm of multi-view clustering from the perspective of gradient descent for each view-specific feature extractor. Then, we propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view. Concretely, VCR preserves the sample similarities captured from the joint features and view-specific ones into the clustering distributions corresponding to view-specific features, enhancing the learning process of view-specific feature extractors. Additionally, a theoretical analysis illustrates that VCR adaptively modulates the magnitudes of the gradients used to update the parameters of view-specific feature extractors, yielding a balanced multi-view learning procedure. In this manner, BMvC achieves a better trade-off between the exploitation of view-specific patterns and the exploration of view-invariant patterns to fully learn the multi-view information for the clustering task. Finally, experiments on eight benchmark MvC datasets and two spatially resolved transcriptomics datasets verify the superiority of the proposed method over state-of-the-art approaches.
https://arxiv.org/abs/2501.02564
Federated Learning (FL) has received much attention in recent years. However, although clients are not required to share their data in FL, the global model itself can implicitly remember clients' local data. Therefore, it is necessary to effectively remove the target client's data from the FL global model to ease the risk of privacy leakage and implement ``the right to be forgotten''. Federated Unlearning (FU) has been considered a promising way to remove data without full retraining. But the model utility easily suffers a significant reduction during unlearning due to gradient conflicts. Furthermore, when conducting post-training to recover the model utility, the model is prone to move back and revert what has already been unlearned. To address these issues, we propose Federated Unlearning with Orthogonal Steepest Descent (FedOSD). We first design an unlearning Cross-Entropy loss to overcome the convergence issue of gradient ascent. A steepest descent direction for unlearning is then calculated under the condition that it does not conflict with the other clients' gradients while staying closest to the target client's gradient. This allows efficient unlearning while mitigating the reduction in model utility. After unlearning, we recover the model utility in a way that preserves what has been unlearned. Finally, extensive experiments in several FL scenarios verify that FedOSD outperforms SOTA FU algorithms in terms of both unlearning and model utility.
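As a simplified picture of the non-conflicting direction, the sketch below starts from the ascent direction on the target client and sequentially removes any component that conflicts with a remaining client's gradient; the paper derives the direction as an orthogonal steepest descent solution, so treat this projection loop as an illustration of the constraint, not the exact construction.

```python
import numpy as np

def unlearning_direction(g_target, other_grads):
    """Direction close to ascending on the target client while not
    conflicting (negative inner product) with other clients' gradients."""
    d = -g_target.copy()                 # gradient ascent on the target client
    for g in other_grads:
        dot = d @ g
        if dot < 0:                      # conflict: would harm this client
            d = d - (dot / (g @ g)) * g  # remove the conflicting component
    return d
```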
https://arxiv.org/abs/2412.20200
While safety-aligned large language models (LLMs) are increasingly used as the cornerstone of powerful systems such as multi-agent frameworks for solving complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g., token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on evolutionary algorithms, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and a transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
https://arxiv.org/abs/2501.00055
Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging due to the complexity of combining continuous (branch lengths) and discrete parameters (tree topology). Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. Existing Variational Inference methods, which require pre-generated topologies and typically treat tree structures and branch lengths independently, may overlook critical sequence features, limiting their accuracy and flexibility. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model to generate and optimize phylogenetic trees without dependence on evolutionary models or aligned sequence constraints. PhyloGen views phylogenetic inference as a conditionally constrained tree structure generation problem, jointly optimizing tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. Meanwhile, we introduce a Scoring Function to guide the model towards a more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets. Visualization results confirm PhyloGen provides deeper insights into phylogenetic relationships.
https://arxiv.org/abs/2412.18827
We present STITCH, a novel approach for neural implicit surface reconstruction of a sparse and irregularly spaced point cloud while enforcing topological constraints (such as having a single connected component). We develop a new differentiable framework based on persistent homology to formulate topological loss terms that enforce the prior of a single 2-manifold object. Our method demonstrates excellent performance in preserving the topology of complex 3D geometries, evident through both visual and empirical comparisons. We supplement this with a theoretical analysis, and provably show that optimizing the loss with stochastic (sub)gradient descent leads to convergence and enables reconstructing shapes with a single connected component. Our approach showcases the integration of differentiable topological data analysis tools for implicit surface reconstruction.
https://arxiv.org/abs/2412.18696
We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to calculate a macrobatch gradient that is then used to update model parameters. We find that gradients across microbatches are often orthogonal or negatively correlated, especially in late stages of training, which leads to memorization of the training set, reducing generalization. In this paper, we introduce a simple, computationally efficient way to reduce gradient variance by computing the cosine distance between micro-gradients during training and filtering out conflicting updates prior to averaging. We improve validation accuracy with significantly smaller microbatch sizes. We also show this reduces memorization of noisy labels. We demonstrate the effectiveness of this technique on standard image classification benchmarks including CIFAR-100 and CIFAR-100N-Fine. We show this technique consistently improves validation accuracy, in some cases by up to 18.2\% compared to traditional training approaches, while reducing the computation required by nearly an order of magnitude, because we can now rely on smaller microbatch sizes without destabilizing training.
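A minimal sketch of the filtering step, assuming flattened micro-gradients and a cosine-distance threshold; the threshold value and the running-average fold are illustrative choices.

```python
import numpy as np

def gaf_average(micro_grads, max_cos_dist=0.97):
    """Fold micro-gradients into a running average, skipping any whose
    cosine distance to the current aggregate exceeds the threshold."""
    agg, kept = micro_grads[0].copy(), 1
    for g in micro_grads[1:]:
        cos = agg @ g / (np.linalg.norm(agg) * np.linalg.norm(g) + 1e-12)
        if 1.0 - cos <= max_cos_dist:    # gradients agree enough to keep
            agg = (agg * kept + g) / (kept + 1)
            kept += 1
    return agg, kept
```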
https://arxiv.org/abs/2412.18052
The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.
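As a purely hypothetical sketch of what such a score plugin could compute, the function below ranks a node by the estimated marginal power of placing a GPU request on it, normalized to Kubernetes' 0-100 score range; the affine power model, field names, and constants are assumptions, not the paper's model.

```python
def pwr_score(node, requested_gpus, max_power=10_000.0):
    """Higher score = lower estimated marginal power draw (hypothetical)."""
    # pay the idle cost only if the node must be powered up for this pod
    idle = node["idle_watts"] if node["used_gpus"] == 0 else 0.0
    marginal = idle + requested_gpus * node["watts_per_gpu"]
    return round(100 * (1 - min(marginal, max_power) / max_power))

node = {"idle_watts": 300.0, "used_gpus": 2, "watts_per_gpu": 250.0}
print(pwr_score(node, requested_gpus=1))   # already-on node: only GPU cost
```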
https://arxiv.org/abs/2412.17484
We introduce \textbf{Gr}adient Descent with \textbf{A}daptive \textbf{M}omentum \textbf{S}caling (\textbf{Grams}), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We establish a global convergence guarantee for Grams and validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams' superior performance, including faster convergence and better generalization, compared to widely-used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams' potential as a transformative approach for efficient optimization in large-scale machine learning.
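One natural reading of this decoupling is an update whose sign comes from the current gradient and whose per-coordinate magnitude comes from an Adam-style moment estimate, as sketched below; hyperparameters and names are illustrative.

```python
import numpy as np

def grams_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Direction from the current gradient, magnitude from momentum."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    adam_update = lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - np.sign(g) * np.abs(adam_update)  # decoupled direction/magnitude
    return w, m, v
```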
https://arxiv.org/abs/2412.17107
The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward pass of LMs, the backward pass of attention has been largely overlooked. In this work, we study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as "Reversed Attention". We examine the properties of Reversed Attention and demonstrate its ability to elucidate the models' behavior and edit dynamics. In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model's weights, using a novel method called "attention patching". In addition to enhancing the comprehension of how LMs configure attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass.
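The quantity itself is easy to probe: it is the gradient of the loss with respect to the attention matrix, collected on the backward pass. The toy single-head example below shows how to extract it; shapes, names, and the stand-in objective are illustrative.

```python
import torch

d, n = 8, 5
q, k, v = (torch.randn(n, d, requires_grad=True) for _ in range(3))
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
attn.retain_grad()                 # keep the gradient of a non-leaf tensor
out = attn @ v
loss = out.sum()                   # stand-in for a real objective
loss.backward()
reversed_attention = attn.grad     # the (n, n) map studied in the paper
print(reversed_attention.shape)
```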
https://arxiv.org/abs/2412.17019
For gradient-based machine learning (ML) methods commonly adopted in practice such as stochastic gradient descent, the de facto differential privacy (DP) technique is perturbing the gradients with random Gaussian noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP such as data pricing, collaborative ML, and federated learning (FL). Can existing data valuation methods still be used when DP is enforced via gradient perturbations? We show that the answer is no with the default approach of injecting i.i.d.~random noise to the gradients because the estimation uncertainty of the data value estimation paradoxically linearly scales with more estimation budget, producing estimates almost like random guesses. To address this issue, we propose to instead inject carefully correlated noise to provably remove the linear scaling of estimation uncertainty w.r.t.~the budget. We also empirically demonstrate that our method gives better data value estimates on various ML tasks and is applicable to use cases including dataset valuation and~FL.
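A toy illustration of why correlation helps (this is not the paper's construction): a marginal contribution is a difference of two noisy utilities, so positively correlated noise cancels in the difference, while i.i.d. noise adds up.

```python
import numpy as np

rng = np.random.default_rng(0)
u_with, u_without = 0.82, 0.80                 # hypothetical utilities
n = 100_000
iid = rng.normal(0, 0.1, (n, 2))               # independent noise draws
shared = rng.normal(0, 0.1, (n, 1))
corr = np.concatenate([shared, shared], axis=1)  # perfectly correlated noise

mc_iid = (u_with + iid[:, 0]) - (u_without + iid[:, 1])
mc_corr = (u_with + corr[:, 0]) - (u_without + corr[:, 1])
print(mc_iid.std(), mc_corr.std())             # correlated: variance ~ 0
```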
https://arxiv.org/abs/2412.17008
Robotic dexterous grasping is a key step toward human-like manipulation. To fully unleash the potential of data-driven models for dexterous grasping, a large-scale, high-quality dataset is essential. While gradient-based optimization offers a promising way for constructing such datasets, existing works suffer from limitations, such as restrictive assumptions in energy design or limited experiments on small object sets. Moreover, the lack of a standard benchmark for comparing synthesis methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. Our system formulates grasp synthesis as a bilevel optimization problem, combining a novel lower-level quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDA-accelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single NVIDIA 3090 GPU. Our synthesized grasps for Shadow Hand and Allegro Hand achieve a success rate above 75% in MuJoCo, with a penetration depth and contact distance of under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, with a simulation success rate from around 40% to 80%. Real-world testing of the trained model on the Shadow Hand achieves an 81% success rate across 20 diverse objects.
https://arxiv.org/abs/2412.16490
This technical report introduces our top-ranked solution that employs two approaches, \ie suffix injection and projected gradient descent (PGD), to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incorrectly labeled option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add imperceptible perturbations to the image. Combining these two techniques enables successful attacks on the LLaVA 1.5 model.
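A minimal sketch of the two steps, assuming a `model` callable and a `loss_fn` that scores the pseudo-labeled option; this illustrates suffix injection plus L_inf PGD, not the authors' released code.

```python
import torch

def attack(model, loss_fn, image, query, wrong_option,
           eps=8 / 255, alpha=2 / 255, steps=20):
    query_adv = query + " " + wrong_option        # suffix injection
    x = image.clone()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = loss_fn(model(x, query_adv))       # push toward wrong option
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x + alpha * grad.sign()               # PGD ascent step
            x = image + (x - image).clamp(-eps, eps)  # project to eps-ball
            x = x.clamp(0, 1).detach()
    return x, query_adv
```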
https://arxiv.org/abs/2412.15614
Cross-Domain Few-Shot Learning~(CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent~(TSP). Our method first meta-learns Domain-Specific Preconditioners~(DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.
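A minimal sketch of the preconditioning step: a convex combination of positive definite domain-specific preconditioners stays positive definite, so the preconditioned gradient remains a descent direction. In the paper the coefficients come from the task; here they are passed in directly, and names are illustrative.

```python
import numpy as np

def tsp_step(w, grad, dsps, task_coeffs, lr=0.1):
    """dsps: positive definite (d, d) matrices; task_coeffs >= 0, sum to 1.
    The combined preconditioner P is positive definite, so P @ grad
    keeps a positive inner product with grad (a descent direction)."""
    P = sum(c * D for c, D in zip(task_coeffs, dsps))
    return w - lr * (P @ grad)
```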
https://arxiv.org/abs/2412.15483
Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectra of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectra, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
https://arxiv.org/abs/2412.14706
Federated graph learning (FGL) has gained significant attention for enabling heterogeneous clients to process their private graph data locally while interacting with a centralized server, thus maintaining privacy. However, graph data on clients are typically non-IID, posing a challenge for a single model to perform well across all clients. Another major bottleneck of FGL is the high cost of communication. To address these challenges, we propose a communication-efficient personalized federated graph learning algorithm, CEFGL. Our method decomposes the model parameters into low-rank generic and sparse private models. We employ a dual-channel encoder to learn sparse local knowledge in a personalized manner and low-rank global knowledge in a shared manner. Additionally, we perform multiple local stochastic gradient descent iterations between communication phases and integrate efficient compression techniques into the algorithm. The advantage of CEFGL lies in its ability to capture common and individual knowledge more precisely. By utilizing low-rank and sparse parameters along with compression techniques, CEFGL significantly reduces communication complexity. Extensive experiments demonstrate that our method achieves optimal classification accuracy in a variety of heterogeneous environments across sixteen datasets. Specifically, compared to the state-of-the-art method FedStar, the proposed method (with GIN as the base model) improves accuracy by 5.64\% on cross-datasets setting CHEM, reduces communication bits by a factor of 18.58, and reduces the communication time by a factor of 1.65.
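As a rough sketch of the decomposition idea, the function below splits a weight matrix into a low-rank part (communicated and shared) and a sparse residual (kept private); the SVD-plus-thresholding rule, rank, and sparsity level are illustrative stand-ins for the paper's learned decomposition.

```python
import numpy as np

def decompose(W, rank=4, sparsity=0.05):
    """Split W into a rank-`rank` generic part and a sparse private part."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]  # shared, sent
    residual = W - low_rank
    thresh = np.quantile(np.abs(residual), 1 - sparsity)
    sparse = np.where(np.abs(residual) >= thresh, residual, 0.0)  # private
    return low_rank, sparse
```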
https://arxiv.org/abs/2412.13442