Attributing APT (Advanced Persistent Threat) malware to their respective groups is crucial for threat intelligence and cybersecurity. However, APT adversaries often conceal their identities, rendering attribution inherently adversarial. Existing machine learning-based attribution models, while effective, remain highly vulnerable to adversarial attacks. For example, the state-of-the-art byte-level model MalConv sees its accuracy drop from over 90% to below 2% under PGD (projected gradient descent) attacks. In this study, we apply existing gradient-based adversarial training techniques from malware detection and image processing to malware attribution, and find that both robustness and training efficiency require significant improvement. To address this, we propose RoMA, a novel single-step adversarial training approach that integrates global perturbations to generate enhanced adversarial samples and employs adversarial consistency regularization to improve representation quality and resilience. A novel APT malware dataset named AMG18, with diverse samples and realistic class imbalances, is introduced for evaluation. Extensive experiments show that RoMA significantly outperforms seven competing methods in both adversarial robustness (e.g., achieving over 80% robust accuracy, more than twice that of the next-best method, under PGD attacks) and training efficiency (e.g., training more than twice as fast as the method with the second-best accuracy), while maintaining superior standard accuracy in non-adversarial scenarios.
https://arxiv.org/abs/2502.07492
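Below is a minimal, hedged sketch of the kind of $L_\infty$ PGD attack referred to above, written for a generic differentiable PyTorch classifier; the epsilon, step size, and input handling are illustrative placeholders, not the MalConv/RoMA setup.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.1, alpha=0.02, steps=10):
    """Minimal L-infinity PGD: repeatedly ascend the loss and project the
    perturbation back into the eps-ball around the clean input x."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start inside the ball
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # signed gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project onto the eps-ball
    return x_adv.detach()
```

A single-step adversarial training scheme in the spirit of RoMA would generate such samples with `steps=1` and an enlarged step size; the paper's global-perturbation and adversarial-consistency components are not reproduced here.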
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact ``student'' model from a large ``teacher'' model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory in biological learning and memory known as the \emph{spacing effect}, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31\% and 3.34\% on Tiny-ImageNet over online KD and self KD, respectively).
https://arxiv.org/abs/2502.06192
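A minimal sketch of how the spacing idea could look in a self-KD training loop, assuming a generic PyTorch classifier and data loader; the snapshot interval `space`, the temperature, and the loss weighting are illustrative placeholders, and the "teacher" here is simply a periodic snapshot separated from the student by `space` optimization steps, which only illustrates the interval mechanism rather than the paper's exact protocol.

```python
import copy
import torch
import torch.nn.functional as F

def spaced_self_kd(model, loader, optimizer, space=100, tau=4.0, alpha=0.5, device="cpu"):
    """Toy spaced self-KD loop: the teacher is a frozen snapshot of the model
    refreshed every `space` steps; the student matches its softened outputs."""
    teacher = copy.deepcopy(model).eval()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        if step % space == 0:                        # refresh the teacher at spaced intervals
            teacher = copy.deepcopy(model).eval()
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = model(x)
        kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                      F.softmax(t_logits / tau, dim=1),
                      reduction="batchmean") * tau * tau
        loss = (1 - alpha) * F.cross_entropy(s_logits, y) + alpha * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```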
In practical applications, lattice quantizers leverage discrete lattice points to approximate arbitrary points in space. An effective lattice quantizer significantly enhances both the accuracy and efficiency of these approximations. In the context of high-dimensional lattice quantization, previous work proposed utilizing low-dimensional optimal lattice quantizers and addressed the challenge of determining the optimal length ratio in orthogonal splicing. Notably, it was demonstrated that fixed length ratios and orthogonality yield suboptimal results when combining low-dimensional lattices. Building on this foundation, another approach employed gradient descent to identify optimal lattices, which inspired us to explore the use of neural networks to discover generator matrices that outperform those obtained from orthogonal splicing methods. We propose two novel approaches to tackle this problem: the Householder Algorithm and the Matrix Exp Algorithm. Our results indicate that both the Householder Algorithm and the Matrix Exp Algorithm achieve improvements in lattice quantizers across dimensions 13, 15, 17 to 19, 21, and 22. Moreover, the Matrix Exp Algorithm demonstrates superior efficacy in high-dimensional settings.
https://arxiv.org/abs/2502.06887
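A rough sketch of the matrix-exponential idea, assuming the goal is to learn an $n \times n$ lattice generator by gradient descent on a Monte-Carlo estimate of the quantizer's second moment; the Babai-style coordinate rounding below is only a stand-in for exact closest-point search, and none of the names correspond to the paper's actual implementation.

```python
import torch

def unit_volume_generator(A):
    """'Matrix Exp'-style parameterization: B = exp(A) is always invertible,
    and rescaling by |det B|^(1/n) yields a generator with unit covolume."""
    B = torch.matrix_exp(A)
    n = B.shape[0]
    return B / torch.det(B).abs().pow(1.0 / n)

def second_moment_estimate(B, num_samples=8192):
    """Monte-Carlo estimate of the per-dimension second moment of the rounding
    quantizer in basis B -- an approximation to the true lattice quantizer's NSM."""
    n = B.shape[0]
    u = torch.rand(num_samples, n)        # uniform coordinates in the fundamental cell
    err = (u - u.round()) @ B             # quantization error under coordinate rounding
    return err.pow(2).sum(dim=1).mean() / n

A = torch.zeros(13, 13, requires_grad=True)   # dimension 13, one of the reported cases
opt = torch.optim.Adam([A], lr=1e-2)
for _ in range(500):
    loss = second_moment_estimate(unit_volume_generator(A))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With exact nearest-point search in place of the rounding step, the same pipeline could in principle discover non-trivial lattices; with plain rounding the optimum degenerates to a rotation of the cubic lattice, so this sketch only illustrates the parameterization and the gradient flow.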
Existing network paradigms have achieved lower downtime and higher Quality of Experience (QoE) through the use of Artificial Intelligence (AI)-based network management tools. These AI management systems allow for automatic responses to changes in network conditions, lowering operating costs for operators and improving overall performance. While adopting AI-based management tools enhances overall network performance, it also introduces challenges such as the removal of human supervision, privacy violations, algorithmic bias, and model inaccuracies. Furthermore, AI-based agents that fail to address these challenges should be held culpable themselves, rather than the network as a whole. To address this accountability gap, a framework consisting of a Deep Reinforcement Learning (DRL) model and a Machine Learning (ML) model is proposed to identify and assign numerical values of responsibility to the AI-based management agents involved in any decision-making regarding the network conditions, which eventually affects the end user. A simulation environment was created so that the framework could be trained using simulated network operation parameters. During testing, the DRL model identified the AI-based management agents with 96% accuracy, while the ML model, trained with gradient descent, learned the network conditions with 83% accuracy.
https://arxiv.org/abs/2502.05608
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL), where models adapt to new tasks through example-based prompts without requiring parameter updates. However, understanding how tasks are internally encoded and generalized remains a challenge. To address some of the empirical and technical gaps in the literature, we introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads within the transformer architecture. This approach computes a single task vector as a weighted sum of attention heads, with the weights optimized causally via gradient descent. Our findings show that existing methods fail to generalize effectively to modalities beyond text. In response, we also design a benchmark to evaluate whether a task vector can preserve task fidelity in functional regression tasks. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks, demonstrating its generalizability across modalities. Moreover, ablation studies show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
https://arxiv.org/abs/2502.05390
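A toy sketch of the "weighted sum of attention heads" construction: cached per-head activations are combined with learnable weights into a single task vector, which is then scored by a stand-in linear readout so the example runs end-to-end. In the actual method the vector would be injected into the frozen transformer and the weights optimized causally; all shapes and names here are hypothetical.

```python
import torch
import torch.nn.functional as F

# Stand-in tensors (the real method would cache these from a transformer):
# head_acts[p, l, h] = output of attention head (l, h) at the last token of prompt p.
num_prompts, num_layers, num_heads, d_model, num_classes = 32, 12, 12, 64, 5
head_acts = torch.randn(num_prompts, num_layers, num_heads, d_model)
readout = torch.nn.Linear(d_model, num_classes)            # stand-in for the frozen LM head
targets = torch.randint(0, num_classes, (num_prompts,))    # behaviour to reproduce

weights = torch.zeros(num_layers, num_heads, requires_grad=True)  # one weight per head
opt = torch.optim.Adam([weights], lr=5e-2)

for _ in range(100):
    # Task vector = weighted sum of head outputs, one vector per prompt.
    tv = torch.einsum("lh,plhd->pd", weights, head_acts)
    loss = F.cross_entropy(readout(tv), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```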
Anomaly detection is crucial in the energy sector to identify irregular patterns indicating equipment failures, energy theft, or other issues. Machine learning techniques for anomaly detection have achieved great success, but they are typically centralized, involving sharing local data with a central server, which raises privacy and security concerns. Federated Learning (FL) has been gaining popularity as it enables distributed learning without sharing local data. However, FL depends on neural networks, which are vulnerable to adversarial attacks that manipulate data, leading models to make erroneous predictions. While adversarial attacks have been explored in the image domain, they remain largely unexplored in time series problems, especially in the energy domain. Moreover, the effect of adversarial attacks in the FL setting is also mostly unknown. This paper assesses the vulnerability of FL-based anomaly detection in energy data to adversarial attacks. Specifically, two state-of-the-art models, Long Short Term Memory (LSTM) and Transformers, are used to detect anomalies in an FL setting, and two white-box attack methods, Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), are employed to perturb the data. The results show that FL is more sensitive to PGD attacks than to FGSM attacks, a difference attributed to PGD's iterative nature; even naive, weaker attacks cause an accuracy drop of over 10%. Moreover, FL is more affected by these attacks than centralized learning, highlighting the need for defense mechanisms in FL.
https://arxiv.org/abs/2502.05041
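For reference, here is a sketch of the single-step FGSM perturbation used as the weaker baseline above, written for a generic differentiable time-series classifier in PyTorch; shapes and epsilon are placeholders. PGD simply iterates this signed-gradient step with projection, as in the sketch given after the first entry.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.05):
    """One-step FGSM: perturb a batch of sequences x (batch, time, features)
    in the direction of the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```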
The IoT facilitates a connected, intelligent, and sustainable society; therefore, it is imperative to protect the IoT ecosystem. IoT-based 5G and 6G networks will increasingly leverage machine learning and artificial intelligence (ML/AI) to pave the way for autonomous and collaborative secure IoT networks. Zero-touch, zero-trust IoT security with AI/ML enablement frameworks offers a powerful approach to securing the expanding landscape of Internet of Things (IoT) devices. This paper presents a novel framework that integrates Zero Trust, Zero Touch, and AI/ML for the detection, mitigation, and prevention of DDoS attacks in modern IoT ecosystems. The focus is on the new integrated framework, which establishes zero trust for all IoT traffic, fixed and mobile 5G/6G IoT network traffic, and data security (zero-touch quarantine and dynamic policy enforcement). We perform a comparative analysis of five machine learning models, namely XGBoost, Random Forest, K-Nearest Neighbors, Stochastic Gradient Descent, and Naive Bayes, comparing them in terms of accuracy, precision, recall, F1-score, and ROC-AUC. Results show that the best performance in detecting and mitigating different DDoS vectors comes from the ensemble-based approaches.
https://arxiv.org/abs/2502.03614
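A minimal sketch of the five-model comparison on the metrics listed above, using scikit-learn and xgboost on a synthetic imbalanced dataset as a stand-in for the paper's DDoS traffic data; all hyperparameters are library defaults, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier   # assumes the xgboost package is installed

# Synthetic, imbalanced binary data as a placeholder for labeled DDoS flow features.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SGD (logistic)": SGDClassifier(loss="log_loss", random_state=0),  # 'log' in older scikit-learn
    "Naive Bayes": GaussianNB(),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    print(name, "ROC-AUC:", round(roc_auc_score(y_te, proba), 3))
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```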
Neural architecture search (NAS) has shown promise towards automating neural network design for a given task, but it is computationally demanding due to the training costs associated with evaluating a large number of architectures to find the optimal one. To speed up NAS, recent works limit the search to network building blocks (modular search) instead of searching the entire architecture (global search), approximate candidates' performance evaluation in lieu of complete training, and use gradient descent rather than the naturally suitable discrete optimization approaches. However, modular search does not determine the network's macro architecture, i.e., its depth and width, demanding manual trial and error after the search and hence lacking automation. In this work, we revisit NAS and design a navigable, yet architecturally diverse, macro-micro search space. In addition, to determine relative rankings of candidates, existing methods employ consistent approximations across entire search spaces, whereas different networks may not be fairly comparable under one training protocol. Hence, we propose an architecture-aware approximation with variable training schemes for different networks. Moreover, we develop an efficient search strategy by disjoining macro-micro network design that yields competitive architectures in terms of both accuracy and size. Our proposed framework achieves a new state-of-the-art on EMNIST and KMNIST, while being highly competitive on the CIFAR-10, CIFAR-100, and FashionMNIST datasets and being 2-4x faster than the fastest global search methods. Lastly, we demonstrate the transferability of our framework to real-world computer vision problems by discovering competitive architectures for face recognition applications.
https://arxiv.org/abs/2502.03553
We give a comprehensive analysis of transformers as time series foundation models, focusing on their approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates. We prove that it is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of transformers as time series foundation models.
https://arxiv.org/abs/2502.03383
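To make "fit an autoregressive model ... via gradient descent" concrete, here is a plain-numpy example of fitting AR(p) coefficients to a univariate series by gradient descent on the one-step-ahead squared error; it illustrates the objective the analysis attributes to the transformer, not the transformer construction itself, and all constants are arbitrary.

```python
import numpy as np

def fit_ar_by_gd(series, p=3, lr=0.1, steps=2000):
    """Fit AR(p) weights w by gradient descent on the mean squared
    one-step-ahead prediction error."""
    X = np.stack([series[i:i + p] for i in range(len(series) - p)])  # lagged windows
    y = series[p:]
    w = np.zeros(p)
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([0.1, -0.2, 0.5])      # coefficients on (x[t-3], x[t-2], x[t-1])
s = np.zeros(2000)
for t in range(3, len(s)):
    s[t] = s[t - 3:t] @ true_w + rng.standard_normal()
print(fit_ar_by_gd(s), "vs true", true_w)
```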
In Federated Learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Gradient Tracking (GT) has recently emerged as a solution which mitigates this issue by introducing correction terms to local model updates. To date, GT has only been considered under Stochastic Gradient Descent (SGD)-based model training, while modern FL frameworks increasingly employ adaptive optimizers for improved convergence. In this work, we generalize the GT framework to a more flexible Parameter Tracking (PT) paradigm and propose two novel adaptive optimization algorithms, {\tt FAdamET} and {\tt FAdamGT}, that integrate PT into Adam-based FL. We provide a rigorous convergence analysis of these algorithms under non-convex settings. Our experimental results demonstrate that both proposed algorithms consistently outperform existing methods when evaluating total communication cost and total computation cost across varying levels of data heterogeneity, showing the effectiveness of correcting first-order information in federated adaptive optimization.
https://arxiv.org/abs/2502.02727
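For readers unfamiliar with the correction terms being generalized, below is a compact numpy sketch of one SCAFFOLD-style gradient-tracking round; it is a generic illustration of correcting local updates with control variates, not the paper's FAdamET/FAdamGT algorithms (which combine such corrections with Adam), and all names are placeholders.

```python
import numpy as np

def fl_round_with_tracking(x, clients, c_global, c_locals, lr=0.1, local_steps=5):
    """One full-participation round: each client i has a gradient oracle clients[i](x)
    and a local correction c_locals[i]; corrections steer local steps toward the
    global descent direction despite heterogeneous data."""
    deltas, new_c = [], []
    for grad_fn, c_i in zip(clients, c_locals):
        xi = x.copy()
        for _ in range(local_steps):
            xi -= lr * (grad_fn(xi) - c_i + c_global)        # corrected local step
        c_new = c_i - c_global + (x - xi) / (local_steps * lr)
        deltas.append(xi - x)
        new_c.append(c_new)
    x_new = x + np.mean(deltas, axis=0)
    c_global_new = c_global + np.mean([cn - ci for cn, ci in zip(new_c, c_locals)], axis=0)
    return x_new, c_global_new, new_c
```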
Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at the repository: this https URL.
https://arxiv.org/abs/2502.02431
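As a toy illustration of "decoupling the momentum coefficient from the current gradient's weight", the update below mixes a slow gradient EMA and the fresh gradient with independent weights; it captures the flavour of the accelerated-SGD/AdEMAMix connection discussed above but is not Simplified-AdEMAMix itself, and the constants are arbitrary.

```python
import numpy as np

def decoupled_momentum_sgd(grad_fn, x0, lr=0.05, beta=0.95, c=1.0, steps=300):
    """Heavy-ball-style update in which the EMA decay (beta) and the weight on the
    current gradient (c) are independent knobs, instead of being tied via (1 - beta)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1.0 - beta) * g    # slow exponential average of past gradients
        x = x - lr * (m + c * g)           # fresh gradient enters with its own weight c
    return x

# Example: noisy quadratic, whose gradient is x plus noise.
rng = np.random.default_rng(0)
print(decoupled_momentum_sgd(lambda x: x + 0.1 * rng.standard_normal(x.shape), np.ones(5)))
```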
In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion's better generalization performance recently documented.
https://arxiv.org/abs/2502.02132
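To make the construction concrete, here is a back-of-the-envelope version for SGD with heavy-ball momentum, keeping only the leading correction term; the constants are indicative and this is not the paper's exact statement. The momentum update with memory is
\[
x_{t+1} = x_t - \eta \sum_{k\ge 0}\beta^{k}\,\nabla L(x_{t-k})
\;\approx\;
x_t - \frac{\eta}{1-\beta}\,\nabla L(x_t)
\;-\; \eta\sum_{k\ge 0}\beta^{k}\,\nabla^{2} L(x_t)\,\big(x_{t-k}-x_t\big).
\]
Approximating the past displacements by steady-state steps, $x_{t-k}-x_t \approx k\,\frac{\eta}{1-\beta}\nabla L(x_t)$, the correction collapses to the gradient of a perturbed loss,
\[
x_{t+1} \;\approx\; x_t - \frac{\eta}{1-\beta}\,\nabla\!\left[\,L(x_t) + \frac{\eta\beta}{2(1-\beta)^{2}}\,\big\|\nabla L(x_t)\big\|^{2}\right],
\]
i.e., memory acts like an implicit gradient-norm penalty whose sign and size determine whether it regularizes or anti-regularizes the dynamics.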
A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
https://arxiv.org/abs/2502.01930
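For grounding, the sketch below shows the standard per-example DPO loss together with a generic KL-DRO dual surrogate (a log-sum-exp tilting of per-example losses); the tilting is a stand-in for the flavour of KLDPO, not the paper's exact objective, and the inputs are random placeholders.

```python
import math
import torch
import torch.nn.functional as F

def dpo_losses(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example standard DPO loss given (policy, reference) log-probabilities
    of the chosen (w) and rejected (l) responses."""
    margins = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margins)

def kl_robust_objective(losses, tau=1.0, rho=0.1):
    """Generic KL-DRO dual surrogate: tau * log-mean-exp(loss / tau) + tau * rho,
    which upweights the hardest examples under a KL ambiguity set."""
    return tau * (torch.logsumexp(losses / tau, dim=0) - math.log(len(losses))) + tau * rho

# Toy usage with random log-probabilities standing in for model evaluations.
torch.manual_seed(0)
lp = torch.randn(4, 16)                      # rows: policy_w, policy_l, ref_w, ref_l
losses = dpo_losses(lp[0], lp[1], lp[2], lp[3])
print(kl_robust_objective(losses))
```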
Rule-based models play a crucial role in scenarios that require transparency and accountable decision-making. However, they primarily consist of discrete parameters and structures, which presents challenges for scalability and optimization. In this work, we introduce a new rule-based classifier trained using gradient descent, in which the user can control the maximum number and length of the rules. For numerical partitions, the user can also control the partitions used with fuzzy sets, which also helps keep the number of partitions small. We perform a series of exhaustive experiments on $40$ datasets to show how this classifier performs in terms of accuracy and rule base size. Then, we compare our results with a genetic search that fits an equivalent classifier and with other explainable and non-explainable state-of-the-art classifiers. Our results show how our method can obtain compact rule bases that use significantly fewer patterns than other rule-based methods and perform better than other explainable classifiers.
https://arxiv.org/abs/2502.01375
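A minimal sketch of a gradient-trainable rule-based classifier of this general kind: Gaussian fuzzy sets per feature, a product t-norm for rule firing, and a linear rule-to-class layer. The architectural details (number of rules, membership family, how rule count and length are capped) are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyRuleClassifier(nn.Module):
    """Each rule is a conjunction of Gaussian fuzzy sets over the features;
    rule firing strengths are linearly combined into class scores."""
    def __init__(self, n_features, n_rules, n_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_rules, n_features))
        self.log_widths = nn.Parameter(torch.zeros(n_rules, n_features))
        self.rule_class = nn.Parameter(torch.zeros(n_rules, n_classes))

    def forward(self, x):                                  # x: (batch, n_features)
        z = (x.unsqueeze(1) - self.centers) / self.log_widths.exp()
        memberships = torch.exp(-0.5 * z ** 2)             # (batch, n_rules, n_features)
        firing = memberships.prod(dim=-1)                  # AND via product t-norm
        return firing @ self.rule_class                    # class scores

model = FuzzyRuleClassifier(n_features=4, n_rules=8, n_classes=3)
x, y = torch.randn(32, 4), torch.randint(0, 3, (32,))
loss = F.cross_entropy(model(x), y)
loss.backward()                                            # everything is differentiable
```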
Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black-box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real-world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black-box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state-of-the-art benchmark. Code is available at this https URL.
https://arxiv.org/abs/2502.01262
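A simplified sketch of the feature-level idea: run PGD, but compute the loss from intermediate features of a surrogate backbone rather than from output predictions. FSPGD's actual loss additionally targets local feature similarity and object-level spatial context, which this stand-in omits; `feature_extractor` is an assumed callable returning intermediate feature maps.

```python
import torch
import torch.nn.functional as F

def feature_pgd(feature_extractor, x, eps=8/255, alpha=2/255, steps=10):
    """Feature-space PGD: push the adversarial example's intermediate features
    away from those of the clean input, staying in an eps-ball around x."""
    with torch.no_grad():
        clean_feat = feature_extractor(x)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = -F.cosine_similarity(feature_extractor(x_adv).flatten(1),
                                    clean_feat.flatten(1), dim=1).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)
            x_adv = x_adv.clamp(0, 1)                      # keep a valid image range
    return x_adv.detach()
```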
As privacy concerns and data regulations grow, federated learning (FL) has emerged as a promising approach for training machine learning models across decentralized data sources without sharing raw data. However, a significant challenge in FL is that client data are often non-IID (non-independent and identically distributed), leading to reduced performance compared to centralized learning. While many methods have been proposed to address this issue, their underlying mechanisms are often viewed from different perspectives. Through a comprehensive investigation from gradient descent to FL, and from IID to non-IID data settings, we find that inconsistencies in client loss landscapes primarily cause performance degradation in non-IID scenarios. From this understanding, we observe that existing methods can be grouped into two main strategies: (i) adjusting parameter update paths and (ii) modifying client loss landscapes. These findings offer a clear perspective on addressing non-IID challenges in FL and help guide future research in the field.
https://arxiv.org/abs/2502.00182
Transformer-based models have demonstrated remarkable ability in in-context learning (ICL), where they can adapt to unseen tasks from a prompt with a few examples, without requiring parameter updates. Recent research has provided insight into how linear Transformers can perform ICL by implementing gradient descent estimators. In particular, it has been shown that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent with respect to a linear least-squares objective when trained on random linear regression tasks. However, the theoretical understanding of ICL for nonlinear function classes remains limited. In this work, we address this gap by first showing that LSA is inherently restricted to solving linear least-squares objectives and thus, the solutions in prior works cannot readily extend to nonlinear ICL tasks. To overcome this limitation, drawing inspiration from modern architectures, we study a mechanism that combines LSA with GLU-like feed-forward layers and show that this allows the model to perform one step of gradient descent on a polynomial kernel regression. Further, we characterize the scaling behavior of the resulting Transformer model, highlighting the necessary model size to effectively handle quadratic ICL tasks. Our findings highlight the distinct roles of attention and feed-forward layers in nonlinear ICL and identify key challenges when extending ICL to nonlinear function classes.
https://arxiv.org/abs/2501.18187
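The linear-least-squares claim can be stated in one line: with demonstrations $(x_i, y_i)_{i=1}^{n}$ and a query $x_{\text{query}}$, a single gradient-descent step from $w_0 = 0$ gives
\[
L(w)=\frac{1}{2n}\sum_{i=1}^{n}\big(w^{\top}x_i-y_i\big)^2,\qquad
w_1 = w_0-\eta\nabla L(w_0)=\frac{\eta}{n}\sum_{i=1}^{n}y_i\,x_i,
\]
\[
\hat{y}_{\text{query}} = w_1^{\top}x_{\text{query}} = \frac{\eta}{n}\sum_{i=1}^{n} y_i\,\big(x_i^{\top}x_{\text{query}}\big),
\]
which is exactly the kind of bilinear demonstration-query readout a linear self-attention head can compute; extending this to a polynomial kernel is where the GLU-like feed-forward layer enters.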
We propose a zero-shot method for generating images in arbitrary spaces (e.g., a sphere for 360° panoramas and a mesh surface for texture) using a pretrained image diffusion model. The zero-shot generation of various visual content using a pretrained image diffusion model has been explored mainly in two directions. First, Diffusion Synchronization, which performs reverse diffusion processes jointly across different projected spaces while synchronizing them in the target space, generates high-quality outputs when enough conditioning is provided, but it struggles in its absence. Second, Score Distillation Sampling, which gradually updates the target-space data through gradient descent, results in better coherence but often lacks detail. In this paper, we reveal for the first time the interconnection between these two methods while highlighting their differences. To this end, we propose StochSync, a novel approach that combines the strengths of both, enabling effective performance with weak conditioning. Our experiments demonstrate that StochSync provides the best performance in 360° panorama generation (where image conditioning is not given), outperforming previous finetuning-based methods, and also delivers results comparable to previous methods in 3D mesh texturing (where depth conditioning is provided).
https://arxiv.org/abs/2501.15445
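For context, the "gradually updates the target-space data through gradient descent" step refers to the standard Score Distillation Sampling update (as introduced in DreamFusion), which for a differentiable parameterization $x=g(\theta)$ reads
\[
\nabla_{\theta}\mathcal{L}_{\text{SDS}} \;\approx\; \mathbb{E}_{t,\epsilon}\!\left[\,w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\,y,\,t)-\epsilon\big)\,\frac{\partial x}{\partial \theta}\right],
\]
where $x_t=\alpha_t x+\sigma_t\epsilon$ and $\hat{\epsilon}_{\phi}$ is the pretrained diffusion model's noise prediction; StochSync's contribution lies in how this is combined with diffusion synchronization, which the formula alone does not capture.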
The integration of contextual embeddings into the optimization processes of large language models is an advancement in natural language processing. The Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient adjustment mechanism, incorporating contextual embeddings directly into the optimization process. This approach facilitates real-time parameter adjustments, enhancing task-specific generalization even in the presence of sparse or noisy data inputs. The mathematical foundation of this framework relies on gradient descent modifications, where contextual embeddings are derived from a supplementary neural network trained to map input features to optimal adaptation gradients. By employing differential geometry principles, high-dimensional input dependencies are encoded into low-dimensional gradient manifolds, enabling efficient adaptation without necessitating the retraining of the entire model. Empirical evaluations demonstrate that the proposed framework consistently outperforms baseline models across various metrics, including accuracy, robustness to noise, and computational efficiency. The integration of context-specific embeddings allows for a more complex understanding of language, thereby improving the model's ability to handle diverse linguistic phenomena. Furthermore, the computational efficiency achieved through this method demonstrates its scalability for large-scale language models operating under diverse constraints.
https://arxiv.org/abs/2501.14936
3D Gaussian Splatting (3DGS) has emerged as a mainstream solution for novel view synthesis and 3D reconstruction. By explicitly encoding a 3D scene using a collection of Gaussian kernels, 3DGS achieves high-quality rendering with superior efficiency. As a learning-based approach, 3DGS training has typically been handled with the standard stochastic gradient descent (SGD) method, which offers at most linear convergence. Consequently, training often requires tens of minutes, even with GPU acceleration. This paper introduces a (near) second-order convergent training algorithm for 3DGS, leveraging its unique properties. Our approach is inspired by two key observations. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which endorses isolated and local optimization algorithms. We exploit this by splitting the optimization at the level of individual kernel attributes, analytically constructing small-size Newton systems for each parameter group, and efficiently solving these systems on GPU threads. This achieves Newton-like convergence per training image without relying on the global Hessian. Second, kernels exhibit sparse and structured coupling across input images. This property allows us to effectively utilize spatial information to mitigate overshoot during stochastic training. Our method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over $10\times$ fewer iterations while maintaining or surpassing the quality of SGD-based 3DGS reconstructions.
https://arxiv.org/abs/2501.13975
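A toy sketch of the per-attribute idea: instead of one global Hessian, build and solve a small damped Gauss-Newton system for each low-dimensional parameter group. The real method constructs these systems analytically per Gaussian kernel and solves them in parallel on GPU threads, which this generic autograd version does not attempt; the residual function and target are hypothetical.

```python
import torch

def local_newton_step(residual_fn, params, damping=1e-3):
    """One damped Gauss-Newton step for a small parameter group, using an explicit
    Jacobian of the residuals with respect to just that group."""
    J = torch.autograd.functional.jacobian(residual_fn, params)   # (n_residuals, n_params)
    r = residual_fn(params)
    H = J.T @ J + damping * torch.eye(params.numel())             # small local "Hessian"
    g = J.T @ r
    return params + torch.linalg.solve(H, -g)

# Toy example: fit a 3-parameter attribute so that sigmoid(p) matches a target color.
target = torch.tensor([0.2, 0.7, 0.4])
residuals = lambda p: torch.sigmoid(p) - target
p = torch.zeros(3)
for _ in range(5):
    p = local_newton_step(residuals, p)
print(torch.sigmoid(p))   # close to the target after a few Newton-like steps
```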