Abstract
Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), which enhances safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX yields 11% to 54% absolute reductions in attack success rate (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of the parameters, we attribute the success of LoX to the extrapolation moving the LLM parameters into a flatter region, making them less sensitive to perturbations. The code is available at this http URL.
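The abstract only sketches the mechanism, so below is a minimal illustrative sketch of what "extrapolating a safety subspace" could look like, assuming the safety subspace is taken to be the top singular directions of the weight change introduced by alignment. The function name extrapolate_safety_subspace and the knobs rank_k and alpha are hypothetical choices for illustration, not the paper's actual procedure or API.

import torch

def extrapolate_safety_subspace(w_base: torch.Tensor,
                                w_aligned: torch.Tensor,
                                rank_k: int = 8,
                                alpha: float = 0.5) -> torch.Tensor:
    """Amplify the top-k singular directions of the alignment update (sketch)."""
    # Weight change introduced by alignment (assumed available).
    delta = w_aligned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # Rank-k reconstruction of the (assumed) safety-critical subspace.
    low_rank = u[:, :rank_k] @ torch.diag(s[:rank_k]) @ vh[:rank_k, :]
    # Training-free extrapolation: push further along those directions.
    return w_aligned + alpha * low_rank

# Toy usage on a random weight matrix standing in for one LLM layer.
w0 = torch.randn(64, 64)
w1 = w0 + 0.1 * torch.randn(64, 64)
w_lox = extrapolate_safety_subspace(w0, w1)

Because the update is a fixed linear combination of existing weights, it requires no gradient steps, which is consistent with the abstract's claim that the method is training-free.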
Abstract (translated)
Large Language Models (LLMs) have become indispensable tools in real-world applications. However, their widespread adoption raises safety concerns, particularly when answering questions that may cause social harm. Despite substantial efforts to improve model safety through alignment, the safety protections of aligned models can still be broken by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically show that this vulnerability stems from the high sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), which enhances safety robustness by extrapolating the safety subspace of an aligned LLM. Experimental results confirm the effectiveness of LoX, showing that it withstands both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For example, LoX achieves 11% to 54% absolute reductions in attack success rate (ASR) against benign or malicious fine-tuning attacks. By examining the ASR landscape of the parameters, we attribute the success of LoX to the extrapolation moving the LLM parameters into a flatter region, making them less sensitive to perturbations. The code is available at this URL.
URL
https://arxiv.org/abs/2506.15606