
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

2025-06-18 16:30:02
Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Abstract

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly when they respond to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning, even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rate (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to the extrapolation moving LLM parameters to a flatter zone, which makes them less sensitive to perturbations. The code is available at this http URL.
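
The abstract does not spell out how the safety subspace is obtained or how the extrapolation is computed, so the following is only a minimal sketch under stated assumptions. It assumes that the safety subspace of one weight matrix can be approximated by the top singular directions of the alignment update (aligned weights minus base weights), and that extrapolation means adding a scaled copy of that low-rank component back onto the aligned weights. The function name `low_rank_extrapolate`, the parameters `rank` and `alpha`, and the toy matrices are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def low_rank_extrapolate(w_base, w_aligned, rank, alpha):
    """Amplify the top-`rank` singular directions of the alignment update.

    Assumption (not stated in the abstract): the safety subspace is taken
    to be the dominant low-rank part of (w_aligned - w_base).
    """
    delta = w_aligned - w_base                      # alignment update
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Rank-`rank` approximation of the update from its top singular triplets.
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Training-free step: push the aligned weights further along that subspace.
    return w_aligned + alpha * low_rank


# Toy data standing in for one weight matrix of a model (hypothetical values).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
# Toy "alignment update": one dominant direction plus small noise.
direction = np.outer(rng.normal(size=8), rng.normal(size=6))
w_aligned = w_base + direction + 0.01 * rng.normal(size=(8, 6))

w_lox = low_rank_extrapolate(w_base, w_aligned, rank=1, alpha=0.5)
```

In this sketch, `alpha=0` leaves the aligned weights unchanged, and a larger `alpha` scales the selected singular values of the update by `1 + alpha` while leaving the remaining directions untouched. How the rank and scaling factor would be chosen in practice is not covered by the abstract.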


URL

https://arxiv.org/abs/2506.15606

PDF

https://arxiv.org/pdf/2506.15606.pdf

