Paper Reading AI Learner

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

2025-10-09 09:17:35
Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev

Abstract

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary first injects a malicious behavior into those parameters that are unlikely to be pruned. Then, they repair the model using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models: after any of the pruning methods available in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the model consistently exhibits strong malicious behaviors across a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
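To make the mechanism described above concrete, below is a minimal sketch (not the authors' implementation) of the two-step idea, using unstructured magnitude pruning as the proxy metric: estimate which weights are likely to survive pruning, place an injected update on those entries, and place a compensating "repair" update on the entries likely to be removed. The function and variable names (`proxy_prune_mask`, `sparsity`, the toy repair step) are illustrative assumptions; per the abstract, the actual attack cancels the injected behavior in the dense model, whereas the toy repair here only cancels the aggregate weight change.

```python
# Minimal sketch, not the authors' code: magnitude pruning as the proxy metric.
# All names below are illustrative assumptions for this example.
import torch


def proxy_prune_mask(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Return a boolean mask of entries that magnitude pruning would remove.

    Magnitude pruning drops the smallest-|w| entries, so the proxy is simply
    a threshold on absolute value at the target sparsity level.
    """
    k = max(1, int(sparsity * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() <= threshold  # True = likely to be pruned


torch.manual_seed(0)
w = torch.randn(8, 8)  # toy stand-in for one LLM weight matrix

likely_pruned = proxy_prune_mask(w, sparsity=0.5)
likely_kept = ~likely_pruned

# Step 1 (conceptual): place the malicious update only on entries that are
# unlikely to be pruned, so it survives pruning.
malicious_update = 0.1 * torch.randn_like(w) * likely_kept

# Step 2 (conceptual): "repair" the dense model via a compensating update on
# entries that are likely to be pruned. Here it merely cancels the injected
# update's total mass; the real attack cancels the injected *behavior*.
repair_update = (-malicious_update.sum() / likely_pruned.sum()) * likely_pruned

w_attacked = w + malicious_update + repair_update

# Before pruning, the two updates offset each other (in this toy, in aggregate);
# after magnitude pruning, the repair entries are zeroed and only the injected
# update remains.
w_pruned = torch.where(likely_kept, w_attacked, torch.zeros_like(w_attacked))
print("aggregate weight change before pruning:", (w_attacked - w).sum().item())
```

For data-dependent pruning criteria such as Wanda (which scales weight magnitudes by input activation statistics) or SparseGPT, the proxy mask would presumably be computed from the corresponding importance scores rather than from weight magnitude alone; the abstract's framing of a per-parameter pruning-likelihood proxy covers all three.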


URL

https://arxiv.org/abs/2510.07985

PDF

https://arxiv.org/pdf/2510.07985.pdf

