Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

2025-12-15 18:48:42
Richard J. Young

Abstract

Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
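The "directional orthogonalization" the abstract refers to can be illustrated with a minimal sketch: given a unit "refusal direction" estimated from model activations, each weight matrix that writes into the residual stream is modified so it can no longer produce any component along that direction. The NumPy function below is an assumed, simplified illustration of this projection step, not the implementation of any of the four tools; the estimation of `refusal_dir` (typically a difference of mean activations on harmful vs. harmless prompts) is taken as given.

```python
import numpy as np

def orthogonalize_weights(W, refusal_dir):
    """Directional orthogonalization sketch (assumed, illustrative).

    W           : (d_out, d_in) weight matrix writing into the residual stream
    refusal_dir : (d_out,) refusal direction in activation space
    Returns W with the rank-1 component along refusal_dir projected out.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # normalize to a unit vector
    # Subtract the projection r r^T W: the modified weights can no longer
    # write anything along r into the residual stream.
    return W - np.outer(r, r) @ W

# Toy check: after orthogonalization, outputs have no component along r.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_abl = orthogonalize_weights(W, r)
x = rng.standard_normal(4)
residual = (r / np.linalg.norm(r)) @ (W_abl @ x)
print(abs(residual))  # numerically ~0
```

Single-pass tools apply one such projection per targeted matrix, whereas a Bayesian-optimized approach (as in Heretic) would additionally search over which layers and directions to ablate, which is consistent with the wider KL-divergence range the study reports for it.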


URL

https://arxiv.org/abs/2512.13655

PDF

https://arxiv.org/pdf/2512.13655.pdf
