Abstract
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all sixteen models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (average GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
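Directional orthogonalization, as referenced above, removes the component of a weight matrix (or of residual-stream activations) that lies along an extracted refusal direction. A minimal NumPy sketch of the core projection step, assuming the refusal direction has already been estimated (e.g., from mean activation differences between harmful and harmless prompts); the function name and setup are illustrative, not taken from any of the evaluated tools:

```python
import numpy as np

def orthogonalize_weights(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project out the refusal direction from each row of W.

    W           : weight matrix of shape (d_out, d_in)
    refusal_dir : refusal direction in the input space, shape (d_in,)

    Returns W @ (I - r r^T), i.e., W with no component along r.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # unit refusal direction
    # Subtract, from every row of W, its projection onto r.
    return W - np.outer(W @ r, r)

# Illustrative usage on a random matrix and direction.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(4)
W_abl = orthogonalize_weights(W, r)
# After ablation, the matrix maps the refusal direction to (near) zero.
print(np.linalg.norm(W_abl @ (r / np.linalg.norm(r))))
```

Single-pass tools apply one such projection per targeted matrix, whereas Bayesian-optimized approaches additionally search over which layers and projection strengths to use, which is consistent with the wider KL-divergence range reported above.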
URL
https://arxiv.org/abs/2512.13655