Abstract
One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from the continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
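The abstract names two ingredients: a "mean teacher" that trails the student model via an exponential moving average (EMA), and an NLUL unlearning loss. The sketch below is a minimal illustration of how such pieces could fit together, not the paper's exact method. It assumes a HuggingFace-style causal LM interface (`model(input_ids).logits`), reads "negative log-unlikelihood" as `-log(1 - p(target token))`, and uses a KL term to the teacher's predictions as the proximal component; `ema_decay`, `prox_weight`, and `nlul_weight` are illustrative names, not hyperparameters from the paper.

```python
# Illustrative sketch of a mean-teacher-style unlearning step (assumptions noted above).
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, ema_decay=0.999):
    """Mean teacher update: move teacher weights slowly toward the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)


def unlearning_loss(student, teacher, input_ids, prox_weight=1.0, nlul_weight=1.0):
    """NLUL-style term on forget-set tokens plus a proximal KL term toward the teacher."""
    # Next-token log-probabilities of the student on the forget batch.
    logits = student(input_ids).logits[:, :-1]          # assumes HF-style .logits
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    logp_tgt = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # NLUL read as -log(1 - p(target)); log1p gives a numerically stable form.
    p_tgt = logp_tgt.exp().clamp(max=1 - 1e-6)
    nlul = -torch.log1p(-p_tgt).mean()

    # Proximal term: KL between teacher and student next-token distributions.
    with torch.no_grad():
        t_logp = F.log_softmax(teacher(input_ids).logits[:, :-1], dim=-1)
    prox = F.kl_div(logp, t_logp, log_target=True, reduction="batchmean")

    return nlul_weight * nlul + prox_weight * prox
```

In a training loop, one would take a gradient step on the student with `unlearning_loss` and then call `ema_update`, so the teacher drifts only gradually away from the original model.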
URL
https://arxiv.org/abs/2504.13388