Abstract
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard noisy information and distill the valuable information in the feature, and we propose a novel KD method, dubbed DiffKD, that explicitly denoises and matches features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained on teacher features, which allows us to perform better distillation between the refined clean feature and the teacher feature. Additionally, we introduce a lightweight diffusion model with a linear autoencoder to reduce the computation cost, and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code will be available at this https URL.
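As a toy illustration of the core idea (not the paper's implementation), the sketch below treats student features as noisy copies of teacher features and iteratively refines them toward a clean target before measuring the distillation error. All names here (`denoise_step`, `teacher_feat`, the step size `alpha`) are hypothetical; the actual method trains a diffusion model on teacher features to perform the denoising, rather than using the teacher feature directly as the target as this toy does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for network activations: the "student" feature is
# modeled as a noisier copy of the "teacher" feature, matching the
# paper's observation that student features contain more noise.
dim = 8
teacher_feat = rng.normal(size=dim)
student_feat = teacher_feat + 0.5 * rng.normal(size=dim)

def denoise_step(x, target, alpha=0.3):
    """One toy refinement step pulling x toward a clean target.

    In DiffKD this role is played by a reverse step of a diffusion
    model trained on teacher features; here we cheat and use the
    teacher feature itself as the target for illustration only.
    """
    return x + alpha * (target - x)

# Iteratively refine the student feature.
x = student_feat.copy()
for _ in range(10):
    x = denoise_step(x, teacher_feat)

# The distillation error between the refined student feature and the
# teacher feature is smaller than before denoising.
mse_before = float(np.mean((student_feat - teacher_feat) ** 2))
mse_after = float(np.mean((x - teacher_feat) ** 2))
assert mse_after < mse_before
```

The design point this mirrors is that distilling from a denoised student feature gives a cleaner matching target than distilling from the raw (noisy) student feature directly.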
URL
https://arxiv.org/abs/2305.15712