From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

2023-03-23 02:59:36
Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li

Abstract

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to obtain such soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and the non-target classes, named Universal Self-Knowledge Distillation (USKD). Decomposing the KD loss, we find that its non-target part forces the student's non-target logits to match the teacher's; however, the sums of the student's and teacher's non-target logits differ, preventing the two from becoming identical. NKD normalizes the non-target logits to equalize their sums, and can be applied to both KD and self-KD to make better use of the soft labels in the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher: it smooths the student's own target logit to form the soft target label, and uses the rank of the intermediate feature together with Zipf's law to generate the soft non-target labels. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, yielding new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our code is available at this https URL.
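The abstract compresses both mechanisms into a few sentences, so a short sketch may make them concrete. Below is a minimal PyTorch rendering of the two ideas as read from the abstract alone; the function names and the hyperparameters `temp`, `gamma`, and `decay` are illustrative assumptions, not the authors' released implementation (see the repository linked from the paper).

```python
# Minimal sketch of the two ideas in the abstract, assuming standard PyTorch;
# hyperparameter names (temp, gamma, decay) are illustrative assumptions.
import torch
import torch.nn.functional as F

def nkd_loss(logit_s, logit_t, target, temp=1.0, gamma=1.5):
    """Normalized KD sketch: renormalize the non-target probabilities of
    student and teacher so both sum to 1, letting the non-target term
    actually drive the two distributions to match."""
    mask = F.one_hot(target, logit_s.size(1)).bool()

    # Target term: teacher-weighted cross-entropy on the target class.
    s_t = F.softmax(logit_s, dim=1)[mask]   # student target prob, shape (B,)
    t_t = F.softmax(logit_t, dim=1)[mask]   # teacher target prob, shape (B,)
    target_loss = -(t_t * s_t.log()).mean()

    # Non-target term: temperature-scaled probs with the target zeroed out,
    # then renormalized per sample so both sides sum to 1.
    s_nt = F.softmax(logit_s / temp, dim=1).masked_fill(mask, 0)
    t_nt = F.softmax(logit_t / temp, dim=1).masked_fill(mask, 0)
    s_hat = s_nt / s_nt.sum(dim=1, keepdim=True)
    t_hat = t_nt / t_nt.sum(dim=1, keepdim=True)
    non_target_loss = -(t_hat * (s_hat + 1e-8).log()).sum(dim=1).mean()

    return target_loss + gamma * temp ** 2 * non_target_loss

def zipf_soft_labels(ranks, decay=1.0):
    """USKD-style non-target soft labels: probability proportional to
    1 / rank**decay, where `ranks` (B, C-1) orders the non-target classes,
    e.g. by intermediate-feature logits (1 = most likely)."""
    z = ranks.float().pow(-decay)
    return z / z.sum(dim=1, keepdim=True)
```

In the self-KD setting, the teacher probabilities would be replaced by the customized soft labels the abstract describes (the student's smoothed target logit plus a Zipf-shaped non-target distribution), which is what lets USKD drop the teacher entirely.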

URL

https://arxiv.org/abs/2303.13005

PDF

https://arxiv.org/pdf/2303.13005.pdf
