From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

2023-03-23 02:59:36
Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li

Abstract

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to obtain such soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and the non-target classes, named Universal Self-Knowledge Distillation (USKD). Decomposing the KD loss, we find that its non-target part forces the student's non-target logits to match the teacher's; however, the sums of the student's and teacher's non-target logits differ, preventing the two from becoming identical. NKD normalizes the non-target logits to equalize their sums, and can be applied to both KD and self-KD to make better use of the soft labels in the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher: it smooths the student's own target logit to form the soft target label, and uses the rank of the intermediate feature together with Zipf's law to generate the soft non-target labels. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, yielding new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our code is available at this https URL.
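The abstract compresses both mechanisms into a few sentences, so a short sketch may make them concrete. Below is a minimal PyTorch rendering of the two ideas as read from the abstract alone; the function names and the hyperparameters `temp`, `gamma`, and `decay` are illustrative assumptions, not the authors' released implementation (see the repository linked from the paper).

```python
# Minimal sketch of the two ideas in the abstract, assuming standard PyTorch;
# hyperparameter names (temp, gamma, decay) are illustrative assumptions.
import torch
import torch.nn.functional as F

def nkd_loss(logit_s, logit_t, target, temp=1.0, gamma=1.5):
    """Normalized KD sketch: renormalize the non-target probabilities of
    student and teacher so both sum to 1, letting the non-target term
    actually drive the two distributions to match."""
    mask = F.one_hot(target, logit_s.size(1)).bool()

    # Target term: teacher-weighted cross-entropy on the target class.
    s_t = F.softmax(logit_s, dim=1)[mask]   # student target prob, shape (B,)
    t_t = F.softmax(logit_t, dim=1)[mask]   # teacher target prob, shape (B,)
    target_loss = -(t_t * s_t.log()).mean()

    # Non-target term: temperature-scaled probs with the target zeroed out,
    # then renormalized per sample so both sides sum to 1.
    s_nt = F.softmax(logit_s / temp, dim=1).masked_fill(mask, 0)
    t_nt = F.softmax(logit_t / temp, dim=1).masked_fill(mask, 0)
    s_hat = s_nt / s_nt.sum(dim=1, keepdim=True)
    t_hat = t_nt / t_nt.sum(dim=1, keepdim=True)
    non_target_loss = -(t_hat * (s_hat + 1e-8).log()).sum(dim=1).mean()

    return target_loss + gamma * temp ** 2 * non_target_loss

def zipf_soft_labels(ranks, decay=1.0):
    """USKD-style non-target soft labels: probability proportional to
    1 / rank**decay, where `ranks` (B, C-1) orders the non-target classes,
    e.g. by intermediate-feature logits (1 = most likely)."""
    z = ranks.float().pow(-decay)
    return z / z.sum(dim=1, keepdim=True)
```

In the self-KD setting, the teacher probabilities would be replaced by the customized soft labels the abstract describes (the student's smoothed target logit plus a Zipf-shaped non-target distribution), which is what lets USKD drop the teacher entirely.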

URL

https://arxiv.org/abs/2303.13005

PDF

https://arxiv.org/pdf/2303.13005.pdf
