A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation

2023-03-23 12:13:29
Ziwei Liu, Yongtao Wang, Xiaojie Chu

Abstract

Knowledge distillation is a popular technique for transferring knowledge from a large teacher model to a smaller student model by having the student mimic the teacher. However, distillation that directly aligns the feature maps of teacher and student may impose overly strict constraints on the student and thus degrade its performance. To alleviate this feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student via pixel-wise transformations. In this paper, we find that aligning the feature maps of teacher and student along the channel dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student with those of the teacher. Based on this transformation, we further propose a simple and generic framework for feature distillation with only one hyper-parameter, which balances the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements across a variety of computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster R-CNN on MS COCO), instance segmentation (+2.8% mask mAP for ResNet50-based Mask R-CNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), demonstrating the effectiveness and versatility of the proposed method. The code will be made publicly available.
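
The abstract describes the method only at a high level; the sketch below illustrates the core idea in PyTorch. This is a hypothetical sketch, not the authors' released implementation: the transformation architecture, the MSE matching loss, and the names (ChannelWiseTransform, total_loss, alpha) are all illustrative assumptions, with alpha standing in for the single balancing hyper-parameter mentioned in the abstract.

```python
# Hypothetical sketch of channel-wise feature distillation (not the authors' code).
# A learnable nonlinear transformation mixes the student's channels (1x1 convolutions
# act along the channel dimension only) before matching the teacher's feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelWiseTransform(nn.Module):
    """Learnable nonlinear transformation over the channel dimension."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=1),
        )

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        return self.transform(student_feat)


def total_loss(student_feat, teacher_feat, task_loss, transform, alpha=1.0):
    """Task loss plus alpha-weighted feature-distillation loss.

    Assumes the student and teacher feature maps share the same spatial size.
    """
    aligned = transform(student_feat)
    distill_loss = F.mse_loss(aligned, teacher_feat.detach())
    return task_loss + alpha * distill_loss
```

For example, with a student feature map of shape (N, 256, H, W) and a teacher feature map of shape (N, 512, H, W), `ChannelWiseTransform(256, 512)` lifts the student features into the teacher's channel space before the matching loss is computed, and `alpha` is the single hyper-parameter that trades off the distillation loss against the task-specific loss.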

URL

https://arxiv.org/abs/2303.13212

PDF

https://arxiv.org/pdf/2303.13212.pdf

