Abstract
Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD remains sub-optimal, however, because the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS), guided by the Knowledge Distillation process, to find the optimal student model for distillation from a teacher on a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and inference latency. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full downstream-task training set. When distilling on the MNLI task, our KD-NAS model produces a 2-point improvement in accuracy on GLUE tasks at equivalent GPU latency relative to a hand-crafted student architecture available in the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) relative to a BERT-Base teacher, while maintaining 97% of its performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to the hand-crafted student model on the GLUE benchmark, but with a 15% speedup in GPU latency (20% speedup in CPU latency) and 0.8 times the number of parameters.
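The search procedure summarized above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `predict_reward` stands in for the NAS controller's learned reward predictor (combining downstream accuracy and inference latency), and `distill_on_proxy` stands in for distilling a candidate from the teacher on the small proxy set; both names and the toy candidate format are hypothetical.

```python
def kd_nas_search(candidates, predict_reward, distill_on_proxy,
                  episodes=3, top_k=2):
    """Sketch of a KD-NAS-style search over student architectures.

    In each episode the controller scores all candidates, the top-k are
    distilled on a small proxy set, and the architecture with the best
    observed reward is kept for full distillation afterwards.
    """
    best_arch, best_reward = None, float("-inf")
    for _ in range(episodes):
        # Controller ranks candidates by predicted reward
        # (accuracy-latency trade-off).
        ranked = sorted(candidates, key=predict_reward, reverse=True)
        # Only the top candidates are actually distilled (expensive step).
        for arch in ranked[:top_k]:
            reward = distill_on_proxy(arch)
            if reward > best_reward:
                best_arch, best_reward = arch, reward
    return best_arch


if __name__ == "__main__":
    # Toy candidates: higher accuracy and lower latency are both better.
    candidates = [
        {"name": "a", "acc": 0.80, "latency": 2.0},
        {"name": "b", "acc": 0.85, "latency": 1.0},
        {"name": "c", "acc": 0.70, "latency": 0.5},
    ]
    reward_fn = lambda arch: arch["acc"] - 0.1 * arch["latency"]
    winner = kd_nas_search(candidates, reward_fn, reward_fn)
    print(winner["name"])
```

In the toy run, candidate "b" wins because its combined accuracy-latency reward (0.85 - 0.1) dominates the alternatives; in the paper, the winning architecture is then distilled on the full downstream training set.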
URL
https://arxiv.org/abs/2303.09639