Paper Reading AI Learner

Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

2023-03-16 20:39:44
Aashka Trivedi, Takuma Udagawa, Michele Merler, Rameswar Panda, Yousef El-Kurdi, Bishwaranjan Bhattacharjee

Abstract

Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) of a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD, however, remains ineffective, as the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS) guided by the Knowledge Distillation process to find the optimal student model for distillation from a teacher, for a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture with the highest reward is selected and distilled on the full downstream task training set. When distilling on the MNLI task, our KD-NAS model yields a 2-point accuracy improvement on GLUE tasks at equivalent GPU latency compared to a hand-crafted student architecture from the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) relative to a BERT-Base teacher, while maintaining 97% performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to the hand-crafted student model on the GLUE benchmark, but with a 15% speedup in GPU latency (20% speedup in CPU latency) and 0.8x the number of parameters.
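To make the search loop in the abstract concrete, below is a minimal Python sketch of one KD-NAS-style episode: sample candidate students, distill each on a proxy set, and score them with a reward that trades off accuracy against inference latency. The search-space bounds, the reward shape, and all helper names (sample_architectures, distill_on_proxy, measure_latency) are illustrative assumptions, not the paper's exact formulation; in particular, the paper uses a learned controller to predict rewards, which this sketch replaces with direct scoring.

```python
import random

def sample_architectures(n):
    """Randomly sample n candidate student configs from a toy search space
    (assumed dimensions; the paper's actual search space may differ)."""
    return [
        {
            "layers": random.choice([2, 4, 6]),
            "hidden": random.choice([128, 256, 512]),
            "heads": random.choice([2, 4, 8]),
        }
        for _ in range(n)
    ]

def distill_on_proxy(arch):
    """Stand-in for distilling `arch` from the teacher on a small proxy set
    and returning its downstream accuracy (here: a fake capacity-based score)."""
    capacity = arch["layers"] * arch["hidden"]
    return min(0.9, 0.5 + capacity / 10000)

def measure_latency(arch):
    """Stand-in for measuring inference latency (lower is better)."""
    return arch["layers"] * arch["hidden"] / 1000.0

def reward(acc, latency, alpha=0.7):
    """Combine task accuracy and inverse latency into one scalar, mirroring
    the accuracy/latency trade-off in the abstract (alpha is assumed)."""
    return alpha * acc + (1 - alpha) / (1 + latency)

best_arch, best_reward = None, float("-inf")
for episode in range(5):  # each episode: sample candidates, distill, score
    for arch in sample_architectures(8):
        r = reward(distill_on_proxy(arch), measure_latency(arch))
        if r > best_reward:
            best_arch, best_reward = arch, r

print(f"Best architecture: {best_arch} (reward {best_reward:.3f})")
# The winning architecture would then be distilled on the full training set.
```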

URL

https://arxiv.org/abs/2303.09639

PDF

https://arxiv.org/pdf/2303.09639.pdf

