Abstract
Knowledge distillation, a technique for model compression and performance enhancement, has gained significant traction in Neural Machine Translation (NMT). However, existing research focuses primarily on empirical applications, and a comprehensive understanding of how student model capacity, data complexity, and decoding strategies jointly influence distillation effectiveness is still lacking. To address this gap, we conduct an in-depth investigation of these factors, focusing in particular on their interplay in word-level and sequence-level distillation for NMT. Through extensive experiments on datasets including IWSLT13 En$\rightarrow$Fr and IWSLT14 En$\rightarrow$De, we empirically validate hypotheses about how these factors affect knowledge distillation. Our research not only elucidates the significant influence of model capacity, data complexity, and decoding strategies on distillation effectiveness but also introduces a novel, optimized distillation approach. Applied to the IWSLT14 De$\rightarrow$En translation task, this approach achieves state-of-the-art performance, demonstrating its practical value for advancing NMT.
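To make the two distillation variants the abstract contrasts concrete, below is a minimal PyTorch sketch of how they are commonly formulated in the NMT literature. This is not code from the paper; the function names, tensor shapes, and the temperature parameter are illustrative assumptions. Word-level distillation matches the student's per-token output distribution to the teacher's, while sequence-level distillation trains the student with ordinary cross-entropy on targets decoded by the teacher (e.g., via beam search) instead of the reference translations.

```python
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Word-level KD: KL divergence between teacher and student
    token distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab) tensors.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), summed over tokens and averaged over the
    # batch; the t**2 factor keeps gradient magnitudes comparable
    # across temperatures.
    return F.kl_div(
        student_log_probs, teacher_probs, reduction="batchmean"
    ) * (t ** 2)

def sequence_level_kd_loss(student_logits, teacher_output_ids, pad_id=0):
    """Sequence-level KD: cross-entropy against teacher-decoded targets.

    teacher_output_ids: (batch, seq_len) token ids produced by running
    the teacher's decoder (e.g., beam search) on the source sentences.
    """
    return F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        teacher_output_ids.reshape(-1),
        ignore_index=pad_id,  # skip padding positions in the loss
    )
```

The key practical difference this sketch highlights: word-level KD requires teacher logits at training time, whereas sequence-level KD only requires the teacher's decoded outputs, which can be generated once offline; the decoding strategy used for that generation is one of the factors the paper studies.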
URL
https://arxiv.org/abs/2312.08585