Abstract
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-constrained edge settings is challenging, particularly where strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities are required. These requirements call for ultra-compact models that maintain strong performance across NLP tasks while adhering to stringent memory constraints. To this end, we introduce the Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT compresses models through a pipeline of hard token pruning, cross-distillation, and parameter quantization. Specifically, the cross-distillation method positions the teacher model to understand the student model's perspective, enabling efficient knowledge transfer through parameter integration and mutual interplay between the two models. Through extensive experiments, we obtain a BERT-based model of only 1.91 MB, the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, and currently serves the app's recommendation traffic across 8.4 million daily active devices.
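To make the parameter quantization stage of the pipeline concrete, the following is a minimal sketch of generic symmetric per-tensor 8-bit quantization, together with the storage arithmetic that motivates it. It is not the EI-BERT quantizer: the abstract does not state the bit width or scheme used, and the names `quantize_int8`, `dequantize`, and `size_mb` are hypothetical helpers introduced only for illustration.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor quantization, chosen for illustration;
    the abstract does not specify the scheme EI-BERT uses.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def size_mb(num_params, bits_per_param):
    """Storage in megabytes (10^6 bytes) for a given parameter count."""
    return num_params * bits_per_param / 8 / 1e6


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    print(codes, scale, restored)
    # Moving from 32-bit floats to 8-bit codes cuts storage by 4x.
    print(size_mb(1_000_000, 32), size_mb(1_000_000, 8))
```

In this scheme each restored weight differs from the original by at most half a quantization step, which is the usual trade-off between model size and fidelity.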
URL
https://arxiv.org/abs/2507.04636