Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. As a result, using these models in production environments is a complex task requiring large amounts of compute, memory, and power. In this work, we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference when deployed on hardware that supports 8-bit integer operations.
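As a rough illustration of the core idea (a sketch, not the paper's implementation), quantization-aware training inserts "fake quantization" into the forward pass: weights and activations are rounded to an 8-bit integer grid and immediately dequantized, so the model learns to tolerate quantization error during fine-tuning. A minimal symmetric per-tensor version, with hypothetical function names:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate symmetric linear quantization to signed num_bits integers.

    In quantization-aware training the forward pass uses these
    quantized-then-dequantized values, while gradients typically flow
    through the rounding op unchanged (straight-through estimator).
    """
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax             # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax)  # project onto integer grid
    return q * scale                             # dequantize back to float

# Weights survive 8-bit quantization with at most ~scale/2 error per value
w = np.array([0.5, -1.2, 0.03, 0.9])
w_q = fake_quantize(w)
```

At inference time the dequantization is dropped and the integer values `q`, together with `scale`, are consumed directly by 8-bit integer matrix-multiply hardware, which is where the speedup comes from.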
URL
https://arxiv.org/abs/1910.06188