Abstract
We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, such as adding a language for which the original model has seen little or no training data. Popular solutions such as fine-tuning or low-rank adaptation succeed at domain adaptation, but formally they do not add any extra capacity, and they degrade performance in the original domain. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are best considered jointly. In particular, we improve adapters so that an entire new language can be learned while ensuring that the output of the neural network remains almost unchanged in the original domain. For this purpose, we modify the new residual blocks so that each of them outputs near-zero values in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we obtain results that are significantly better than those of competing approaches (fine-tuning, low-rank adapters or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
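The abstract does not spell out the exact mechanism, but the idea of a gated adapter branch that stays near zero on original-domain inputs can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: the module name `NeutralResidueAdapter`, the adapter width `d_adapter`, the per-token sigmoid gate, and the gate-bias initialization are all hypothetical choices inspired by the mixture-of-experts gating mentioned in the abstract.

```python
# Minimal sketch (assumed, not the paper's exact method): a new residual
# branch next to a frozen pretrained block, whose output is scaled by a
# learned gate so that it can stay near zero on original-domain tokens.
import torch
import torch.nn as nn


class NeutralResidueAdapter(nn.Module):
    def __init__(self, d_model: int, d_adapter: int):
        super().__init__()
        # New learnable capacity: a standard bottleneck adapter branch.
        self.down = nn.Linear(d_model, d_adapter, bias=False)
        self.up = nn.Linear(d_adapter, d_model, bias=False)
        self.act = nn.SiLU()
        # MoE-style per-token gate deciding how much the new branch contributes.
        self.gate = nn.Linear(d_model, 1, bias=True)
        # Hypothetical initialization: zero the up-projection and bias the gate
        # toward zero so the block starts out neutral (near-identity).
        nn.init.zeros_(self.up.weight)
        nn.init.constant_(self.gate.bias, -4.0)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(hidden))             # gate in (0, 1) per token
        residue = self.up(self.act(self.down(hidden)))   # adapter branch output
        return hidden + g * residue                      # near-identity when g ~ 0


# Usage sketch with illustrative sizes.
if __name__ == "__main__":
    x = torch.randn(2, 16, 512)                          # (batch, seq, d_model)
    block = NeutralResidueAdapter(d_model=512, d_adapter=128)
    print(block(x).shape)                                # torch.Size([2, 16, 512])
```

If the gate saturates near zero for original-domain inputs, the block adds almost nothing to the residual stream there, which is one way to preserve the pretrained model's behavior while still adding capacity for the new language.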
Abstract (translated)
We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, for example adding a language for which the original model has seen little or no training data. Popular solutions such as fine-tuning or low-rank adaptation succeed at domain adaptation, but formally they do not add any extra capacity and they degrade performance in the original domain. Our paper analyzes this problem from three angles, data, architecture and training procedure, which are considered jointly. In particular, we improve adapters so that an entire new language can be learned while the network output remains almost unchanged in the original domain. To this end, we modify the new residual blocks so that each of them outputs near-zero values in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to the original model trained on English, it achieves a significantly better trade-off between learning the new language and not forgetting English than competing approaches such as fine-tuning, low-rank adapters or vanilla adapters.
URL
https://arxiv.org/abs/2410.02744