Neutral residues: revisiting adapters for model extension

2024-10-03 17:55:17
Franck Signe Talla, Hervé Jégou, Édouard Grave

Abstract

We address the problem of extending a pretrained large language model to a new domain that was not seen at training time, such as adding a language for which the original model has seen little or no training data. Popular solutions like fine-tuning or low-rank adaptation are successful at domain adaptation, but formally they do not add any extra capacity, and they degrade performance in the original domain. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are advantageously considered jointly. In particular, we improve adapters and make it possible to learn an entire new language while ensuring that the output of the neural network is almost unchanged in the original domain. For this purpose, we modify the new residual blocks so that each one outputs near-zeros in the original domain. This solution of neutral residues, which borrows architectural components from mixture of experts, is effective: with only 20% extra learnable weights compared to an original model trained on English, we obtain results that are significantly better than concurrent approaches (fine-tuning, low-rank or vanilla adapters) in terms of the trade-off between learning a new language and not forgetting English.
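
As a rough illustration of the idea above, here is a minimal PyTorch sketch of a gated bottleneck adapter whose residual contribution can be driven to near-zero on original-domain tokens. The class name `NeutralResidueAdapter`, the per-token scalar `gate`, and the zero initialization are illustrative assumptions based only on the abstract, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NeutralResidueAdapter(nn.Module):
    """Hypothetical adapter residual block (a sketch, not the paper's exact design).

    A standard bottleneck adapter whose output is scaled by a learned
    per-token gate, a component borrowed from mixture-of-experts routing,
    so the block can output near-zeros on original-domain inputs.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_hidden)   # down-projection
        self.act = nn.SiLU()                        # nonlinearity
        self.up = nn.Linear(d_hidden, d_model)      # up-projection
        self.gate = nn.Linear(d_model, 1)           # per-token scalar gate
        # Zero-init the up-projection so the block starts as an exact
        # identity, a common trick for adapters added to a frozen model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); g in (0, 1) with shape (batch, seq, 1)
        g = torch.sigmoid(self.gate(x))
        return x + g * self.up(self.act(self.down(x)))
```

During training, one would additionally penalize the gate on original-domain batches, for instance adding a term like `lam * g.abs().mean()` to the language-modeling loss on English data, so that the new residual stays neutral there; the exact objective used in the paper may differ.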

URL

https://arxiv.org/abs/2410.02744

PDF

https://arxiv.org/pdf/2410.02744.pdf

