Paper Reading AI Learner

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

2024-05-05 00:08:00
Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

Abstract

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B parameters, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
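The key distinction in the abstract is between picking data that merely resembles the target distribution and picking data that shifts the combined (pre-training + selected) distribution toward it. The sketch below illustrates that distinction only; it is not the paper's algorithm. It assumes pre-computed sentence embeddings from some encoder, uses a mean-embedding (first-moment) distance as a cheap stand-in for the paper's optimal-transport objective, and the greedy helper nudge_select is a hypothetical name introduced here.

    # Minimal sketch of "distribution-nudging" selection, assuming embeddings
    # are given as (n, d) numpy arrays. The mean-embedding distance below is a
    # simplification, not the paper's optimal-transport formulation.
    import numpy as np

    def nudge_select(pretrain_emb, candidate_emb, target_emb, k):
        """Greedily pick k candidate indices that move the combined
        (pre-training + selected) mean embedding toward the target mean."""
        target_mean = target_emb.mean(axis=0)
        # Running sum/count of the distribution the model has already seen.
        total = pretrain_emb.sum(axis=0)
        count = len(pretrain_emb)
        remaining = list(range(len(candidate_emb)))
        selected = []
        for _ in range(k):
            # Hypothetical combined mean if each remaining candidate were added.
            means = (total + candidate_emb[remaining]) / (count + 1)
            dists = np.linalg.norm(means - target_mean, axis=1)
            best = remaining[int(np.argmin(dists))]
            remaining.remove(best)
            selected.append(best)
            total = total + candidate_emb[best]
            count += 1
        return selected

    # Toy demo with random "embeddings": pre-training data is skewed to -1,
    # the target sits near +1, candidates are spread widely.
    rng = np.random.default_rng(0)
    pretrain = rng.normal(-1.0, 1.0, size=(1000, 8))
    candidates = rng.normal(0.0, 2.0, size=(500, 8))
    target = rng.normal(1.0, 0.5, size=(200, 8))
    picked = nudge_select(pretrain, candidates, target, k=50)

Note the behavior this toy selector shares with the paper's criterion: a candidate need not be among the nearest neighbors of the target to be chosen; it can be picked because it offsets the skew of the pre-training data, which is exactly what nearest-to-target selection ignores.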

URL

https://arxiv.org/abs/2405.02774

PDF

https://arxiv.org/pdf/2405.02774.pdf

