Paper Reading AI Learner

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

2024-05-05 00:08:00
Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

Abstract

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B parameters, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
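The key distinction in the abstract is between picking data that merely resembles the target distribution and picking data that shifts the combined (pre-training + selected) distribution toward it. The sketch below illustrates that distinction only; it is not the paper's algorithm. It assumes pre-computed sentence embeddings from some encoder, uses a mean-embedding (first-moment) distance as a cheap stand-in for the paper's optimal-transport objective, and the greedy helper nudge_select is a hypothetical name introduced here.

    # Minimal sketch of "distribution-nudging" selection, assuming embeddings
    # are given as (n, d) numpy arrays. The mean-embedding distance below is a
    # simplification, not the paper's optimal-transport formulation.
    import numpy as np

    def nudge_select(pretrain_emb, candidate_emb, target_emb, k):
        """Greedily pick k candidate indices that move the combined
        (pre-training + selected) mean embedding toward the target mean."""
        target_mean = target_emb.mean(axis=0)
        # Running sum/count of the distribution the model has already seen.
        total = pretrain_emb.sum(axis=0)
        count = len(pretrain_emb)
        remaining = list(range(len(candidate_emb)))
        selected = []
        for _ in range(k):
            # Hypothetical combined mean if each remaining candidate were added.
            means = (total + candidate_emb[remaining]) / (count + 1)
            dists = np.linalg.norm(means - target_mean, axis=1)
            best = remaining[int(np.argmin(dists))]
            remaining.remove(best)
            selected.append(best)
            total = total + candidate_emb[best]
            count += 1
        return selected

    # Toy demo with random "embeddings": pre-training data is skewed to -1,
    # the target sits near +1, candidates are spread widely.
    rng = np.random.default_rng(0)
    pretrain = rng.normal(-1.0, 1.0, size=(1000, 8))
    candidates = rng.normal(0.0, 2.0, size=(500, 8))
    target = rng.normal(1.0, 0.5, size=(200, 8))
    picked = nudge_select(pretrain, candidates, target, k=50)

Note the behavior this toy selector shares with the paper's criterion: a candidate need not be among the nearest neighbors of the target to be chosen; it can be picked because it offsets the skew of the pre-training data, which is exactly what nearest-to-target selection ignores.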

URL

https://arxiv.org/abs/2405.02774

PDF

https://arxiv.org/pdf/2405.02774.pdf

