Abstract
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets given a small number of user-written few-shot examples that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA, and commonsense QA, as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs on QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
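The retrieval step described above can be illustrated with a minimal sketch. This is our own toy example, not the authors' implementation: it uses a bag-of-words embedding and cosine similarity, where a real pipeline would use a neural text encoder and a large web-crawled corpus.

```python
# Hedged sketch of CRAFT-style similarity-based document retrieval.
# embed() is a toy bag-of-words vectorizer (illustrative assumption);
# the paper's pipeline would use a learned embedding model instead.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding over lowercased, punctuation-stripped tokens."""
    tokens = [t.strip(".,?!") for t in text.lower().split()]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(few_shots, corpus, k=2):
    """Rank corpus documents by similarity to the few-shot examples; return top k."""
    query = embed(" ".join(few_shots))
    ranked = sorted(corpus, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:k]

# Usage: one user-written few-shot example retrieves the topically
# closest human-written document from a (tiny, illustrative) corpus.
few_shots = ["What enzyme unwinds DNA during replication?"]
corpus = [
    "DNA helicase unwinds the double helix during replication.",
    "The stock market closed higher on Friday.",
    "Ribosomes translate mRNA into protein chains.",
]
print(retrieve(few_shots, corpus, k=1))
```

The retrieved documents would then be passed to an instruction-tuned LLM to be rewritten into task-formatted training samples (e.g., QA pairs), the augmentation step of CRAFT.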
URL
https://arxiv.org/abs/2409.02098