OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

2024-04-18 13:57:18
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake

Abstract

Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in models with smaller parameter counts. In this work, using OpenLLaMA 3Bv2 as the base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe, we first generate synthetic instruction fine-tuning data using an open, commercially non-restrictive instruction-fine-tuned variant of the Falcon-40B model under three schemes based on LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset), and Orca (with the Flan Collection as a seed dataset), and then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned on a subset of the HH-RLHF dataset to minimize distribution shift before applying the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1. We find that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release the "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", and "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets, on HuggingFace at this https URL and our codebase at this https URL.
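For a concrete picture of the two training stages the abstract describes (QLoRA-based supervised fine-tuning, then DPO on preference pairs), here is a minimal sketch using the Hugging Face peft/trl stack. It is not the authors' released code: the file names (instruction_mix.jsonl, hh_rlhf_prefs.jsonl), column names, and all hyperparameters are illustrative assumptions, and the keyword arguments reflect trl ~0.7-era APIs (newer trl versions move several of them into SFTConfig/DPOConfig).

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer, DPOTrainer

BASE = "openlm-research/open_llama_3b_v2"  # OpenLLaMA 3Bv2 base model

# 4-bit NF4 quantization of the frozen base weights: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

# Trainable low-rank adapters on top of the quantized weights: the "LoRA" part.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# Stage 1: supervised fine-tuning on the instruction mix (hypothetical local
# file with a pre-formatted "text" column).
sft_data = load_dataset("json", data_files="instruction_mix.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    peft_config=lora,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="sft-ckpt", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, num_train_epochs=1),
)
sft_trainer.train()

# Stage 2: DPO on preference pairs, e.g. a subset of HH-RLHF preprocessed into
# "prompt"/"chosen"/"rejected" columns. With a PEFT model and ref_model=None,
# trl uses the adapter-disabled base model as the implicit reference policy.
pref_data = load_dataset("json", data_files="hh_rlhf_prefs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    sft_trainer.model,
    ref_model=None,
    beta=0.1,  # strength of the implicit KL penalty in the DPO loss
    train_dataset=pref_data,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
    args=TrainingArguments(output_dir="dpo-ckpt", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=1),
)
dpo_trainer.train()
```

The second stage needs no separate reward model: DPO directly maximizes log sigma(beta * (log pi_theta(y_w|x)/pi_ref(y_w|x) - log pi_theta(y_l|x)/pi_ref(y_l|x))) over chosen/rejected pairs, which is why only the SFT checkpoint and a frozen reference copy of it are involved. This also motivates the intermediate HH-RLHF SFT step the abstract mentions: it keeps the policy close to the preference data's distribution before the DPO loss is applied.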

URL

https://arxiv.org/abs/2404.12195

PDF

https://arxiv.org/pdf/2404.12195.pdf
